Commit Graph

1414320 Commits

Author SHA1 Message Date
Eric Dumazet e4faaf65a7 ipv4: igmp: annotate data-races around idev->mr_maxdelay
idev->mr_maxdelay is read and written locklessly,
add READ_ONCE()/WRITE_ONCE() annotations.

While we are at it, make this field an u32.

Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: David Ahern <dsahern@kernel.org>
Link: https://patch.msgid.link/20260122172247.2429403-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 14:45:14 -08:00
Eric Dumazet 0541302227 ipvlan: remove ipvlan_ht_addr_lookup()
ipvlan_ht_addr_lookup() is called four times and not inlined.

Split it to ipvlan_ht_addr_lookup6() and ipvlan_ht_addr_lookup4()
and rework ipvlan_addr_lookup() to call these helpers once,
so that they are (auto)inlined.

After this change, ipvlan_addr_lookup() is faster, and we save
350 bytes of text on x86_64.

$ scripts/bloat-o-meter -t vmlinux.old vmlinux.new
add/remove: 0/2 grow/shrink: 1/0 up/down: 123/-473 (-350)
Function                                     old     new   delta
ipvlan_addr_lookup                           467     590    +123
__pfx_ipvlan_ht_addr_lookup                   16       -     -16
ipvlan_ht_addr_lookup                        457       -    -457
Total: Before=22571833, After=22571483, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: Mahesh Bandewar <maheshb@google.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://patch.msgid.link/20260122165049.2366985-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 14:35:19 -08:00
Eric Dumazet 37b0ea8fef net: expand NETDEV_RSS_KEY_LEN to 256 bytes
NETDEV_RSS_KEY_LEN has been set to 52 bytes in 2014, until now.

Jakub suggested we bump the size to 128 bytes or more.

Some drivers (like idpf) were already working around the core limit.

Since this change might cause some issues in admin scripts,
bump it directly to 256 in one go.

tjbp26:~# cat /proc/sys/net/core/netdev_rss_key | wc -c
768

tjbp26:~# ethtool -x eth1
RX flow hash indirection table for eth1 with 32 RX ring(s):
...
RSS hash key:
fe:16:5b:2f:93:85:c2:c9:c1:ef:bd:60:c6:e0:2b:99:4d:bf:b7:14:c8:1e:8d:cb:31:17:51:da:55:eb:91:d9:9e:f9:89:9b:44:a1:dc:08:72:3a:b3:d6:31:86:9a:fe:02:3a:0d:eb:a1:7c:f5:a3:51:3b:08:56:c9:3f:71:69:01:ba:70:38
RSS hash function:
    toeplitz: on
    xor: off
    crc32: off

Suggested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/netdev/20260122075206.504ec591@kernel.org/
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Link: https://patch.msgid.link/20260122190349.2771064-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:20:46 -08:00
Jakub Kicinski a113a8ac50 Merge branch 'net-few-critical-helpers-are-inlined-again'
Eric Dumazet says:

====================
net: few critical helpers are inlined again

Recent devmem additions increased stack depth. Some helpers
that were inlined in the past are now out-of-line.
====================

Link: https://patch.msgid.link/20260122045720.1221017-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:19:00 -08:00
Eric Dumazet df7388b3d7 net: inline get_netmem() and put_netmem()
These helpers are used in network fast paths.

Only call out-of-line helpers for netmem case.

We might consider inlining __get_netmem() and __put_netmem()
in the future.

$ scripts/bloat-o-meter -t vmlinux.3 vmlinux.4
add/remove: 6/6 grow/shrink: 22/1 up/down: 2614/-646 (1968)
Function                                     old     new   delta
pskb_carve                                  1669    1894    +225
gro_pull_from_frag0                            -     206    +206
get_page                                     190     380    +190
skb_segment                                 3561    3747    +186
put_page                                     595     765    +170
skb_copy_ubufs                              1683    1822    +139
__pskb_trim_head                             276     401    +125
__pskb_copy_fclone                           734     858    +124
skb_zerocopy                                1092    1215    +123
pskb_expand_head                             892    1008    +116
skb_split                                    828     940    +112
skb_release_data                             297     409    +112
___pskb_trim                                 829     941    +112
__skb_zcopy_downgrade_managed                120     226    +106
tcp_clone_payload                            530     634    +104
esp_ssg_unref                                191     294    +103
dev_gro_receive                             1464    1514     +50
__put_netmem                                   -      41     +41
__get_netmem                                   -      41     +41
skb_shift                                   1139    1175     +36
skb_try_coalesce                             681     714     +33
__pfx_put_page                               112     144     +32
__pfx_get_page                                32      64     +32
__pskb_pull_tail                            1137    1168     +31
veth_xdp_get                                 250     267     +17
__pfx_gro_pull_from_frag0                      -      16     +16
__pfx___put_netmem                             -      16     +16
__pfx___get_netmem                             -      16     +16
__pfx_put_netmem                              16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx_get_netmem                              16       -     -16
put_netmem                                   114       -    -114
get_netmem                                   130       -    -130
napi_gro_frags                               929     771    -158
gro_try_pull_from_frag0                      196       -    -196
Total: Before=22565857, After=22567825, chg +0.01%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-5-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:53 -08:00
Eric Dumazet 87918dd4ea net: inline net_is_devmem_iov()
1) Inline this small helper to reduce code size and decrease cpu costs.
2) Constify its argument.
3) Move it to include/net/netmem.h, as a prereq for the following patch.

$ scripts/bloat-o-meter -t vmlinux.2 vmlinux.3
add/remove: 0/2 grow/shrink: 0/4 up/down: 0/-158 (-158)
Function                                     old     new   delta
validate_xmit_skb                            866     857      -9
__pfx_net_is_devmem_iov                       16       -     -16
net_is_devmem_iov                             22       -     -22
get_netmem                                   152     130     -22
put_netmem                                   140     114     -26
tcp_recvmsg_locked                          3860    3797     -63
Total: Before=22566015, After=22565857, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-4-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:53 -08:00
Eric Dumazet cbe41362be gro: change the BUG_ON() in gro_pull_from_frag0()
Replace the BUG_ON() which never fired with a DEBUG_NET_WARN_ON_ONCE()

$ scripts/bloat-o-meter -t vmlinux.1 vmlinux.2
add/remove: 2/2 grow/shrink: 1/1 up/down: 370/-254 (116)
Function                                     old     new   delta
gro_try_pull_from_frag0                        -     196    +196
napi_gro_frags                               771     929    +158
__pfx_gro_try_pull_from_frag0                  -      16     +16
__pfx_gro_pull_from_frag0                     16       -     -16
dev_gro_receive                             1514    1464     -50
gro_pull_from_frag0                          188       -    -188
Total: Before=22565899, After=22566015, chg +0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-3-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:52 -08:00
Eric Dumazet c75734b585 net: always inline skb_frag_unref() and __skb_frag_unref()
clang is not inlining skb_frag_unref() and __skb_frag_unref()
in gro fast path.

It also does not inline gro_try_pull_from_frag0().

Using __always_inline fixes this issue, makes the
kernel faster _and_ smaller.

Also change __skb_frag_ref(), skb_frag_ref() and skb_page_unref()
to let them inlined for the last patch in this series.

$ scripts/bloat-o-meter -t vmlinux.0 vmlinux.1
add/remove: 2/6 grow/shrink: 1/2 up/down: 218/-511 (-293)
Function                                     old     new   delta
gro_pull_from_frag0                            -     188    +188
__pfx_gro_pull_from_frag0                      -      16     +16
skb_shift                                   1125    1139     +14
__pfx_skb_frag_unref                          16       -     -16
__pfx_gro_try_pull_from_frag0                 16       -     -16
__pfx___skb_frag_unref                        16       -     -16
__skb_frag_unref                              36       -     -36
skb_frag_unref                                59       -     -59
dev_gro_receive                             1608    1514     -94
napi_gro_frags                               892     771    -121
gro_try_pull_from_frag0                      153       -    -153
Total: Before=22566192, After=22565899, chg -0.00%

Signed-off-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20260122045720.1221017-2-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:18:52 -08:00
Jakub Kicinski 3f7522f36b Merge branch 'u64_stats-introduce-u64_stats_copy'
David Yang says:

====================
u64_stats: Introduce u64_stats_copy()

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.
====================

Link: https://patch.msgid.link/20260120092137.2161162-1-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:14:15 -08:00
David Yang 58d4ca5b69 vxlan: vnifilter: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-5-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:14:11 -08:00
David Yang 35710f0db8 macsec: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-4-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:14:11 -08:00
David Yang dd7fca9f49 net: bridge: mcast: fix memcpy with u64_stats
On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing. memcpy() should not be considered atomic
against u64 values. Use u64_stats_copy() instead.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-3-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:14:11 -08:00
David Yang df153517e4 u64_stats: Introduce u64_stats_copy()
The following (anti-)pattern was observed in the code tree:

        do {
                start = u64_stats_fetch_begin(&pstats->syncp);
                memcpy(&temp, &pstats->stats, sizeof(temp));
        } while (u64_stats_fetch_retry(&pstats->syncp, start));

On 64bit arches, struct u64_stats_sync is empty and provides no help
against load/store tearing, especially for memcpy(), for which arches may
provide their highly-optimized implements.

In theory the affected code should convert to u64_stats_t, or use
READ_ONCE()/WRITE_ONCE() properly.

However since there are needs to copy chunks of statistics, instead of
writing loops at random places, we provide a safe memcpy() variant for
u64_stats.

Signed-off-by: David Yang <mmyangfl@gmail.com>
Reviewed-by: Ido Schimmel <idosch@nvidia.com>
Link: https://patch.msgid.link/20260120092137.2161162-2-mmyangfl@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:14:11 -08:00
Dimitri Daskalakis 3085ff59fe Documentation: net: Fix typos in netdevices.rst
Fixes two minor typos. Specifically, on -> or and Devices -> Device.

Signed-off-by: Dimitri Daskalakis <dimitri.daskalakis1@gmail.com>
Link: https://patch.msgid.link/20260122225723.2368698-1-dimitri.daskalakis1@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-25 13:03:43 -08:00
Jakub Kicinski eac026ff97 Merge branch 'net-rds-rds-tcp-state-machine-and-message-loss-improvements'
Allison Henderson says:

====================
net/rds: RDS-TCP state machine and message loss improvements

This is subset 2 of the larger RDS-TCP patch series I posted last
Oct.  The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages.  I've broken it down into
smaller sets to make reviews more manageable.

In this set, we correct a few RDS/TCP connection handling issues, and
message loss issues.
====================

Link: https://patch.msgid.link/20260122055213.83608-1-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:51:38 -08:00
Gerd Rausch db69e9b838 net/rds: rds_tcp_accept_one ought to not discard messages
RDS/TCP differs from RDS/RDMA in that message acknowledgment
is done based on TCP sequence numbers:
As soon as the last byte of a message has been acknowledged
by the TCP stack of a peer, "rds_tcp_write_space()" goes on
to discard prior messages from the send queue.

Which is fine, for as long as the receiver never throws any messages away.

Unfortunately, that is *not* the case since the introduction of MPRDS:
commit 1a0e100fb2 "RDS: TCP: Enable multipath RDS for TCP"

A new function "rds_tcp_accept_one_path" was introduced,
which is entitled to return "NULL", if no connection path is currently
available.

Unfortunately, this happens after the "->accept()" call, and the new socket
often already contains messages, since the peer already transitioned
to "RDS_CONN_UP" on behalf of "TCP_ESTABLISHED".

That's also the case after this [1]:
commit 1a0e100fb2 "RDS: TCP: Force every connection to be initiated by
numerically smaller IP address"

which tried to address the situation of pending data by only transitioning
connections from a smaller IP address to "RDS_CONN_UP".

But even in those cases, and in particular if the "RDS_EXTHDR_NPATHS"
handshake has not occurred yet, and therefore we're working with
"c_npaths <= 1", "c_conn[0]" may be in a state distinct from
"RDS_CONN_DOWN", and therefore all messages on the just accepted socket
will be tossed away.

This fix changes "rds_tcp_accept_one":

* If connected from a peer with a larger IP address, the new socket
  will continue to get closed right away.
  With commit [1] above, there should not be any messages
  in the socket receive buffer, since the peer never transitioned
  to "RDS_CONN_UP".
  Therefore it should be okay to not make any efforts to dispatch
  the socket receive buffer.

* If connected from a peer with a smaller IP address,
  we call "rds_tcp_accept_one_path" to find a free slot/"path".
  If found, business goes on as usual.
  If none was found, we save/stash the newly accepted socket
  into "rds_tcp_accepted_sock", in order to not lose any
  messages that may have arrived already.
  We then return from "rds_tcp_accept_one" with "-ENOBUFS".
  Later on, when a slot/"path" does become available again
  (e.g. state transitioned to "RDS_CONN_DOWN",
   or HS extension header was received with "c_npaths > 1")
  we call "rds_tcp_conn_slots_available" that simply re-issues
  a "rds_tcp_accept_one_path" worker-callback and picks
  up the new socket from "rds_tcp_accepted_sock", and thereby
  continuing where it left with "-ENOBUFS" last time.
  Since a new slot has become available, those messages
  won't be lost, since processing proceeds as if that slot
  had been available the first time around.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Jack Vogel <jack.vogel@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-3-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:51:31 -08:00
Gerd Rausch ad22d24be6 net/rds: No shortcut out of RDS_CONN_ERROR
RDS connections carry a state "rds_conn_path::cp_state"
and transitions from one state to another and are conditional
upon an expected state: "rds_conn_path_transition."

There is one exception to this conditionality, which is
"RDS_CONN_ERROR" that can be enforced by "rds_conn_path_drop"
regardless of what state the condition is currently in.

But as soon as a connection enters state "RDS_CONN_ERROR",
the connection handling code expects it to go through the
shutdown-path.

The RDS/TCP multipath changes added a shortcut out of
"RDS_CONN_ERROR" straight back to "RDS_CONN_CONNECTING"
via "rds_tcp_accept_one_path" (e.g. after "rds_tcp_state_change").

A subsequent "rds_tcp_reset_callbacks" can then transition
the state to "RDS_CONN_RESETTING" with a shutdown-worker queued.

That'll trip up "rds_conn_init_shutdown", which was
never adjusted to handle "RDS_CONN_RESETTING" and subsequently
drops the connection with the dreaded "DR_INV_CONN_STATE",
which leaves "RDS_SHUTDOWN_WORK_QUEUED" on forever.

So we do two things here:

a) Don't shortcut "RDS_CONN_ERROR", but take the longer
   path through the shutdown code.

b) Add "RDS_CONN_RESETTING" to the expected states in
  "rds_conn_init_shutdown" so that we won't error out
  and get stuck, if we ever hit weird state transitions
  like this again."

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260122055213.83608-2-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:51:31 -08:00
Jakub Kicinski 6a9e8b60f0 Merge branch 'net-restore-the-structure-of-driver-facing-qcfg-api'
Jakub Kicinski says:

====================
net: restore the structure of driver-facing qcfg API

The goal of qcfg objects is to let us seamlessly support new use cases
without modifying all the drivers. We want to pull all the logic of
combining configuration supplied via different interfaces into the core
and present the drivers with a flat queue-by-queue configuration.
Additionally we want to separate the current effective configuration
from the user intent (default vs user setting vs memory provider setting).

Restructure the recently added code to re-introduce the pieces that
are missing compared to the old RFC:
https://lore.kernel.org/20250421222827.283737-1-kuba@kernel.org
Namely:
 - the netdev_queue_config() helper
 - queue config validation callback

I hopefully removed all the more "out there" parts of the RFC.
====================

Link: https://patch.msgid.link/20260122005113.2476634-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:58 -08:00
Jakub Kicinski b33006ebb7 eth: bnxt: plug bnxt_validate_qcfg() into qops
Plug bnxt_validate_qcfg() back into qops, where it was in my old RFC.

Link: https://patch.msgid.link/20260122005113.2476634-7-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski 8e3245cb30 net: add queue config validation callback
I imagine (tm) that as the number of per-queue configuration
options grows some of them may conflict for certain drivers.
While the drivers can obviously do all the validation locally
doing so is fairly inconvenient as the config is fed to drivers
piecemeal via different ops (for different params and NIC-wide
vs per-queue).

Add a centralized callback for validating the queue config
in queue ops. The callback gets invoked before memory provider
is installed, and in the future should also be called when ring
params are modified.

The validation is done after each layer of configuration.
Since we can't fail MP un-binding we must make sure that
the config is valid both before and after MP overrides are
applied. This is moot for now since the set of MP and device
configs are disjoint. It will matter significantly in the future,
so adding it now so that we don't forget..

Link: https://patch.msgid.link/20260122005113.2476634-6-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski fc1a78a25c net: use netdev_queue_config() for mp restart
We should follow the prepare/commit approach for queue configuration.
The qcfg struct should be added to dev->cfg rather than directly to
queue objects so that we can clone and discard the pending config
easily.

Remove the qcfg in struct netdev_rx_queue, and switch remaining callers
to netdev_queue_config(). netdev_queue_config() will construct the qcfg
on the fly based on device defaults and state of the queue.

ndo_default_qcfg becomes optional because having the callback itself
does not have any meaningful semantics to us.

Reviewed-by: Dragos Tatulea <dtatulea@nvidia.com>
Link: https://patch.msgid.link/20260122005113.2476634-5-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski da7772a2b4 net: move mp->rx_page_size validation to __net_mp_open_rxq()
Move mp->rx_page_size validation where the rest of MP input
validation lives. No other caller is modifying mp params so
validation logic in queue restarts is out of place.

Link: https://patch.msgid.link/20260122005113.2476634-4-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:02 -08:00
Jakub Kicinski b9ac2c60a3 net: introduce a trivial netdev_queue_config()
We may choose to extend or reimplement the logic which renders
the per-queue config. The drivers should not poke directly into
the queue state. Add a helper for drivers to use when they want
to query the config for a specific queue.

Link: https://patch.msgid.link/20260122005113.2476634-3-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:01 -08:00
Jakub Kicinski 1410c7416d eth: bnxt: always set the queue mgmt ops
Core provides a centralized callback for validating per-queue settings
but the callback is part of the queue management ops. Having the ops
conditionally set complicates the parts of the driver which could
otherwise lean on the core to feed it the correct settings.

Always set the queue ops, but provide no restart-related callbacks if
queue ops are not supported by the device. This should maintain current
behavior, the check in netdev_rx_queue_restart() looks both at op struct
and individual ops.

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Link: https://patch.msgid.link/20260122005113.2476634-2-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:49:01 -08:00
Jakub Kicinski 56f9058eb3 Merge branch 'selftest-extend-tun-virtio-coverage-for-gso-over-udp-tunnel'
Xu Du says:

====================
selftest: Extend tun/virtio coverage for GSO over UDP tunnel

The design strategy is to extend the existing tun testing infrastructure
to support this new use-case, rather than introducing a new or parallel framework.
This allows for better integration and re-use of existing test logic.
====================

Link: https://patch.msgid.link/cover.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:31 -08:00
Xu Du 87db2fdef5 selftest: tun: Add test data for success and failure paths
To improve the robustness and coverage of the TUN selftests, this
patch expands the set of test data.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/5054f3ad9f3dbfe33b827183fccc5efeb8fd0da7.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:29 -08:00
Xu Du 6bdd7ae605 selftest: tun: Add test for receiving gso packet from tun
The test validate that GSO information are correctly exposed
when reading packets from a TUN device.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/fe75ac66466380490eba858eef50596a1bfbd071.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:29 -08:00
Xu Du 400e658aa0 selftest: tun: Add test for sending gso packet into tun
The test constructs a raw packet, prepends a virtio_net_hdr,
and writes the result to the TUN device. This mimics the behavior
of a vm forwarding a guest's packet to the host networking stack.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/a988dbc9ca109e4f1f0b33858c5035bce8ebede3.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:28 -08:00
Xu Du 24e59f26ee selftest: tun: Add helpers for GSO over UDP tunnel
In preparation for testing GSO over UDP tunnels, enhance the test
infrastructure to support a more complex data path involving a TUN
device and a GENEVE udp tunnel.

This patch introduces a dedicated setup/teardown topology that creates
both a GENEVE tunnel interface and a TUN interface. The TUN device acts
as the VTEP (Virtual Tunnel Endpoint), allowing it to send and receive
virtio-net packets. This setup effectively tests the kernel's data path
for encapsulated traffic.

Note that after adding a new address to the UDP tunnel, we need to wait
a bit until the associated route is available.

Additionally, a new data structure is defined to manage test parameters.
This structure is designed to be extensible, allowing different test
data and configurations to be easily added in subsequent patches.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/b5787b8c269f43ce11e1756f1691cc7fd9a1e901.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:28 -08:00
Xu Du 82cfdcfa20 selftest: tun: Refactor tun_delete to use tuntap_helpers
The previous patch introduced common tuntap helpers to simplify
tun test code. This patch refactors the tun_delete function to
use these new helpers.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/ecc7c0c2d75d87cb814e97579e731650339703ab.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:28 -08:00
Xu Du a942fcd72e selftest: tun: Introduce tuntap_helpers.h header for TUN/TAP testing
Introduce rtnetlink manipulation and packet construction helpers that
will simplify the later creation of more related test cases. This avoids
duplicating logic across different test cases.

This new header will contain:
 - YNL-based netlink management utilities.
 - Helpers for ip link, ip address, ip neighbor and ip route operations.
 - Packet construction and manipulation helpers.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/91f905715c69c75f7bf72d43388921fde6c34989.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:28 -08:00
Xu Du e073c118db selftest: tun: Format tun.c existing code
In preparation for adding new tests for GSO over UDP tunnels,
apply consistently the kernel style to the existing code.

Signed-off-by: Xu Du <xudu@redhat.com>
Link: https://patch.msgid.link/d797de1e5a3d215dd78cb46775772ef682bab60e.1768979440.git.xudu@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:43:28 -08:00
Jakub Kicinski 35527de54f Merge branch 'geneve-introduce-double-tunnel-gso-gro-support'
Paolo Abeni says:

====================
geneve: introduce double tunnel GSO/GRO support

This is the [belated] incarnation of topic discussed in the last Neconf
[1].

In container orchestration in virtual environments there is a consistent
usage of double UDP tunneling - specifically geneve. Such setup lack
support of GRO and GSO for inter VM traffic.

After commit b430f6c38d ("Merge branch 'virtio_udp_tunnel_08_07_2025'
of https://github.com/pabeni/linux-devel") and the qemu cunter-part, VMs
are able to send/receive GSO over UDP aggregated packets.

This series introduces the missing bit for full end-to-end aggregation
in the above mentioned scenario. Specifically:

- introduces a new netdev feature set to generalize existing per device
driver GSO admission check.1
- adds GSO partial support for the geneve and vxlan drivers
- introduces and use a geneve option to assist double tunnel GRO
- adds some simple functional tests for the above.

The new device features set is not strictly needed for the following
work, but avoids the introduction of trivial `ndo_features_check` to
support GSO partial and thus possible performance regression due to the
additional indirect call. Such feature set could be leveraged by a
number of existing drivers (intel, meta and possibly wangxun) to avoid
duplicate code/tests. Such part has been omitted here to keep the series
small.

Both GSO partial support and double GRO support have some downsides.
With the first in place, GSO partial packets will traverse the network
stack 'downstream' the outer geneve UDP tunnel and will be visible by
the udp/IP/IPv6 and by netfilter. Currently only H/W NICs implement GSO
partial support and such packets are visible only via software taps.

Double UDP tunnel GRO will cook 'GSO partial' like aggregate packets,
i.e. the inner UDP encapsulation headers set will still carry the
wire-level lengths and csum, so that segmentation considering such
headers parts of a giant, constant encapsulation header will yield the
correct result.

The correct GSO packet layout is applied when the packet traverse the
outermost geneve encapsulation.

Both GSO partial and double UDP encap are disabled by default and must
be explicitly enabled via, respectively ethtool and geneve device
configuration.

Finally note that the GSO partial feature could potentially be applied
to all the other UDP tunnels, but this series limits its usage to geneve
and vxlan devices.

Link: https://netdev.bots.linux.dev/netconf/2024/paolo.pdf [1]
====================

Link: https://patch.msgid.link/cover.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:15 -08:00
Paolo Abeni 40146bf755 selftests: net: tests for add double tunneling GRO/GSO
Create a simple, netns-based topology with double, nested UDP tunnels and
perform TSO transfers on top.

Explicitly enable GSO and/or GRO and check the skb layout consistency with
different configuration allowing (or not) GSO frames to be delivered on
the other end.

The trickest part is account in a robust way the aggregated/unaggregated
packets with double encapsulation: use a classic bpf filter for it.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Petr Machata <petrm@nvidia.com>
Link: https://patch.msgid.link/61f2c98ba0f73057c2d6f6cb62eb807abd90bf6b.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni fd0dd79657 geneve: use GRO hint option in the RX path
At the GRO stage, when a valid hint option is found, try match the whole
nested headers and try to aggregate on the inner protocol; in case of hdr
mismatch extract the nested address and port to properly flush on a
per-inner flow basis.

On GRO completion, the (unmodified) nested headers will be considered part
of the (constant) outer geneve encap header so that plain UDP tunnel
segmentation will yield valid wire packets.

In the geneve RX path, when processing a GSO packet carrying a GRO hint
option, update the nested header length fields from the wire packet size to
the GSO-packet one. If the nested header additionally carries a checksum,
convert it to CSUM-partial.

Finally, when the RX path leverages the GRO hints, skip the additional GRO
stage done by GRO cells: otherwise the already set skb->encapsulation flag
will foul the GRO cells complete step to use touch the innermost IP header
when it should update the nested csum, corrupting the packet.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/4a9a390588a429191e0ffe48ccdd288bb69e567e.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni 0eaf63b3fc geneve: extract hint option at GRO stage
Add helpers for finding a GRO hint option in the geneve header, performing
basic sanitization of the option offsets vs the actual packet layout,
validate the option for GRO aggregation and check the nested header
checksum.

The validation helper closely mirrors similar check performed by the ipv4
and ipv6 gro callbacks, with the additional twist of accessing the
relevant network header via the GRO hint offset.

To validate the nested UDP checksum, leverage the csum completed of the
outer header, similarly to LCO, with the main difference that in this case
we have the outer checksum available.

Use the helpers to extract the hint info at the GRO stage.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/cd0e9dc42ba83f388b604097cffe268ffcb53351.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni e0a12cbf26 geneve: add GRO hint output path
If a geneve egress packet contains nested UDP encap headers, add a geneve
option including the information necessary on the RX side to perform GRO
aggregation of the whole packets: the nested network and transport headers,
and the nested protocol type.

Use geneve option class `netdev`, already registered in the Network
Virtualization Overlay (NVO3) IANA registry:

https://www.iana.org/assignments/nvo3/nvo3.xhtml#Linux-NetDev.

To pass the GRO hint information across the different xmit path functions,
store them in the skb control buffer, to avoid adding additional arguments.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/aa614567f7bdb776d693041375bede4990a19649.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni 1da80d91bd geneve: pass the geneve device ptr to geneve_build_skb()
Instead of handing to it the geneve configuration in multiple arguments.
This already avoids some code duplication and we are going to pass soon
more arguments to such function.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/761f05690646181fffc533ee4db59b68e5c3a0c3.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni 759b8d3cef geneve: constify geneve_hlen()
Such helper does not modify the argument; constifying it will additionally
simplify later patches.

Additionally move the definition earlier, still for later's patchesi sake.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/ea9e279b9544e8644194508dd9a4320ee455fa95.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni ba1b8c97b9 geneve: add netlink support for GRO hint
Allow configuring and dumping the new device option, and cache its value
into the geneve socket itself.
The new option is not tie to it any code yet.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/2295d4e4d1e919a3189425141bbc71c7850a2de0.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni e0b73b5a92 vxlan: expose gso partial features for tunnel offload
Similar to the previous patch, reuse the same helpers to add tunnel GSO
partial capabilities to vxlan devices.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/93d916c11b3a790a8bfccad77d9e85ee6e533042.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni 0c09e89f6c geneve: expose gso partial features for tunnel offload
GSO partial features for tunnels do not require any kind of support from
the underlying device: we can safely add them to the geneve UDP tunnel.

The only point of attention is the skb required features propagation in
the device xmit op: partial features must be stripped, except for
UDP_TUNNEL*.

Keep partial features disabled by default.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Link: https://patch.msgid.link/d851ca8e928cf05d68310bcbaeaa5e9e0b01e058.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:14 -08:00
Paolo Abeni 31c5a71d98 net: introduce mangleid_features
Some/most devices implementing gso_partial need to disable the GSO partial
features when the IP ID can't be mangled; to that extend each of them
implements something alike the following[1]:

	if (skb->encapsulation && !(features & NETIF_F_TSO_MANGLEID))
		features &= ~NETIF_F_TSO;

in the ndo_features_check() op, which leads to a bit of duplicate code.

Later patch in the series will implement GSO partial support for virtual
devices, and the current status quo will require more duplicate code and
a new indirect call in the TX path for them.

Introduce the mangleid_features mask, allowing the core to disable NIC
features based on/requiring MANGLEID, without any further intervention
from the driver.

The same functionality could be alternatively implemented adding a single
boolean flag to the struct net_device, but would require an additional
checks in ndo_features_check().

Also note that [1] is incorrect if the NIC additionally implements
NETIF_F_GSO_UDP_L4, mangleid_features transparently handle even such a
case.

Signed-off-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/5a7cdaeea40b0a29b88e525b6c942d73ed3b8ce7.1769011015.git.pabeni@redhat.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 11:31:05 -08:00
Jakub Kicinski 8811df1dd4 Merge branch 'net-convert-drivers-to-get_rx_ring_count-last-part'
Breno Leitao says:

====================
net: convert drivers to .get_rx_ring_count (last part)

Commit 84eaf4359c ("net: ethtool: add get_rx_ring_count callback to
optimize RX ring queries") added specific support for GRXRINGS callback,
simplifying .get_rxnfc.

Remove the handling of GRXRINGS in .get_rxnfc() by moving it to the new
.get_rx_ring_count().

This simplifies the RX ring count retrieval and aligns the following
drivers with the new ethtool API for querying RX ring parameters.

 * sfc
 * ionic
 * sfc/siena
 * sfc/ef100
 * fbnic
 * mana
 * nfp
 * atlantic
 * benet (this is v2 in fact, where v1 had some discussions that
   required a v2). See link [0]

Link: https://lore.kernel.org/all/20260119094514.5b12a097@kernel.org/ [0]

This is covering the last drivers, and as soon as this lands, I will
change the ethtool framework to avoid calling .get_rx_ring_count for
ETHTOOL_GRXRINGS, simplifying the ethtool core framework.
====================

Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-0-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:08 -08:00
Breno Leitao f584347c1a net: sfc: falcon: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-9-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:06 -08:00
Breno Leitao c9e4688b2e net: sfc: siena: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-8-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:06 -08:00
Breno Leitao 67f16fba55 net: sfc: efx: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-7-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:06 -08:00
Breno Leitao 8bd5cee989 net: ionic: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Since ETHTOOL_GRXRINGS was the only command handled by ionic_get_rxnfc(),
remove the function entirely.

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-6-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:05 -08:00
Breno Leitao ea28b02da8 net: fbnic: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-5-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:05 -08:00
Breno Leitao 3eb7225718 net: mana: convert to use .get_rx_ring_count
Use the newly introduced .get_rx_ring_count ethtool ops callback instead
of handling ETHTOOL_GRXRINGS directly in .get_rxnfc().

Since ETHTOOL_GRXRINGS was the only command handled by mana_get_rxnfc(),
remove the function entirely.

Reviewed-by: Subbaraya Sundeep <sbhatta@marvell.com>
Signed-off-by: Breno Leitao <leitao@debian.org>
Reviewed-by: Brett Creeley <brett.creeley@amd.com>
Link: https://patch.msgid.link/20260122-grxring_big_v4-v2-4-94dbe4dcaa10@debian.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2026-01-23 10:53:05 -08:00