linux/net/core
Eric Dumazet 68822bdf76 net: generalize skb freeing deferral to per-cpu lists
Logic added in commit f35f821935 ("tcp: defer skb freeing after socket
lock is released") helped bulk TCP flows to move the cost of skbs
frees outside of critical section where socket lock was held.

But for RPC traffic, or hosts with RFS enabled, the solution is far from
being ideal.

For RPC traffic, recvmsg() has to return to user space right after
skb payload has been consumed, meaning that BH handler has no chance
to pick the skb before recvmsg() thread. This issue is more visible
with BIG TCP, as more RPC fit one skb.

For RFS, even if BH handler picks the skbs, they are still picked
from the cpu on which user thread is running.

Ideally, it is better to free the skbs (and associated page frags)
on the cpu that originally allocated them.

This patch removes the per socket anchor (sk->defer_list) and
instead uses a per-cpu list, which will hold more skbs per round.

This new per-cpu list is drained at the end of net_action_rx(),
after incoming packets have been processed, to lower latencies.

In normal conditions, skbs are added to the per-cpu list with
no further action. In the (unlikely) cases where the cpu does not
run net_action_rx() handler fast enough, we use an IPI to raise
NET_RX_SOFTIRQ on the remote cpu.

Also, we do not bother draining the per-cpu list from dev_cpu_dead()
This is because skbs in this list have no requirement on how fast
they should be freed.

Note that we can add in the future a small per-cpu cache
if we see any contention on sd->defer_lock.

Tested on a pair of hosts with 100Gbit NIC, RFS enabled,
and /proc/sys/net/ipv4/tcp_rmem[2] tuned to 16MB to work around
page recycling strategy used by NIC driver (its page pool capacity
being too small compared to number of skbs/pages held in sockets
receive queues)

Note that this tuning was only done to demonstrate worse
conditions for skb freeing for this particular test.
These conditions can happen in more general production workload.

10 runs of one TCP_STREAM flow

Before:
Average throughput: 49685 Mbit.

Kernel profiles on cpu running user thread recvmsg() show high cost for
skb freeing related functions (*)

    57.81%  [kernel]       [k] copy_user_enhanced_fast_string
(*) 12.87%  [kernel]       [k] skb_release_data
(*)  4.25%  [kernel]       [k] __free_one_page
(*)  3.57%  [kernel]       [k] __list_del_entry_valid
     1.85%  [kernel]       [k] __netif_receive_skb_core
     1.60%  [kernel]       [k] __skb_datagram_iter
(*)  1.59%  [kernel]       [k] free_unref_page_commit
(*)  1.16%  [kernel]       [k] __slab_free
     1.16%  [kernel]       [k] _copy_to_iter
(*)  1.01%  [kernel]       [k] kfree
(*)  0.88%  [kernel]       [k] free_unref_page
     0.57%  [kernel]       [k] ip6_rcv_core
     0.55%  [kernel]       [k] ip6t_do_table
     0.54%  [kernel]       [k] flush_smp_call_function_queue
(*)  0.54%  [kernel]       [k] free_pcppages_bulk
     0.51%  [kernel]       [k] llist_reverse_order
     0.38%  [kernel]       [k] process_backlog
(*)  0.38%  [kernel]       [k] free_pcp_prepare
     0.37%  [kernel]       [k] tcp_recvmsg_locked
(*)  0.37%  [kernel]       [k] __list_add_valid
     0.34%  [kernel]       [k] sock_rfree
     0.34%  [kernel]       [k] _raw_spin_lock_irq
(*)  0.33%  [kernel]       [k] __page_cache_release
     0.33%  [kernel]       [k] tcp_v6_rcv
(*)  0.33%  [kernel]       [k] __put_page
(*)  0.29%  [kernel]       [k] __mod_zone_page_state
     0.27%  [kernel]       [k] _raw_spin_lock

After patch:
Average throughput: 73076 Mbit.

Kernel profiles on cpu running user thread recvmsg() looks better:

    81.35%  [kernel]       [k] copy_user_enhanced_fast_string
     1.95%  [kernel]       [k] _copy_to_iter
     1.95%  [kernel]       [k] __skb_datagram_iter
     1.27%  [kernel]       [k] __netif_receive_skb_core
     1.03%  [kernel]       [k] ip6t_do_table
     0.60%  [kernel]       [k] sock_rfree
     0.50%  [kernel]       [k] tcp_v6_rcv
     0.47%  [kernel]       [k] ip6_rcv_core
     0.45%  [kernel]       [k] read_tsc
     0.44%  [kernel]       [k] _raw_spin_lock_irqsave
     0.37%  [kernel]       [k] _raw_spin_lock
     0.37%  [kernel]       [k] native_irq_return_iret
     0.33%  [kernel]       [k] __inet6_lookup_established
     0.31%  [kernel]       [k] ip6_protocol_deliver_rcu
     0.29%  [kernel]       [k] tcp_rcv_established
     0.29%  [kernel]       [k] llist_reverse_order

v2: kdoc issue (kernel bots)
    do not defer if (alloc_cpu == smp_processor_id()) (Paolo)
    replace the sk_buff_head with a single-linked list (Jakub)
    add a READ_ONCE()/WRITE_ONCE() for the lockless read of sd->defer_list

Signed-off-by: Eric Dumazet <edumazet@google.com>
Acked-by: Paolo Abeni <pabeni@redhat.com>
Link: https://lore.kernel.org/r/20220422201237.416238-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2022-04-26 17:05:59 -07:00
..
Makefile net: kunit: add a test for dev_addr_lists 2021-11-20 12:25:57 +00:00
bpf_sk_storage.c bpf: Check for NULL return from bpf_get_btf_vmlinux 2022-03-20 19:21:38 -07:00
datagram.c net: remove noblock parameter from skb_recv_datagram() 2022-04-06 13:45:26 +01:00
datagram.h net/core: Allow the compiler to verify declaration and definition consistency 2019-03-27 13:49:44 -07:00
dev.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
dev.h net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
dev_addr_lists.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
dev_addr_lists_test.c net: kunit: add a test for dev_addr_lists 2021-11-20 12:25:57 +00:00
dev_ioctl.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
devlink.c devlink: introduce line card device info infrastructure 2022-04-25 10:42:28 +01:00
drop_monitor.c drop_monitor: remove quadratic behavior 2022-02-23 12:39:58 +00:00
dst.c net: dst: add net device refcount tracking to dst_entry 2021-12-06 16:05:10 -08:00
dst_cache.c wireguard: device: reset peer src endpoint when netns exits 2021-11-29 19:50:45 -08:00
failover.c net: failover: add net device refcount tracker 2021-12-06 16:06:02 -08:00
fib_notifier.c net: fib_notifier: propagate extack down to the notifier block callback 2019-10-04 11:10:56 -07:00
fib_rules.c fib: expand fib_rule_policy 2021-12-16 07:18:35 -08:00
filter.c ipv6: Use ipv6_only_sock() helper in condition. 2022-04-22 12:47:50 +01:00
flow_dissector.c flow_dissector: Add number of vlan tags dissector 2022-04-20 11:09:13 +01:00
flow_offload.c flow_offload: add reoffload process to update hw_count 2021-12-19 14:08:48 +00:00
gen_estimator.c net: sched: Remove Qdisc::running sequence counter 2021-10-18 12:54:41 +01:00
gen_stats.c net: stats: Read the statistics in ___gnet_stats_copy_basic() instead of adding. 2021-10-21 12:47:56 +01:00
gro.c Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net 2022-03-10 17:16:56 -08:00
gro_cells.c net: add per-cpu storage and net->core_stats 2022-03-11 23:17:24 -08:00
hwbm.c net: hwbm: Make the hwbm_pool lock a mutex 2019-06-09 19:40:10 -07:00
link_watch.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
lwt_bpf.c net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
lwtunnel.c lwtunnel: Validate RTA_ENCAP_TYPE attribute length 2021-12-31 14:31:59 +00:00
neighbour.c net: neigh: use kfree_skb_reason() for __neigh_event_send() 2022-02-26 12:53:59 +00:00
net-procfs.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
net-sysfs.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
net-sysfs.h net-sysfs: add netdev_change_owner() 2020-02-26 20:07:25 -08:00
net-traces.c tcp: add tracepoint for checksum errors 2021-05-14 15:26:03 -07:00
net_namespace.c net: initialize init_net earlier 2022-02-06 11:04:29 +00:00
netclassid_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
netevent.c net: core: Correct function name netevent_unregister_notifier() in the kerneldoc 2021-03-28 17:56:56 -07:00
netpoll.c netpoll: add net device refcount tracker to struct netpoll 2021-12-06 16:06:02 -08:00
netprio_cgroup.c bpf, cgroups: Fix cgroup v2 fallback on v1/v2 mixed mode 2021-09-13 16:35:58 -07:00
of_net.c Revert "of: net: support NVMEM cells with MAC in text format" 2022-01-12 14:14:36 +00:00
page_pool.c net: page_pool: introduce ethtool stats 2022-04-15 10:43:47 +01:00
pktgen.c proc: remove PDE_DATA() completely 2022-01-22 08:33:37 +02:00
ptp_classifier.c ptp: Add generic PTP is_sync() function 2022-03-07 11:31:34 +00:00
request_sock.c tcp: add rcu protection around tp->fastopen_rsk 2019-10-13 10:13:08 -07:00
rtnetlink.c Revert "rtnetlink: return EINVAL when request cannot succeed" 2022-04-22 11:02:00 +01:00
scm.c memcg: enable accounting for scm_fp_list objects 2021-07-20 06:00:38 -07:00
secure_seq.c net: align static siphash keys 2021-11-16 19:07:54 -08:00
selftests.c net: core: constify mac addrs in selftests 2021-10-24 13:59:44 +01:00
skbuff.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
skmsg.c bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is full 2022-03-15 16:43:31 +01:00
sock.c net: generalize skb freeing deferral to per-cpu lists 2022-04-26 17:05:59 -07:00
sock_destructor.h skb_expand_head() adjust skb->truesize incorrectly 2021-10-22 12:35:51 -07:00
sock_diag.c net: Don't include filter.h from net/sock.h 2021-12-29 08:48:14 -08:00
sock_map.c bpf: support BPF_PROG_QUERY for progs attached to sockmap 2022-01-20 21:30:58 -08:00
sock_reuseport.c tcp: Add stats for socket migration. 2021-06-23 12:56:08 -07:00
stream.c net: stream: don't purge sk_error_queue in sk_stream_kill_queues() 2021-10-16 09:06:09 +01:00
sysctl_net_core.c net: extract a few internals from netdevice.h 2022-04-07 20:32:09 -07:00
timestamping.c net: Introduce a new MII time stamping interface. 2019-12-25 19:51:33 -08:00
tso.c net: tso: add UDP segmentation support 2020-06-18 20:46:23 -07:00
utils.c net: core: Use csum_replace_by_diff() and csum_sub() instead of opencoding 2022-02-21 11:40:44 +00:00
xdp.c Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next 2022-03-22 11:18:49 -07:00