linux/include/net/netfilter
Florian Westphal 2d72afb340 netfilter: nf_conntrack: fix crash due to removal of uninitialised entry
A crash in conntrack was reported while trying to unlink the conntrack
entry from the hash bucket list:
    [exception RIP: __nf_ct_delete_from_lists+172]
    [..]
 #7 [ff539b5a2b043aa0] nf_ct_delete at ffffffffc124d421 [nf_conntrack]
 #8 [ff539b5a2b043ad0] nf_ct_gc_expired at ffffffffc124d999 [nf_conntrack]
 #9 [ff539b5a2b043ae0] __nf_conntrack_find_get at ffffffffc124efbc [nf_conntrack]
    [..]

The nf_conn struct is marked as allocated from slab but appears to be in
a partially initialised state:

 ct hlist pointer is garbage; looks like the ct hash value
 (hence crash).
 ct->status is equal to IPS_CONFIRMED|IPS_DYING, which is expected
 ct->timeout is 30000 (=30s), which is unexpected.

Everything else looks like normal udp conntrack entry.  If we ignore
ct->status and pretend its 0, the entry matches those that are newly
allocated but not yet inserted into the hash:
  - ct hlist pointers are overloaded and store/cache the raw tuple hash
  - ct->timeout matches the relative time expected for a new udp flow
    rather than the absolute 'jiffies' value.

If it were not for the presence of IPS_CONFIRMED,
__nf_conntrack_find_get() would have skipped the entry.

Theory is that we did hit following race:

cpu x 			cpu y			cpu z
 found entry E		found entry E
 E is expired		<preemption>
 nf_ct_delete()
 return E to rcu slab
					init_conntrack
					E is re-inited,
					ct->status set to 0
					reply tuplehash hnnode.pprev
					stores hash value.

cpu y found E right before it was deleted on cpu x.
E is now re-inited on cpu z.  cpu y was preempted before
checking for expiry and/or confirm bit.

					->refcnt set to 1
					E now owned by skb
					->timeout set to 30000

If cpu y were to resume now, it would observe E as
expired but would skip E due to missing CONFIRMED bit.

					nf_conntrack_confirm gets called
					sets: ct->status |= CONFIRMED
					This is wrong: E is not yet added
					to hashtable.

cpu y resumes, it observes E as expired but CONFIRMED:
			<resumes>
			nf_ct_expired()
			 -> yes (ct->timeout is 30s)
			confirmed bit set.

cpu y will try to delete E from the hashtable:
			nf_ct_delete() -> set DYING bit
			__nf_ct_delete_from_lists

Even this scenario doesn't guarantee a crash:
cpu z still holds the table bucket lock(s) so y blocks:

			wait for spinlock held by z

					CONFIRMED is set but there is no
					guarantee ct will be added to hash:
					"chaintoolong" or "clash resolution"
					logic both skip the insert step.
					reply hnnode.pprev still stores the
					hash value.

					unlocks spinlock
					return NF_DROP
			<unblocks, then
			 crashes on hlist_nulls_del_rcu pprev>

In case CPU z does insert the entry into the hashtable, cpu y will unlink
E again right away but no crash occurs.

Without 'cpu y' race, 'garbage' hlist is of no consequence:
ct refcnt remains at 1, eventually skb will be free'd and E gets
destroyed via: nf_conntrack_put -> nf_conntrack_destroy -> nf_ct_destroy.

To resolve this, move the IPS_CONFIRMED assignment after the table
insertion but before the unlock.

Pablo points out that the confirm-bit-store could be reordered to happen
before hlist add resp. the timeout fixup, so switch to set_bit and
before_atomic memory barrier to prevent this.

It doesn't matter if other CPUs can observe a newly inserted entry right
before the CONFIRMED bit was set:

Such event cannot be distinguished from above "E is the old incarnation"
case: the entry will be skipped.

Also change nf_ct_should_gc() to first check the confirmed bit.

The gc sequence is:
 1. Check if entry has expired, if not skip to next entry
 2. Obtain a reference to the expired entry.
 3. Call nf_ct_should_gc() to double-check step 1.

nf_ct_should_gc() is thus called only for entries that already failed an
expiry check. After this patch, once the confirmed bit check passes
ct->timeout has been altered to reflect the absolute 'best before' date
instead of a relative time.  Step 3 will therefore not remove the entry.

Without this change to nf_ct_should_gc() we could still get this sequence:

 1. Check if entry has expired.
 2. Obtain a reference.
 3. Call nf_ct_should_gc() to double-check step 1:
    4 - entry is still observed as expired
    5 - meanwhile, ct->timeout is corrected to absolute value on other CPU
      and confirm bit gets set
    6 - confirm bit is seen
    7 - valid entry is removed again

First do check 6), then 4) so the gc expiry check always picks up either
confirmed bit unset (entry gets skipped) or expiry re-check failure for
re-inited conntrack objects.

This change cannot be backported to releases before 5.19. Without
commit 8a75a2c174 ("netfilter: conntrack: remove unconfirmed list")
|= IPS_CONFIRMED line cannot be moved without further changes.

Cc: Razvan Cojocaru <rzvncj@gmail.com>
Link: https://lore.kernel.org/netfilter-devel/20250627142758.25664-1-fw@strlen.de/
Link: https://lore.kernel.org/netfilter-devel/4239da15-83ff-4ca4-939d-faef283471bb@gmail.com/
Fixes: 1397af5bfd ("netfilter: conntrack: remove the percpu dying list")
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2025-07-17 11:23:33 +02:00
..
ipv4 netfilter: disable defrag once its no longer needed 2021-04-26 03:20:07 +02:00
ipv6 netfilter: conntrack: fix boot failure with nf_conntrack.enable_hooks=1 2021-09-28 13:04:55 +02:00
br_netfilter.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_bpf_link.h bpf: minimal support for programs hooked into netfilter framework 2023-04-21 11:34:14 -07:00
nf_conntrack.h netfilter: nf_conntrack: fix crash due to removal of uninitialised entry 2025-07-17 11:23:33 +02:00
nf_conntrack_acct.h netfilter: conntrack: Remove unused function declarations 2023-08-08 13:02:00 +02:00
nf_conntrack_act_ct.h net/sched: act_ct: Always fill offloading tuple iifidx 2023-11-08 17:47:08 -08:00
nf_conntrack_bpf.h net: netfilter: move bpf_ct_set_nat_info kfunc in nf_nat_bpf.c 2022-10-03 09:17:32 -07:00
nf_conntrack_bridge.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_conntrack_core.h netfilter: conntrack: fix wrong ct->timeout value 2023-04-19 12:08:38 +02:00
nf_conntrack_count.h netfilter: move nf_ct_netns_get out of nf_conncount_init 2024-08-19 18:44:51 +02:00
nf_conntrack_ecache.h netfilter: conntrack: add conntrack event timestamp 2025-01-09 14:42:16 +01:00
nf_conntrack_expect.h netfilter: allow exp not to be removed in nf_ct_find_expectation 2023-07-20 10:06:36 +02:00
nf_conntrack_extend.h netfilter: extensions: introduce extension genid count 2022-05-13 18:52:16 +02:00
nf_conntrack_helper.h netfilter: helper: Remove unused function declarations 2023-08-08 13:01:59 +02:00
nf_conntrack_l4proto.h netfilter: conntrack: pass hook state to log functions 2021-06-18 14:47:43 +02:00
nf_conntrack_labels.h netfilter: conntrack: switch connlabels to atomic_t 2023-10-24 13:16:30 +02:00
nf_conntrack_seqadj.h netfilter: conntrack: remove extension register api 2022-02-04 06:30:28 +01:00
nf_conntrack_synproxy.h netfilter: conntrack: wrap two inline functions in config checks. 2019-09-13 12:47:10 +02:00
nf_conntrack_timeout.h netfilter: nf_conntrack: add missing __rcu annotations 2022-07-11 16:25:15 +02:00
nf_conntrack_timestamp.h netfilter: conntrack: remove extension register api 2022-02-04 06:30:28 +01:00
nf_conntrack_tuple.h netfilter: conntrack: don't fold port numbers into addresses before hashing 2023-07-05 14:42:16 +02:00
nf_conntrack_zones.h netfilter: conntrack: remove CONFIG_NF_CONNTRACK checks from nf_conntrack_zones.h. 2019-09-13 12:47:41 +02:00
nf_dup_netdev.h netfilter: nft_{fwd,dup}_netdev: add offload support 2019-09-10 22:44:29 +02:00
nf_flow_table.h netfilter: flowtable: account for Ethernet header in nf_flow_pppoe_proto() 2025-07-10 17:12:28 -07:00
nf_hooks_lwtunnel.h sysctl: treewide: constify the ctl_table argument of proc_handlers 2024-07-24 20:59:29 +02:00
nf_log.h netfilter: nf_log_common: merge with nf_log_syslog 2021-03-31 22:34:10 +02:00
nf_nat.h net: move the nat function to nf_nat_ovs for ovs and tc 2022-12-12 10:14:03 +00:00
nf_nat_helper.h netfilter: nat: move repetitive nat port reserve loop to a helper 2022-09-07 16:46:04 +02:00
nf_nat_masquerade.h netfilter: update include directives. 2019-09-13 12:33:06 +02:00
nf_nat_redirect.h netfilter: nft_redir: use `struct nf_nat_range2` throughout and deduplicate eval call-backs 2023-03-22 21:48:59 +01:00
nf_queue.h netfilter: move nf_reinject into nfnetlink_queue modules 2024-02-21 12:03:22 +01:00
nf_reject.h netfilter: conntrack: skip verification of zero UDP checksum 2022-05-13 18:56:28 +02:00
nf_socket.h netfilter: Decrease code duplication regarding transparent socket option 2018-06-03 00:02:01 +02:00
nf_synproxy.h netfilter: remove CONFIG_NETFILTER checks from headers. 2019-09-13 12:47:36 +02:00
nf_tables.h Revert "netfilter: nf_tables: Add notifications for hook changes" 2025-07-14 15:22:47 +02:00
nf_tables_core.h netfilter: nft_inner: incorrect percpu area handling under softirq 2024-12-03 22:10:58 +01:00
nf_tables_ipv4.h netfilter: nf_tables: restore IP sanity checks for netdev/egress 2024-08-26 13:05:28 +02:00
nf_tables_ipv6.h netfilter: nf_tables_ipv6: consider network offset in netdev/egress validation 2024-08-27 18:11:56 +02:00
nf_tables_offload.h netfilter: nf_tables: bail out early if hardware offload is not supported 2022-06-06 19:19:15 +02:00
nf_tproxy.h net: reformat kdoc return statements 2024-12-09 14:44:59 -08:00
nft_fib.h netfilter: nf_tables: nft_fib: consistent l3mdev handling 2025-05-23 13:57:09 +02:00
nft_meta.h netfilter: nf_tables: drop unused 3rd argument from validate callback ops 2024-09-03 10:47:17 +02:00
nft_reject.h netfilter: nf_tables: drop unused 3rd argument from validate callback ops 2024-09-03 10:47:17 +02:00
xt_rateest.h net: sched: Merge Qdisc::bstats and Qdisc::cpu_bstats data types 2021-10-18 12:54:41 +01:00