Commit Graph

243 Commits

Author SHA1 Message Date
Linus Torvalds e0b38d286e for-7.0-rc3-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmmytygACgkQxWXV+ddt
 WDuiKw/+IyoS2IE95Jd4G2hezcsaqJNAvv5q90vVyLepT/7jzRbYrHytk7z/0v56
 +Pjc1JHgp+TidYKZa52E/R09eEvYCyuZvEq4bnClOnO3CAJDCixqTKWV70CcYiHX
 HoSCuPML2CJMLZY3u6MxATgk46y1ey3UkbmQ4oufoUHrjAE+D3pDQsYQ9BWR/P6J
 4LbyZ214uIUPvbp0wPJ14cVUMpNxs116JmWvK9dxWQNZSzFcq2IVuLwQUcZBPAdb
 dursd0kt6HXXmNCITmUD8O+ipG4U1akn8FuCjaLqo4LF1AH2f6OzNjucsisfa1tE
 4MD+ATzmNsew5q6dtyfcvv8Btl+olbTP4KGibLl9PCCM9vlkzl3EN5GPUllGi6S9
 T8jqe2pILXZHEx1DIQjHaXJsnuHG9aEkUbkSHUxIa15QDj66omrZZkZG3EF5Buy9
 TogJJgESYU19dt/y110Q/vD/LPOgJhB3NBXwIx1FDYA1OwaAjt6hUcVRVRI5PtpL
 moAIG4nNrPz5Pa875NvguFmrXFIudpTqyANHCPVio4eqDR7LeSg9vVHj9zeOfeRD
 gmn3WGuGNb6L+86TNet/nFQhjxhkKqsnSbzO2zoPQKhG6FS+HSLnUwvJYL4/YnkW
 QEeveeI/6Duiei4DHuw7FXmZVNQxeBEWW6MTFDHA1UNV3HMKewA=
 =TgSD
 -----END PGP SIGNATURE-----

Merge tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs fixes from David Sterba:

 - detect possible file name hash collision earlier so it does not lead
   to transaction abort

 - handle b-tree leaf overflows when snapshotting a subvolume with set
   received UUID, leading to transaction abort

 - in zoned mode, reorder relocation block group initialization after
   the transaction kthread start

 - fix orphan cleanup state tracking of subvolume, this could lead to
   invalid dentries under some conditions

 - add locking around updates of dynamic reclain state update

 - in subpage mode, add missing RCU unlock when trying to releae extent
   buffer

 - remap tree fixes:
     - add missing description strings for the newly added remap tree
     - properly update search key when iterating backrefs

* tag 'for-7.0-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
  btrfs: remove duplicated definition of btrfs_printk_in_rcu()
  btrfs: remove unnecessary transaction abort in the received subvol ioctl
  btrfs: abort transaction on failure to update root in the received subvol ioctl
  btrfs: fix transaction abort on set received ioctl due to item overflow
  btrfs: fix transaction abort when snapshotting received subvolumes
  btrfs: fix transaction abort on file creation due to name hash collision
  btrfs: read key again after incrementing slot in move_existing_remaps()
  btrfs: add missing RCU unlock in error path in try_release_subpage_extent_buffer()
  btrfs: set BTRFS_ROOT_ORPHAN_CLEANUP during subvol create
  btrfs: zoned: move btrfs_zoned_reserve_data_reloc_bg() after kthread start
  btrfs: hold space_info->lock when clearing periodic reclaim ready
  btrfs: print-tree: add remap tree definitions
2026-03-12 12:15:27 -07:00
Sun YangKai b8883b61f2 btrfs: hold space_info->lock when clearing periodic reclaim ready
btrfs_set_periodic_reclaim_ready() requires space_info->lock to be held,
as enforced by lockdep_assert_held(). However, btrfs_reclaim_sweep() was
calling it after do_reclaim_sweep() returns, at which point
space_info->lock is no longer held.

Fix this by explicitly acquiring space_info->lock before clearing the
periodic reclaim ready flag in btrfs_reclaim_sweep().

Reported-by: Chris Mason <clm@meta.com>
Link: https://lore.kernel.org/linux-btrfs/20260208182556.891815-1-clm@meta.com/
Fixes: 19eff93dc7 ("btrfs: fix periodic reclaim condition")
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-03-03 15:54:00 +01:00
Kees Cook 69050f8d6d treewide: Replace kmalloc with kmalloc_obj for non-scalar types
This is the result of running the Coccinelle script from
scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to
avoid scalar types (which need careful case-by-case checking), and
instead replace kmalloc-family calls that allocate struct or union
object instances:

Single allocations:	kmalloc(sizeof(TYPE), ...)
are replaced with:	kmalloc_obj(TYPE, ...)

Array allocations:	kmalloc_array(COUNT, sizeof(TYPE), ...)
are replaced with:	kmalloc_objs(TYPE, COUNT, ...)

Flex array allocations:	kmalloc(struct_size(PTR, FAM, COUNT), ...)
are replaced with:	kmalloc_flex(*PTR, FAM, COUNT, ...)

(where TYPE may also be *VAR)

The resulting allocations no longer return "void *", instead returning
"TYPE *".

Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-21 01:02:28 -08:00
Filipe Manana 5eb01bf4a9 btrfs: remove out label in btrfs_init_space_info()
There is no point in having the label since all it does is return the
value in the 'ret' variable. Instead make every goto return directly
and remove the label.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:56:23 +01:00
Sun YangKai 19eff93dc7 btrfs: fix periodic reclaim condition
Problems with current implementation:

1. reclaimable_bytes is signed while chunk_sz is unsigned, causing
   negative reclaimable_bytes to trigger reclaim unexpectedly

2. The "space must be freed between scans" assumption breaks the
   two-scan requirement: first scan marks block groups, second scan
   reclaims them. Without the second scan, no reclamation occurs.

Instead, track actual reclaim progress: pause reclaim when block groups
will be reclaimed, and resume only when progress is made. This ensures
reclaim continues until no further progress can be made. And resume
periodic reclaim when there's enough free space.

And we take care if reclaim is making any progress now, so it's
unnecessary to set periodic_reclaim_ready to false when failed to reclaim
a block group.

Fixes: 813d4c6422 ("btrfs: prevent pathological periodic reclaim loops")
CC: stable@vger.kernel.org # 6.12+
Suggested-by: Boris Burkov <boris@bur.io>
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Sun YangKai <sunk67188@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:37 +01:00
Mark Harmstone b56f35560b btrfs: handle setting up relocation of block group with remap-tree
Handle the preliminary work for relocating a block group in a filesystem
with the remap-tree flag set.

If the block group is SYSTEM btrfs_relocate_block_group() proceeds as it
does already, as bootstrapping issues mean that these block groups have
to be processed the existing way. Similarly with METADATA_REMAP blocks, which
are dealt with in a later patch.

Otherwise we walk the free-space tree for the block group in question,
recording any holes. These get converted into identity remaps and placed
in the remap tree, and the block group's REMAPPED flag is set. From now
on no new allocations are possible within this block group, and any I/O
to it will be funnelled through btrfs_translate_remap(). We store the
number of identity remaps in `identity_remap_count`, so that we know
when we've removed the last one and the block group is fully remapped.

The change in btrfs_read_roots() is because data relocations no longer
rely on the data reloc tree as a hidden subvolume in which to do
snapshots.

(Thanks to Sun YangKai for his suggestions.)

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:35 +01:00
Mark Harmstone 0b4d29fa98 btrfs: add METADATA_REMAP chunk type
Add a new METADATA_REMAP chunk type, which is a metadata chunk that holds the
remap tree.

This is needed for bootstrapping purposes: the remap tree can't itself
be remapped, and must be relocated the existing way, by COWing every
leaf. The remap tree can't go in the SYSTEM chunk as space there is
limited, because a copy of the chunk item gets placed in the superblock.

The changes in fs/btrfs/volumes.h are because we're adding a new block
group type bit after the profile bits, and so can no longer rely on the
const_ilog2 trick.

The sizing to 32MB per chunk, matching the SYSTEM chunk, is an estimate
here, we can adjust it later if it proves to be too big or too small.
This works out to be ~500,000 remap items, which for a 4KB block size
covers ~2GB of remapped data in the worst case and ~500TB in the best case.

Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Mark Harmstone <mark@harmstone.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:54:27 +01:00
Filipe Manana c208aa0ef6 btrfs: add and use helper to compute the available space for a block group
We have currently three places that compute how much available space a
block group has. Add a helper function for this and use it in those
places.

Reviewed-by: Boris Burkov <boris@bur.io>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:51:44 +01:00
Johannes Thumshirn 2ef2e97fe7 btrfs: move space_info_flag_to_str() to space-info.h
Move space_info_flag_to_str() to space-info.h and as it now isn't static
to space-info.c any more prefix it with 'btrfs_'.

This way it can be re-used in other places.

Reviewed-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com>
Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-02-03 07:49:12 +01:00
Jiasheng Jiang a11224a016 btrfs: fix memory leaks in create_space_info() error paths
In create_space_info(), the 'space_info' object is allocated at the
beginning of the function. However, there are two error paths where the
function returns an error code without freeing the allocated memory:

1. When create_space_info_sub_group() fails in zoned mode.
2. When btrfs_sysfs_add_space_info_type() fails.

In both cases, 'space_info' has not yet been added to the
fs_info->space_info list, resulting in a memory leak. Fix this by
adding an error handling label to kfree(space_info) before returning.

Fixes: 2be12ef79f ("btrfs: Separate space_info create/update")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Jiasheng Jiang <jiashengjiangcool@gmail.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2026-01-12 16:21:55 +01:00
David Sterba 1c094e6cce btrfs: make a few more ASSERTs verbose
We have support for optional string to be printed in ASSERT() (added in
19468a623a ("btrfs: enhance ASSERT() to take optional format
string")), it's not yet everywhere it could be so add a few more files.

Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:24 +01:00
Filipe Manana fe1e50031f btrfs: move struct reserve_ticket definition to space-info.c
It's not used anywhere outside space-info.c so move it from space-info.h
into space-info.c.

Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:42:23 +01:00
Qu Wenruo c5667f9c8e btrfs: headers cleanup to remove unnecessary local includes
[BUG]
When I tried to remove btrfs_bio::fs_info and use btrfs_bio::inode to
grab the fs_info, the header "btrfs_inode.h" is needed to access the
full btrfs_inode structure.

Then btrfs will fail to compile.

[CAUSE]
There is a recursive including chain:

  "bio.h" -> "btrfs_inode.h" -> "extent_map.h" -> "compression.h" ->
  "bio.h"

That recursive including is causing problems for btrfs.

[ENHANCEMENT]
To reduce the risk of recursive including:

- Remove unnecessary local includes from btrfs headers
  Either the included header is pulled in by other headers, or is
  completely unnecessary.

- Remove btrfs local includes if the header only requires a pointer
  In that case let the implementing C file to pull the required header.

  This is especially important for headers like "btrfs_inode.h" which
  pulls in a lot of other btrfs headers, thus it's a mine field of
  recursive including.

- Remove unnecessary temporary structure definition
  Either if we have included the header defining the structure, or
  completely unused.

Now including "btrfs_inode.h" inside "bio.h" is completely fine,
although "btrfs_inode.h" still includes "extent_map.h", but that header
only includes "fs.h", no more chain back to "bio.h".

Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:34:52 +01:00
Filipe Manana 38e03b820e btrfs: annotate as unlikely fs aborted checks in space flushing code
It's not expected to have the fs in an aborted state, so surround the
abortion checks with unlikely to make it clear it's unexpected and to
hint the compiler to generate better code.

Also at maybe_fail_all_tickets() untangle all repeated checks for the
abortion into a single if-then-else. This makes things more readable
and makes the compiler generate less code. On x86_64 with gcc 14.2.0-19
from Debian I got the following object size differences.

Before this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2021606	 179704	  25088	2226398	 21f8de	fs/btrfs/btrfs.ko

After this change:

  $ size fs/btrfs/btrfs.ko
     text	   data	    bss	    dec	    hex	filename
  2021458	 179704	  25088	2226250	 21f84a	fs/btrfs/btrfs.ko

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:20:29 +01:00
Filipe Manana f912f0af13 btrfs: avoid space_info locking when checking if tickets are served
When checking if a ticket was served, we take the space_info's spinlock.
If the ticket was served (its ->bytes is 0) or had an error (its ->error
it not 0) then we just unlock the space_info and return.

This however causes contention on the space_info's spinlock, which is
heavily used (space reservation, space flushing, allocating and
deallocating an extent from a block group (btrfs_update_block_group()),
etc).

Instead of using the space_info's spinlock to check if a ticket was
served, use a per ticket spinlock which isn't used by anyone other than
the task that created the ticket (stack allocated) and the task that
serves the ticket (a reclaim task or any task deallocating space that
ends up at btrfs_try_granting_tickets()).

After applying this patch and all previous patches from the same patchset
(many attempt to reduce space_info critical sections), lockstat showed
some improvements for a fs_mark test regarding the space_info's spinlock
'lock'. The lockstat results:

Before patchset:

  con-bounces:     13733858
  contentions:     15902322
  waittime-total:  264902529.72
  acq-bounces:     28161791
  acquisitions:    38679282

After patchset:

  con-bounces:     12032220
  contentions:     13598034
  waittime-total:  221806127.28
  acq-bounces:     24717947
  acquisitions:    34103281

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:18:21 +01:00
Filipe Manana 50a51b5378 btrfs: move ticket wakeup and finalization to remove_ticket()
Instead of repeating the wakeup and setup of the ->bytes or ->error field,
move those steps to remove_ticket() to avoid duplication. This is also
needed for the next patch in the series, so that we avoid duplicating more
logic.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:18:17 +01:00
Filipe Manana cdf8a566ee btrfs: add data_race() in btrfs_account_ro_block_groups_free_space()
Surround the intentional empty list check with the data_race() annotation
so that tools like KCSAN don't report a data race. The race is intentional
as it's harmless and we want to avoid lock contention of the space_info
since its lock is heavily used (space reservation, space flushing, extent
allocation and deallocation, etc).

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:16:27 +01:00
Filipe Manana b70c32f10a btrfs: remove double underscore prefix from __reserve_bytes()
The use of a double underscore prefix is discouraged and we have no
justification at all for it all since there's no reserved_bytes() counter
part. So remove the prefix.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:13:00 +01:00
Filipe Manana 189db25105 btrfs: process ticket outside global reserve critical section
In steal_from_global_rsv() there's no need to process the ticket inside
the critical section of the global reserve. Move the ticket processing to
happen after the critical section. This helps reduce contention on the
global reserve's spinlock.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:12:37 +01:00
Filipe Manana 5ca7725ddf btrfs: assign booleans to global reserve's full field
We have a couple places that are assigning 0 and 1 to the full field of
the global reserve. This is harmless since 0 is converted to false and
1 converted to true, but for better readability, replace these with true
and false since the field is of type bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:12:07 +01:00
Filipe Manana f18a203a1b btrfs: assert space_info is locked in steal_from_global_rsv()
The caller is supposed to have locked the space_info, so assert that.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:11:40 +01:00
Filipe Manana afbc047ab0 btrfs: avoid unnecessary reclaim calculation in priority_reclaim_metadata_space()
If the given ticket was already served (its ->bytes is 0), then we wasted
time calculating the metadata reclaim size. So calculate it only after we
checked the ticket was not yet served.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:10:18 +01:00
Filipe Manana 4ddb077378 btrfs: shorten critical section in btrfs_preempt_reclaim_metadata_space()
We are doing a lot of small calculations and assignments while holding the
space_info's spinlock, which is a heavily used lock for space reservation
and flushing. There's no point in holding the lock for so long when all we
want is to call need_preemptive_reclaim() and get a consistent value for a
couple of counters from the space_info. Instead, grab the counters into
local variables, release the lock and then use the local variables.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:09:46 +01:00
Filipe Manana 8ab2b8bdbe btrfs: increment loop count outside critical section during metadata reclaim
In btrfs_preempt_reclaim_metadata_space() there's no need to increment the
local variable that tracks the number of iterations of the while loop
while inside the critical section delimited by the space_info's spinlock.
That spinlock is heavily used by space reservation and flushing code, so
it's desirable to have its critical sections as short as possible.
So move the loop count incremented outside the critical section.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:09:14 +01:00
Filipe Manana 49f204be22 btrfs: bail out earlier from need_preemptive_reclaim() if we have tickets
Instead of doing some calculations and then return false if it turns out
we have queued tickets, check first if we have tickets and return false
immediately if we have tickets, without wasting time on doing those
computations.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:08:46 +01:00
Filipe Manana 6f4779faa0 btrfs: inline btrfs_space_info_used()
The function is simple enough to be inlined and in fact doing it even
reduces the object code. In x86_64 with gcc 14.2.0-19 from Debian the
results were the following:

  Before this change

    $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
    1919410	 161703	  15592	2096705	 1ffe41	fs/btrfs/btrfs.ko

  After this change

    $ size fs/btrfs/btrfs.ko
      text	   data	    bss	    dec	    hex	filename
    1918991	 161675	  15592	2096258	 1ffc82	fs/btrfs/btrfs.ko

Also remove the ASSERT() that checks the space_info argument is not NULL,
as it's odd to be there since it can never be NULL and in case that ever
happens during development, a stack trace from a NULL pointer dereference
will be obvious. It was originally added when btrfs_space_info_used() was
introduced in commit 4136135b08 ("Btrfs: use helper to get used bytes
of space_info").

Also add a lockdep assertion to check the space_info's lock is being held
by the calling task.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:08:14 +01:00
Filipe Manana 0ce6300fec btrfs: avoid used space computation when reserving space
In __reserve_bytes() we have 3 repeated calls to btrfs_space_info_used(),
one early on as soon as take the space_info's spinlock, another one when
we call btrfs_can_overcommit(), which calls btrfs_space_info_used() again,
and a final one when we are reserving for a flush emergency.

During all these calls we are holding the space_info's spinlock, which is
heavily used by the space reservation and flushing code, so it's desirable
to make the critical sections as short as possible.

So make this more efficient by:

1) Instead of calling btrfs_can_overcommit() call the new variant
   can_overcommit() which takes the space_info's used space as an argument
   and pass the value we already computed and have in the 'used' variable;

2) Instead of calling btrfs_space_info_used() with its second argument as
   false when we are doing a flush emergency, decrement the space_info's
   bytes_may_use counter from the 'used' variable, as the difference
   between passing true or false as the second argument to
   btrfs_space_info_used() is whether or not to include the space_info's
   bytes_may_use counter in the computation.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:07:36 +01:00
Filipe Manana a5f8f64aa3 btrfs: avoid used space computation when trying to grant tickets
In btrfs_try_granting_tickets(), we call btrfs_can_overcommit() and that
calls btrfs_space_info_used(). But we already keep track, in the 'used'
local variable, of the used space in the space_info, so we are just
repeating the same computation and doing an extra function call while we
are holding the space_info's spinlock, which is heavily used by the space
reservation and flushing code.

So add a local variant of btrfs_can_overcommit() that takes in the used
space as an argument and therefore does not call btrfs_space_info_used(),
and use it in btrfs_try_granting_tickets().

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:07:26 +01:00
Filipe Manana 563ef2befb btrfs: make btrfs_can_overcommit() return bool instead of int
It's a boolean function, so switch its return type to bool.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:06:26 +01:00
Filipe Manana 60532c2136 btrfs: avoid recomputing used space in btrfs_try_granting_tickets()
In every iteration of the loop we call btrfs_space_info_used() which sums
a bunch of fields from a space_info object. This implies doing a function
call besides the sum, and we are holding the space_info's spinlock while
we do this, so we want to keep the critical section as short as possible
since that spinlock is used in all the code for space reservation and
flushing (therefore it's heavily used).

So call btrfs_try_granting_tickets() only once, before entering the loop,
and then update it as we remove tickets.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:05:06 +01:00
Filipe Manana 063171a4f0 btrfs: return real error when failing tickets in maybe_fail_all_tickets()
In case we had a transaction abort we set a ticket's error to -EIO, but we
have the real error that caused the transaction to be aborted returned by
the macro BTRFS_FS_ERROR(). So use that real error instead of -EIO.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:04:30 +01:00
Filipe Manana 771af6ff72 btrfs: remove fs_info argument from btrfs_sysfs_add_space_info_type()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 22:02:30 +01:00
Filipe Manana a1359d06d7 btrfs: remove fs_info argument from btrfs_reserve_metadata_bytes()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:11 +01:00
Filipe Manana 30b87a2319 btrfs: remove fs_info argument from __reserve_bytes()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:11 +01:00
Filipe Manana 09d0f28531 btrfs: fix parameter documentation for btrfs_reserve_data_bytes()
We don't have a fs_info argument anymore since commit 5d39fda880
("btrfs: pass btrfs_space_info to btrfs_reserve_data_bytes()"), it
was replaced by a space_info argument. So update the documentation.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:11 +01:00
Filipe Manana 5495cbe920 btrfs: remove fs_info argument from maybe_clamp_preempt()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:11 +01:00
Filipe Manana e182eca6ed btrfs: remove fs_info argument from handle_reserve_ticket()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:11 +01:00
Filipe Manana ddeac2a12b btrfs: remove fs_info argument from steal_from_global_rsv()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana d77b22de56 btrfs: remove fs_info argument from need_preemptive_reclaim()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana 4199eb2761 btrfs: remove fs_info argument from btrfs_calc_reclaim_metadata_size()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana 3ee1246536 btrfs: remove fs_info argument from shrink_delalloc() and flush_space()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana e96059c9d7 btrfs: remove fs_info argument from btrfs_dump_space_info()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana 78a77f4da4 btrfs: remove fs_info argument from btrfs_can_overcommit()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:10 +01:00
Filipe Manana 302b4b69c4 btrfs: remove fs_info argument from calc_available_free_space()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Filipe Manana 1b809e3055 btrfs: remove fs_info argument from maybe_fail_all_tickets()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Filipe Manana cf3ae29caf btrfs: remove fs_info argument from priority_reclaim_metadata_space()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Filipe Manana f63b36686b btrfs: remove fs_info argument from priority_reclaim_data_space()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Filipe Manana e3df6408b1 btrfs: remove fs_info argument from btrfs_try_granting_tickets()
We don't need it since we can grab fs_info from the given space_info.
So remove the fs_info argument.

Reviewed-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Reviewed-by: Anand Jain <asj@kernel.org>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:59:09 +01:00
Boris Burkov 38e818718c btrfs: fix racy bitfield write in btrfs_clear_space_info_full()
From the memory-barriers.txt document regarding memory barrier ordering
guarantees:

 (*) These guarantees do not apply to bitfields, because compilers often
     generate code to modify these using non-atomic read-modify-write
     sequences.  Do not attempt to use bitfields to synchronize parallel
     algorithms.

 (*) Even in cases where bitfields are protected by locks, all fields
     in a given bitfield must be protected by one lock.  If two fields
     in a given bitfield are protected by different locks, the compiler's
     non-atomic read-modify-write sequences can cause an update to one
     field to corrupt the value of an adjacent field.

btrfs_space_info has a bitfield sharing an underlying word consisting of
the fields full, chunk_alloc, and flush:

struct btrfs_space_info {
        struct btrfs_fs_info *     fs_info;              /*     0     8 */
        struct btrfs_space_info *  parent;               /*     8     8 */
        ...
        int                        clamp;                /*   172     4 */
        unsigned int               full:1;               /*   176: 0  4 */
        unsigned int               chunk_alloc:1;        /*   176: 1  4 */
        unsigned int               flush:1;              /*   176: 2  4 */
        ...

Therefore, to be safe from parallel read-modify-writes losing a write to
one of the bitfield members protected by a lock, all writes to all the
bitfields must use the lock. They almost universally do, except for
btrfs_clear_space_info_full() which iterates over the space_infos and
writes out found->full = 0 without a lock.

Imagine that we have one thread completing a transaction in which we
finished deleting a block_group and are thus calling
btrfs_clear_space_info_full() while simultaneously the data reclaim
ticket infrastructure is running do_async_reclaim_data_space():

          T1                                             T2
btrfs_commit_transaction
  btrfs_clear_space_info_full
  data_sinfo->full = 0
  READ: full:0, chunk_alloc:0, flush:1
                                              do_async_reclaim_data_space(data_sinfo)
                                              spin_lock(&space_info->lock);
                                              if(list_empty(tickets))
                                                space_info->flush = 0;
                                                READ: full: 0, chunk_alloc:0, flush:1
                                                MOD/WRITE: full: 0, chunk_alloc:0, flush:0
                                                spin_unlock(&space_info->lock);
                                                return;
  MOD/WRITE: full:0, chunk_alloc:0, flush:1

and now data_sinfo->flush is 1 but the reclaim worker has exited. This
breaks the invariant that flush is 0 iff there is no work queued or
running. Once this invariant is violated, future allocations that go
into __reserve_bytes() will add tickets to space_info->tickets but will
see space_info->flush is set to 1 and not queue the work. After this,
they will block forever on the resulting ticket, as it is now impossible
to kick the worker again.

I also confirmed by looking at the assembly of the affected kernel that
it is doing RMW operations. For example, to set the flush (3rd) bit to 0,
the assembly is:
  andb    $0xfb,0x60(%rbx)
and similarly for setting the full (1st) bit to 0:
  andb    $0xfe,-0x20(%rax)

So I think this is really a bug on practical systems.  I have observed
a number of systems in this exact state, but am currently unable to
reproduce it.

Rather than leaving this footgun lying around for the future, take
advantage of the fact that there is room in the struct anyway, and that
it is already quite large and simply change the three bitfield members to
bools. This avoids writes to space_info->full having any effect on
writes to space_info->flush, regardless of locking.

Fixes: 957780eb27 ("Btrfs: introduce ticketed enospc infrastructure")
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
2025-11-24 21:38:49 +01:00
Linus Torvalds f3827213ab for-6.18-tag
-----BEGIN PGP SIGNATURE-----
 
 iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmjTk4MACgkQxWXV+ddt
 WDvOHA//ajYvH7DIoFgQ09Q+UCfdawhWs/b4aW2ePpNK061tF6hvGgmGVe/Ugy8W
 297kSBVxpnaLfedHkm3m91SAft6VKSfdV3oV2DNn9sxUXQoa9hC6n9qIaqeOpfd8
 Nk+OvgSWpqonAHHMbsNev4C+vKZO534VRg09eFfIV7ATpQO7wxc1DKXFT5hgYP3m
 nosRc0f/4gx0EGHjiXyfuG5una1A/vry4+EP7jrvzvKHY9VzYMLRXH+glNUi5X5E
 GOwFXd6ADUpKDKN9Ove/Bm4DSz9jrTNu81qm/1i1mTpxS80sxBFIrD4KOil+hQDX
 B82n01KS8yJkBYH32Qnpg+9Cij/ZR/0OOg88wBLGeQiDoDw7J8D9mJe1/RHWHHTC
 rQ1C50CDlVGIPpnB1BftbvvdYlAPKgpnnznaaKg9Mdy3T5FtFQ3MqwZYOW/jubtY
 Zo7shxrDjSvPb7MHG6GlLBNxZ8JXXGyc+seEfjZ8iiEeMGsE9vIQ1L18c0GZSmgc
 /m/nQV/akycoNg/9J84HqClGLUWUApdMPaXrvOwC5CjpgOgJZ+rdUqhexqcNwmsl
 O+s9fwQidtAr5fAgl6SjwqaPauqBd4VSybs7IkGbz+zyaZeRdWo5gsg5t5Hjuyd5
 gJiIAztzI8bOPI1T/EheGVwSkmJTEkhnJDQvMRQcpEpo5D5K3YY=
 =9wY+
 -----END PGP SIGNATURE-----

Merge tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux

Pull btrfs updates from David Sterba:
 "There are no new features, the changes are in the core code, notably
  tree-log error handling and reporting improvements, and initial
  support for block size > page size.

  Performance improvements:

   - search data checksums in the commit root (previous transaction) to
     avoid locking contention, this improves parallelism of read
     heavy/low write workloads, and also reduces transaction commit
     time; on real and reproducer workload the sync time went from
     minutes to tens of seconds (workload and numbers are in the
     changelog)

  Core:

   - tree-log updates:
      - error handling improvements, transaction aborts
      - add new error state 'O' (printed in status messages) when log
        replay fails and is aborted
      - reduced number of btrfs_path allocations when traversing the
        tree

   - 'block size > page size' support
      - basic implementation with limitations, under experimental build
      - limitations: no direct io, raid56, encoded read (standalone and
        in send ioctl), encoded write
      - preparatory work for compression, removing implicit assumptions
        of page and block sizes
      - compression workspaces are now per-filesystem, we cannot assume
        common block size for work memory among different filesystems

   - tree-checker now verifies INODE_EXTREF item (which is implementing
     hardlinks)

   - tree leaf pretty printer updates, there were missing data from
     items, keys/items

   - move config option CONFIG_BTRFS_REF_VERIFY to CONFIG_BTRFS_DEBUG,
     it's a debugging feature and not needed to be enabled separately

   - more struct btrfs_path auto free updates

   - use ref_tracker API for tracking delayed inodes, enabled by mount
     option 'ref_verify', allowing to better pinpoint leaking references

   - in zoned mode, avoid selecting data relocation zoned for ordinary
     data block groups

   - updated and enhanced error messages

   - lots of cleanups and refactoring"

* tag 'for-6.18-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (113 commits)
  btrfs: use smp_mb__after_atomic() when forcing COW in create_pending_snapshot()
  btrfs: add unlikely annotations to branches leading to transaction abort
  btrfs: add unlikely annotations to branches leading to EIO
  btrfs: add unlikely annotations to branches leading to EUCLEAN
  btrfs: more trivial BTRFS_PATH_AUTO_FREE conversions
  btrfs: zoned: don't fail mount needlessly due to too many active zones
  btrfs: use kmalloc_array() for open-coded arithmetic in kmalloc()
  btrfs: enable experimental bs > ps support
  btrfs: add extra ASSERT()s to catch unaligned bios
  btrfs: fix symbolic link reading when bs > ps
  btrfs: prepare scrub to support bs > ps cases
  btrfs: prepare zlib to support bs > ps cases
  btrfs: prepare lzo to support bs > ps cases
  btrfs: prepare zstd to support bs > ps cases
  btrfs: prepare compression folio alloc/free for bs > ps cases
  btrfs: fix the incorrect max_bytes value for find_lock_delalloc_range()
  btrfs: remove pointless key offset setup in create_pending_snapshot()
  btrfs: annotate btrfs_is_testing() as unlikely and make it return bool
  btrfs: make the rule checking more readable for should_cow_block()
  btrfs: simplify inline extent end calculation at replay_one_extent()
  ...
2025-09-30 08:14:49 -07:00