The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation
has been fixed so it waits for Read completion in a way that avoids
tying up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and
testers.
Merge tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"The bulk of the patches for this release are clean-ups and minor bug
fixes.
There is one significant revert to mention: support for RDMA Read
operations in the server's RPC-over-RDMA transport implementation has
been fixed so it waits for Read completion in a way that avoids tying
up an nfsd thread. This prevents a possible DoS vector if an
RPC-over-RDMA client should become unresponsive during RDMA Read
operations.
As always I am grateful to NFSD contributors, reviewers, and testers"
* tag 'nfsd-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (56 commits)
nfsd: rename nfsd_last_thread() to nfsd_destroy_serv()
SUNRPC: discard sv_refcnt, and svc_get/svc_put
svc: don't hold reference for poolstats, only mutex.
SUNRPC: remove printk when back channel request not found
svcrdma: Implement multi-stage Read completion again
svcrdma: Copy construction of svc_rqst::rq_arg to rdma_read_complete()
svcrdma: Add back svcxprt_rdma::sc_read_complete_q
svcrdma: Add back svc_rdma_recv_ctxt::rc_pages
svcrdma: Clean up comment in svc_rdma_accept()
svcrdma: Remove queue-shortening warnings
svcrdma: Remove pointer addresses shown in dprintk()
svcrdma: Optimize svc_rdma_cc_init()
svcrdma: De-duplicate completion ID initialization helpers
svcrdma: Move the svc_rdma_cc_init() call
svcrdma: Remove struct svc_rdma_read_info
svcrdma: Update the synopsis of svc_rdma_read_special()
svcrdma: Update the synopsis of svc_rdma_read_call_chunk()
svcrdma: Update synopsis of svc_rdma_read_multiple_chunks()
svcrdma: Update synopsis of svc_rdma_copy_inline_range()
svcrdma: Update the synopsis of svc_rdma_read_data_item()
...
Merge tag 'afs-fix-rotation-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs
Pull afs updates from David Howells:
"The majority of the patches are aimed at fixing and improving the AFS
filesystem's rotation over server IP addresses, but there are also
some fixes from Oleg Nesterov for the use of read_seqbegin_or_lock().
- Fix fileserver probe handling so that the next round of probes
doesn't break ongoing server/address rotation by clearing all the
probe result tracking. This could occasionally cause the rotation
algorithm to drop straight through and give a 'successful' result
without actually emitting any RPC calls, leaving the reply buffer
in an undefined state.
Instead, detach the probe results into a separate struct and
allocate a new one each time we start probing and update the
pointer to it. Probes are also sent in order of address preference
to try and improve the chance that the preferred one will complete
first.
- Fix server rotation so that it uses configurable address
preferences across the probes that have completed so far rather
than ranking them by RTT, as the latter doesn't necessarily give
the best route. The preference list can be altered by writing into
/proc/net/afs/addr_prefs.
- Fix the handling of Read-Only (and Backup) volume callbacks as
there is one per volume, not one per file, so if someone performs a
command that, say, offlines the volume but doesn't change it, when
it comes back online we don't spam the server with a status fetch
for every vnode we're using. Instead, check the Creation timestamp
in the VolSync record when prompted by a callback break.
- Handle volume regression (ie. a RW volume being restored from a
backup) by scrubbing all cache data for that volume. This is
detected from the VolSync creation timestamp.
- Adjust abort handling and abort -> error mapping to match better
with what other AFS clients do.
- Fix offline and busy volume state handling as they only apply to
individual server instances and not entire volumes and the rotation
algorithm should go and look at other servers if available. Also
make it sleep briefly before each retry if all the volume instances
are unavailable"
* tag 'afs-fix-rotation-20240105' of git://git.kernel.org/pub/scm/linux/kernel/git/dhowells/linux-fs: (40 commits)
afs: trace: Log afs_make_call(), including server address
afs: Fix offline and busy message emission
afs: Fix fileserver rotation
afs: Overhaul invalidation handling to better support RO volumes
afs: Parse the VolSync record in the reply of a number of RPC ops
afs: Don't leave DONTUSE/NEWREPSITE servers out of server list
afs: Fix comment in afs_do_lookup()
afs: Apply server breaks to mmap'd files in the call processor
afs: Move the vnode/volume validity checking code into its own file
afs: Defer volume record destruction to a workqueue
afs: Make it possible to find the volumes that are using a server
afs: Combine the endpoint state bools into a bitmask
afs: Keep a record of the current fileserver endpoint state
afs: Dispatch vlserver probes in priority order
afs: Dispatch fileserver probes in priority order
afs: Mark address lists with configured priorities
afs: Provide a way to configure address priorities
afs: Remove the unimplemented afs_cmp_addr_list()
afs: Add some more info to /proc/net/afs/servers
rxrpc: Create a procfile to display outstanding client conn bundles
...
Merge tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"There are no exciting changes for users, it's been mostly API
conversions and some fixes or refactoring.
The mount API conversion is a base for future improvements that would
come with VFS. Metadata processing has been converted to folios, not
yet enabling the large folios but it's one patch away once everything
gets tested enough.
Core changes:
- convert extent buffers to folios:
- direct API conversion where possible
- performance can drop by a few percent on metadata heavy
workloads, the folio sizes are not constant and the calculations
add up in the item helpers
- both regular and subpage modes
- data cannot be converted yet, we need to port that to iomap and
there are some other generic changes required
- convert mount to the new API, should not be user visible:
- options deprecated long time ago have been removed: inode_cache,
recovery
- the new logic that splits mount to two phases slightly changes
timing of device scanning for multi-device filesystems
- LSM options will now work (like for selinux)
- convert delayed nodes radix tree to xarray, preserving the
preload-like logic that still allows to allocate with GFP_NOFS
- more validation of sysfs value of scrub_speed_max
- refactor chunk map structure, reduce size and improve performance
- extent map refactoring, smaller data structures, improved
performance
- reduce size of struct extent_io_tree, embedded in several
structures
- temporary pages used for compression are cached and attached to a
shrinker, this may slightly improve performance
- in zoned mode, remove redirty extent buffer tracking, zeros are
written in case an out-of-order is detected and proper data are
written to the actual write pointer
- cleanups, refactoring, error message improvements, updated tests"
* tag 'for-6.8-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (89 commits)
btrfs: pass btrfs_io_geometry into btrfs_max_io_len
btrfs: pass struct btrfs_io_geometry to set_io_stripe
btrfs: open code set_io_stripe for RAID56
btrfs: change block mapping to switch/case in btrfs_map_block
btrfs: factor out block mapping for single profiles
btrfs: factor out block mapping for RAID5/6
btrfs: reduce scope of data_stripes in btrfs_map_block
btrfs: factor out block mapping for RAID10
btrfs: factor out block mapping for DUP profiles
btrfs: factor out RAID1 block mapping
btrfs: factor out block-mapping for RAID0
btrfs: re-introduce struct btrfs_io_geometry
btrfs: factor out helper for single device IO check
btrfs: migrate btrfs_repair_io_failure() to folio interfaces
btrfs: migrate eb_bitmap_offset() to folio interfaces
btrfs: migrate various end io functions to folios
btrfs: migrate subpage code to folio interfaces
btrfs: migrate get_eb_page_index() and get_eb_offset_in_page() to folios
btrfs: don't double put our subpage reference in alloc_extent_buffer
btrfs: cleanup metadata page pointer usage
...
Many singleton patches against the MM code. The patch series which
are included in this merge do the following:
- Peng Zhang has done some mapletree maintenance work in the
series
"maple_tree: add mt_free_one() and mt_attr() helpers"
"Some cleanups of maple tree"
- In the series "mm: use memmap_on_memory semantics for dax/kmem"
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few
fixes) in the patch series
"Add folio_zero_tail() and folio_fill_tail()"
"Make folio_start_writeback return void"
"Fix fault handler's handling of poisoned tail pages"
"Convert aops->error_remove_page to ->error_remove_folio"
"Finish two folio conversions"
"More swap folio conversions"
- Kefeng Wang has also contributed folio-related work in the series
"mm: cleanup and use more folio in page fault"
- Jim Cromie has improved the kmemleak reporting output in the
series "tweak kmemleak report format".
- In the series "stackdepot: allow evicting stack traces" Andrey
Konovalov permits clients (in this case KASAN) to cause
eviction of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series "mm:
page_alloc: fixes for high atomic reserve caluculations".
- Dmitry Rokosov has added to the samples/ directory some sample
code for a userspace memcg event listener application. See the
series "samples: introduce cgroup events listeners".
- Some mapletree maintenance work from Liam Howlett in the series
"maple_tree: iterator state changes".
- Nhat Pham has improved zswap's approach to writeback in the
series "workload-specific and memory pressure-driven zswap
writeback".
- DAMON/DAMOS feature and maintenance work from SeongJae Park in
the series
"mm/damon: let users feed and tame/auto-tune DAMOS"
"selftests/damon: add Python-written DAMON functionality tests"
"mm/damon: misc updates for 6.8"
- Yosry Ahmed has improved memcg's stats flushing in the series
"mm: memcg: subtree stats flushing and thresholds".
- In the series "Multi-size THP for anonymous memory" Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against the buffer_head code in the series "More buffer_head
cleanups".
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
"userfaultfd move option". UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY's alloc/copy/free.
- Stefan Roesch has developed a "KSM Advisor", in the series
"mm/ksm: Add ksm advisor". This is a governor which tunes KSM's
scanning aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory
use in the series "mm/zswap: dstmem reuse optimizations and
cleanups".
- Matthew Wilcox has performed some maintenance work on the
writeback code, both in core code and within filesystems. The series is
"Clean up the writeback paths".
- Andrey Konovalov has optimized KASAN's handling of alloc and
free stack traces for secondary-level allocators, in the series
"kasan: save mempool stack traces".
- Andrey also performed some KASAN maintenance work in the series
"kasan: assorted clean-ups".
- David Hildenbrand has gone to town on the rmap code. Cleanups,
more pte batching, folio conversions and more. See the series
"mm/rmap: interface overhaul".
- Kinsey Ho has contributed some maintenance work on the MGLRU
code in the series "mm/mglru: Kconfig cleanup".
- Matthew Wilcox has contributed lruvec page accounting code
cleanups in the series "Remove some lruvec page accounting
functions".
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintenance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ directory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintenance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against the buffer_head code in the series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY's alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both in core code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
- Energy scheduling:
- Consolidate how the max compute capacity is
used in the scheduler and how we calculate
the frequency for a level of utilization.
- Rework interface between the scheduler and
the schedutil governor
- Simplify the util_est logic
- Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE
starvation of low priority tasks (e.g., SCHED_OTHER)
when higher priority tasks monopolize CPU
cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are
not introduced yet.
- EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
- NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more,
to better distribute the probability
of a particular vma getting scanned.
- Plus misc fixes, cleanups and updates.
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Energy scheduling:
- Consolidate how the max compute capacity is used in the scheduler
and how we calculate the frequency for a level of utilization.
- Rework interface between the scheduler and the schedutil governor
- Simplify the util_est logic
Deadline scheduler:
- Work more towards reducing SCHED_DEADLINE starvation of low
priority tasks (e.g., SCHED_OTHER) when higher priority tasks
monopolize CPU cycles, via the introduction of 'deadline servers'
(nested/2-level scheduling).
"Fair servers" to make use of this facility are not introduced yet.
EEVDF:
- Introduce O(1) fastpath for EEVDF task selection
NUMA balancing:
- Tune the NUMA-balancing vma scanning logic some more, to better
distribute the probability of a particular vma getting scanned.
Plus misc fixes, cleanups and updates"
* tag 'sched-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (30 commits)
sched/fair: Fix tg->load when offlining a CPU
sched/fair: Remove unused 'next_buddy_marked' local variable in check_preempt_wakeup_fair()
sched/fair: Use all little CPUs for CPU-bound workloads
sched/fair: Simplify util_est
sched/fair: Remove SCHED_FEAT(UTIL_EST_FASTUP, true)
arm64/amu: Use capacity_ref_freq() to set AMU ratio
cpufreq/cppc: Set the frequency used for computing the capacity
cpufreq/cppc: Move and rename cppc_cpufreq_{perf_to_khz|khz_to_perf}()
energy_model: Use a fixed reference frequency
cpufreq/schedutil: Use a fixed reference frequency
cpufreq: Use the fixed and coherent frequency for scaling capacity
sched/topology: Add a new arch_scale_freq_ref() method
freezer,sched: Clean saved_state when restoring it during thaw
sched/fair: Update min_vruntime for reweight_entity() correctly
sched/doc: Update documentation after renames and synchronize Chinese version
sched/cpufreq: Rework iowait boost
sched/cpufreq: Rework schedutil governor performance estimation
sched/pelt: Avoid underestimation of task utilization
sched/timers: Explain why idle task schedules out on remote timer enqueue
sched/cpuidle: Comment about timers requirements VS idle handler
...
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration model
series (which will replace the current 'push timers at enqueue' migration
model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify
and micro-optimize get_next_timer_interrupt()
- Restructure get_next_timer_interrupt()'s idle logic
for better readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Merge tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull timer subsystem updates from Ingo Molnar:
- Various preparatory cleanups & enhancements of the timer-wheel code,
in preparation for the WIP 'pull timers at expiry' timer migration
model series (which will replace the current 'push timers at enqueue'
migration model), by Anna-Maria Behnsen:
- Update comments and clean up confusing variable names
- Add debug check to warn about time travel
- Improve/expand timer-wheel tracepoints
- Optimize away unnecessary IPIs for deferrable timers
- Restructure & clean up next_expiry_recalc()
- Clean up forward_timer_base()
- Introduce __forward_timer_base() and use it to simplify and
micro-optimize get_next_timer_interrupt()
- Restructure get_next_timer_interrupt()'s idle logic for better
readability and to enable a minor optimization.
- Fix the nextevt calculation when no timers are pending
- Fix the sysfs_get_uname() prototype declaration
* tag 'timers-core-2024-01-08' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
timers: Fix nextevt calculation when no timers are pending
timers: Rework idle logic
timers: Use already existing function for forwarding timer base
timers: Split out forward timer base functionality
timers: Clarify check in forward_timer_base()
timers: Move store of next event into __next_timer_interrupt()
timers: Do not IPI for deferrable timers
tracing/timers: Add tracepoint for tracking timer base is_idle flag
tracing/timers: Enhance timer_start tracepoint
tick-sched: Warn when next tick seems to be in the past
tick/sched: Cleanup confusing variables
tick-sched: Fix function names in comments
time: Make sysfs_get_uname() function visible in header
Once a set of RDMA Reads are complete, the Read completion handler
will poke the transport to trigger a second call to
svc_rdma_recvfrom(). recvfrom() will then merge the RDMA Read
payloads with the previously received RPC header to form a completed
RPC Call message.
The new code is copied from the svc_rdma_process_read_list() path.
A subsequent patch will make use of this code and remove the code
that this was copied from (svc_rdma_rw.c).
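The two-stage receive described above can be illustrated with a minimal
userspace sketch. This is not the svcrdma code: the names
recv_ctxt_sketch, read_complete_sketch() and recvfrom_sketch() are
hypothetical, and the buffers are plain arrays rather than pages. It only
shows the shape of the idea: the first call returns without a message
while Reads are outstanding, and the call made after completion merges
the header with the Read payload.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define SKETCH_MAX 64

/* Hypothetical receive context: the RPC header arrives first and the
 * bulk payload is pulled in later by RDMA Read. */
struct recv_ctxt_sketch {
	char hdr[SKETCH_MAX];
	size_t hdr_len;
	char payload[SKETCH_MAX];
	size_t payload_len;
	bool needs_read;	/* message carries Read chunks */
	bool read_done;		/* set by the completion handler */
};

/* Stand-in for the Read completion handler: record the payload and
 * mark the context ready for the second receive call. */
static void read_complete_sketch(struct recv_ctxt_sketch *c,
				 const char *data, size_t len)
{
	assert(len <= SKETCH_MAX);
	memcpy(c->payload, data, len);
	c->payload_len = len;
	c->read_done = true;
}

/* Returns the length of the assembled message, or 0 if the Reads are
 * still outstanding and the caller should go and do something else. */
static size_t recvfrom_sketch(const struct recv_ctxt_sketch *c,
			      char *out, size_t out_len)
{
	size_t total;

	if (c->needs_read && !c->read_done)
		return 0;	/* stage one: do not block the thread */

	total = c->hdr_len + (c->needs_read ? c->payload_len : 0);
	assert(total <= out_len);

	/* stage two: header first, then the Read payload */
	memcpy(out, c->hdr, c->hdr_len);
	if (c->needs_read)
		memcpy(out + c->hdr_len, c->payload, c->payload_len);
	return total;
}
```

In this sketch the thread that made the first call is free to service
other work; nothing waits inside recvfrom_sketch() for the Reads.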
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
A send/recv_ctxt already records transport-related information
in the cq.id, thus there is no need to record the IP addresses of
the transport endpoints.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Update the DMA error flow tracepoints to report the completion ID of
the failing context. This ties the wait/failure to a particular
operation or request, which is more useful than knowing only the
failing transport.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Update the Send Queue's error flow tracepoints to report the
completion ID of the waiting or failing context. This ties the
wait/failure to a particular operation or request, which is a little
more useful than knowing only the transport that is about to close.
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
Add a tracepoint to log calls to afs_make_call(), including the destination
server address.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Fix the fileserver rotation so that it doesn't use RTT as the basis for
deciding which server and address to use as this doesn't necessarily give a
good indication of the best path. Instead, use the configurable preference
list in conjunction with whatever probes have succeeded at the time of
looking.
To this end, make the following changes:
(1) Keep an array of "server states" to track what addresses we've tried
on each server and move the waitqueue entries there that we'll need
for probing.
(2) Each afs_server_state struct is made to pin the corresponding server's
endpoint state rather than the afs_operation struct carrying a pin on
the server we're currently looking at.
(3) Drop the server list preference; we now always rescan the server list.
(4) afs_wait_for_probes() now uses the server state list to guide it in
what it waits for (and to provide the waitqueue entries) and returns
an indication of whether we'd got a response, run out of responsive
addresses or the endpoint state had been superseded and we need to
restart the iteration.
(5) Call afs_get_address_preferences*() occasionally to refresh the
preference values.
(6) When picking a server, scan the addresses of the servers for which we
have as-yet untested communications, looking for the highest priority
one and use that instead of trying all the addresses for a particular
server in ascending-RTT order.
(7) When a Busy or Offline state is seen across all available servers, do
a short sleep.
(8) If we detect that we accessed a future RO volume version whilst it is
undergoing replication, reissue the op against the older version until
at least half of the servers are replicated.
(9) Whilst RO replication is ongoing, increase the frequency of Volume
Location server checks for that volume to every ten minutes instead of
hourly.
Also add a tracepoint to track progress through the rotation algorithm.
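As a rough illustration of point (6), the selection can be sketched in
userspace C as below. This is a simplified model, not the afs code:
struct addr_sketch and pick_addr_sketch() are hypothetical names, and it
assumes that a larger prio value means a more preferred address. RTT is
carried in the struct only to show that it plays no part in the ranking.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for one fileserver address. */
struct addr_sketch {
	unsigned int prio;	/* configured preference (assumed: higher wins) */
	unsigned int rtt_us;	/* measured RTT; deliberately not used below */
	bool responded;		/* a probe to this address has completed */
	bool tried;		/* this operation has already used the address */
};

/*
 * Return the index of the highest-priority address that has answered a
 * probe and has not been tried yet, or -1 if no such address remains.
 */
static int pick_addr_sketch(const struct addr_sketch *addrs, size_t nr)
{
	int best = -1;
	size_t i;

	assert(addrs || nr == 0);

	for (i = 0; i < nr; i++) {
		if (!addrs[i].responded || addrs[i].tried)
			continue;
		if (best < 0 || addrs[i].prio > addrs[best].prio)
			best = (int)i;
	}
	return best;
}
```

A caller would mark the returned entry as tried before issuing the RPC,
so that a retry moves on to the next most preferred address that has
responded.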
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Overhaul the third party-induced invalidation handling, making use of the
previously added volume-level event counters (cb_scrub and cb_ro_snapshot)
that are now being parsed out of the VolSync record returned by the
fileserver in many of its replies.
This allows better handling of RO (and Backup) volumes. Since these are
snapshots of a RW volume that are updated atomically and simultaneously across
all servers that host them, they only require a single callback promise for
the entire volume. The currently upstream code assumes that RO volumes
operate in the same manner as RW volumes, and that each file has its own
individual callback - which means that it does a status fetch for *every*
file in a RO volume, whether or not the volume got "released" (volume
callback breaks can occur for other reasons too, such as the volumeserver
taking ownership of a volume from a fileserver).
To this end, make the following changes:
(1) Change the meaning of the volume's cb_v_break counter so that it is
now a hint that we need to issue a status fetch to work out the state
of a volume. cb_v_break is incremented by volume break callbacks and
by server initialisation callbacks.
(2) Add a second counter, cb_v_check, to the afs_volume struct such that
if this differs from cb_v_break, we need to do a check. When the
check is complete, cb_v_check is advanced to what cb_v_break was at
the start of the status fetch.
(3) Move the list of mmap'd vnodes to the volume and trigger removal of
PTEs that map to files on a volume break rather than on a server
break.
(4) When a server reinitialisation callback comes in, use the
server-to-volume reverse mapping added in a preceding patch to iterate
over all the volumes using that server and clear the volume callback
promises for that server and the general volume promise as a whole to
trigger reanalysis.
(5) Replace the AFS_VNODE_CB_PROMISED flag with an AFS_NO_CB_PROMISE
(TIME64_MIN) value in the cb_expires_at field, reducing the number of
checks we need to make.
(6) Change afs_check_validity() to quickly see if various event counters
have been incremented or if the vnode or volume callback promise is
due to expire/has expired without making any changes to the state.
That is now left to afs_validate() as this may get more complicated in
future as we may have to examine server records too.
(7) Overhaul afs_validate() so that it does a single status fetch if we
need to check the state of either the vnode or the volume - and do so
under appropriate locking. The function does the following steps:
(A) If the vnode/volume is no longer seen as valid, then we take the
vnode validation lock and, if the volume promise has expired, the
volume check lock also. The latter prevents redundant checks being
made to find out if a new version of the volume got released.
(B) If a previous RPC call found that the volsync changed unexpectedly
or that a RO volume was updated, then we unmap all PTEs pointing to
the file to stop mmap being used for access.
(C) If the vnode is still seen to be of uncertain validity, then we
perform an FS.FetchStatus RPC op to jointly update the volume status
and the vnode status. This assessment is done as part of parsing the
reply:
If the RO volume creation timestamp advances, cb_ro_snapshot is
incremented; if either the creation or update timestamps changes in
an unexpected way, the cb_scrub counter is incremented.
If the Data Version returned doesn't match the copy we have
locally, then we ask for the pagecache to be zapped. This takes
care of handling RO update.
(D) If cb_scrub differs between volume and vnode, the vnode's
pagecache is zapped and the vnode's cb_scrub is updated unless the
file is marked as having been deleted.
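The interplay of the two volume counters from points (1) and (2) can be sketched as below. This is a minimal single-threaded model, assuming nothing about the real locking; vol_cb_state and the helper names are hypothetical. The point it shows is that cb_v_check advances to the value sampled at the start of the fetch, so a break that arrives while the fetch is in flight still leaves the volume marked as needing another check.

```c
#include <stdbool.h>

/* Hypothetical stand-in for the two counters kept per volume. */
struct vol_cb_state {
	unsigned int cb_v_break;	/* bumped by volume break and server init callbacks */
	unsigned int cb_v_check;	/* cb_v_break value covered by the last completed check */
};

/* A check is needed whenever the counters differ. */
static bool vol_needs_check(const struct vol_cb_state *v)
{
	return v->cb_v_check != v->cb_v_break;
}

/* Sample the break counter before issuing the status fetch. */
static unsigned int vol_check_begin(const struct vol_cb_state *v)
{
	return v->cb_v_break;
}

/* On completion, advance only to the value sampled at the start. */
static void vol_check_end(struct vol_cb_state *v, unsigned int sampled)
{
	v->cb_v_check = sampled;
}
```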
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
A number of fileserver RPC operations return a VolSync record as part of
their reply that gives some information about the state of the volume being
accessed, including:
(1) A volume Creation timestamp. For an RW volume, this is the time at
which the volume was created; if it changes, the RW volume was
presumably restored from a backup and all cached data should be
scrubbed as Data Version numbers could regress on the files in the
volume.
For an RO volume, this is the time it was last snapshotted from the RW
volume. It is expected to advance each time this happens; if it
regresses, cached data should be scrubbed.
(2) A volume Update timestamp (Auristor only). For an RW volume, this is
updated any time any change is made to a volume or its contents. If
it regresses, all cached data must be scrubbed.
For an RO volume, this is a copy of the RW volume's Update timestamp
at the point of snapshotting. It can be used as a version number when
checking to see if a callback on a RO volume was due to a snapshot.
If it regresses, all cached data must be scrubbed.
but this is currently not made use of by the in-kernel afs filesystem.
Make the afs filesystem use this by:
(1) Add an update time field to the afs_volsync struct and use a value of
TIME64_MIN in both that and the creation time to indicate that they
are unset.
(2) Add creation and update time fields to the afs_volume struct and use
this to track the two timestamps.
(3) Add a volsync_lock mutex to the afs_volume struct to control
modification access for when we detect a change in these values.
(4) Add a 'pre-op volsync' struct to the afs_operation struct to record
the state of the volume tracking before the op.
(5) Add a new counter, cb_scrub, to the afs_volume struct to count events
that require all data to be scrubbed. A copy is placed in the
afs_vnode struct (inode) and if they no longer match, a scrub takes
place.
(6) When the result of an operation is being parsed, parse the VolSync
data too, if it is provided. Note that the two timestamps are handled
separately, since they don't work in quite the same way.
- If the afs_volume tracking is unset, just set it and do nothing
else.
- If the result timestamps are the same as the ones in afs_volume, do
nothing.
- If the timestamps regress, increment cb_scrub if not already done
so.
- If the creation timestamp on a RW volume changes, increment cb_scrub
if not already done so.
- If the creation timestamp on a RO volume advances, update the server
list and see if the current server has been excluded, if so reissue
the op. Once over half of the replication sites have been updated,
increment cb_ro_snapshot to indicate updates may be required and
switch over to excluding unupdated replication sites.
- If the creation timestamp on a Backup volume advances, just
increment cb_ro_snapshot to trigger updates.
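The rules above for the creation timestamp can be read as a small decision function. The sketch below is illustrative only: the enum names and classify_creation() are hypothetical, INT64_MIN stands in for TIME64_MIN, and the replication-site handling for RO volumes is reduced to a single outcome.

```c
#include <stdint.h>

#define TS_UNSET INT64_MIN	/* stands in for TIME64_MIN ("unset") */

enum vol_type { VOL_RW, VOL_RO, VOL_BACKUP };

enum volsync_action {
	VS_RECORD,	/* tracking was unset: store the value, nothing else */
	VS_NOTHING,	/* timestamp unchanged */
	VS_SCRUB,	/* regression, or RW creation change: bump cb_scrub */
	VS_RO_ADVANCED,	/* RO snapshot advanced: do replication handling */
	VS_SNAPSHOT,	/* Backup advanced: bump cb_ro_snapshot */
};

/*
 * Classify a reported creation timestamp against the tracked one.
 * Assumes 'reported' is a valid (set) timestamp from a VolSync record.
 */
static enum volsync_action classify_creation(enum vol_type type,
					     int64_t tracked, int64_t reported)
{
	if (tracked == TS_UNSET)
		return VS_RECORD;
	if (reported == tracked)
		return VS_NOTHING;
	if (reported < tracked)
		return VS_SCRUB;
	/* From here on the timestamp has advanced. */
	switch (type) {
	case VOL_RW:
		return VS_SCRUB;
	case VOL_RO:
		return VS_RO_ADVANCED;
	default:
		return VS_SNAPSHOT;
	}
}
```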
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Apply server breaks to mmap'd files that are being used from that server
from the call processor work function rather than punting it off to a
workqueue. The work item, afs_server_init_callback(), then bumps each
individual inode off to its own work item introducing a potentially lengthy
delay. This reduces that delay at the cost of extending the amount of time
we delay replying to the CB.InitCallBack3 notification RPC from the server.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Keep a record of the current fileserver endpoint state, including the probe
state, and replace it when a new probe is started rather than just
squelching the old state and overwriting it. Clearance of the old state
can cause a race if there's another thread also currently trying to
communicate with that server.
It appears that this race might be the culprit for some occasions where
kafs complains about invalid data in the RPC reply because the rotation
algorithm fell all the way through without actually issuing an RPC call and
the error return got filled in from the probe state (which has a zero error
recorded). Whatever happens to be in the caller's reply buffer is then
taken as the response.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When probing all the addresses for a volume location server, dispatch them
in order of descending priority to try and get back highest priority one
first.
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
When probing all the addresses for a fileserver, dispatch them in order of
descending priority to try and get back highest priority one first.
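One way to picture the dispatch order is to sort the addresses by descending priority before sending the probes. The following is a userspace sketch using qsort(); probe_addr and order_probes() are hypothetical names, and the tie-break on the original index is an assumption added to make the order deterministic.

```c
#include <stdlib.h>

/* Hypothetical record: position in the address list plus its priority. */
struct probe_addr {
	unsigned int index;
	unsigned int prio;	/* higher value means more preferred */
};

/* Sort by descending priority; break ties by ascending index. */
static int cmp_prio_desc(const void *a, const void *b)
{
	const struct probe_addr *x = a, *y = b;

	if (x->prio != y->prio)
		return (x->prio < y->prio) - (x->prio > y->prio);
	return (x->index > y->index) - (x->index < y->index);
}

/* Reorder the array into the order in which probes would be dispatched. */
static void order_probes(struct probe_addr *addrs, size_t n)
{
	qsort(addrs, n, sizeof(*addrs), cmp_prio_desc);
}
```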
Also add a tracepoint to show the transmission and completion of the
probes.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
This adds a new tracepoint for the ksm advisor. It reports the last scan
time, the new setting of the pages_to_scan parameter and the average cpu
percent usage of the ksmd background thread for the last scan.
Link: https://lkml.kernel.org/r/20231218231054.1625219-4-shr@devkernel.io
Signed-off-by: Stefan Roesch <shr@devkernel.io>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Rik van Riel <riel@surriel.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Fold the afs_addr_cursor struct into the afs_operation struct and the
afs_vl_cursor struct and fold its operations into their callers also.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Add a tracepoint to track the lifetime of the afs_addr_list struct.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Change rxrpc's API such that:
(1) A new function, rxrpc_kernel_lookup_peer(), is provided to look up an
rxrpc_peer record for a remote address and a corresponding function,
rxrpc_kernel_put_peer(), is provided to dispose of it again.
(2) When setting up a call, the rxrpc_peer object used during a call is
now passed in rather than being set up by rxrpc_connect_call(). For
afs, this meant passing it to rxrpc_kernel_begin_call() rather than
the full address (the service ID then has to be passed in as a
separate parameter).
(3) A new function, rxrpc_kernel_remote_addr(), is added so that afs can
get a pointer to the transport address for display purposes, and
another, rxrpc_kernel_remote_srx(), to gain a pointer to the full
rxrpc address.
(4) The function to retrieve the RTT from a call, rxrpc_kernel_get_srtt(),
is then altered to take a peer. This now returns the RTT or -1 if
there are insufficient samples.
(5) Rename rxrpc_kernel_get_peer() to rxrpc_kernel_call_get_peer().
(6) Provide a new function, rxrpc_kernel_get_peer(), to get a ref on a
peer the caller already has.
This allows the afs filesystem to pin the rxrpc_peer records that it is
using, allowing faster lookups and pointer comparisons rather than
comparing sockaddr_rxrpc contents. It also makes it easier to get hold of
the RTT. The following changes are made to afs:
(1) The addr_list struct's addrs[] elements now hold a peer struct pointer
and a service ID rather than a sockaddr_rxrpc.
(2) When displaying the transport address, rxrpc_kernel_remote_addr() is
used.
(3) The port arg is removed from afs_alloc_addrlist() since it's always
overridden.
(4) afs_merge_fs_addr4() and afs_merge_fs_addr6() do peer lookup and may
now return an error that must be handled.
(5) afs_find_server() now takes a peer pointer to specify the address.
(6) afs_find_server(), afs_compare_fs_alists() and afs_merge_fs_addr[46]()
now do peer pointer comparison rather than address comparison.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: linux-afs@lists.infradead.org
Automatically generate trace tag enums from the symbol -> string mapping
tables rather than having the enums as well, thereby reducing duplicated
data.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
checkpatch objects to whitespace before ')', so remove most of it from the
afs trace header.
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: Jeff Layton <jlayton@kernel.org>
cc: linux-afs@lists.infradead.org
cc: linux-fsdevel@vger.kernel.org
When debugging timer code the timer tracepoints are very important. There
is no tracepoint when the is_idle flag of the timer base changes. Instead
of always adding manually trace_printk(), add tracepoints which can be
easily enabled whenever required.
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-6-anna-maria@linutronix.de
For starting a timer, the timer is enqueued into a bucket of the timer
wheel. The bucket expiry is the de facto expiry of the timer but it is not
equal to the timer expiry because of increasing granularity when the bucket is in
a higher level of the wheel. To be able to figure out in a trace whether a
timer expired in time or not, the bucket expiry time is required as well.
Add bucket expiry time to the timer_start tracepoint and thereby simplify
the arguments.
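To see why the bucket expiry can differ from the timer expiry, consider a simplified model in which a wheel level with granularity 2^shift rounds the requested expiry up to the next bucket boundary. The helper names and the shift values used here are illustrative assumptions, not the kernel's actual constants.

```c
/*
 * Simplified model: round the requested expiry up to the next boundary
 * of a level whose granularity is (1 << shift) ticks, so that the timer
 * can never fire early.
 */
static unsigned long bucket_expiry(unsigned long expires, unsigned int shift)
{
	return ((expires >> shift) + 1) << shift;
}

/* How much later than requested the bucket expires, in ticks. */
static unsigned long expiry_slack(unsigned long expires, unsigned int shift)
{
	return bucket_expiry(expires, shift) - expires;
}
```

With a requested expiry of 1000 ticks, a level with shift 3 gives a bucket expiry of 1008 and a level with shift 6 gives 1024, which is the difference a trace reader needs in order to judge whether a timer fired on time.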
Signed-off-by: Anna-Maria Behnsen <anna-maria@linutronix.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/r/20231201092654.34614-5-anna-maria@linutronix.de
Currently, in struct extent_map, we use an unsigned int (32 bits) to
identify the compression type of an extent and an unsigned long (64 bits
on a 64 bits platform, 32 bits otherwise) for flags. We are only using
6 different flags, so an unsigned long is excessive and we can use flags
to identify the compression type instead of using a dedicated 32 bits
field.
We can easily have tens or hundreds of thousands (or more) of extent maps
on busy and large filesystems, especially with compression enabled or many
or large files with tons of small extents. So it's convenient to have the
extent_map structure as small as possible in order to use less memory.
So remove the compression type field from struct extent_map, use flags
to identify the compression type and shorten the flags field from an
unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and
reduces the size of the structure from 136 bytes down to 128 bytes, using
now only two cache lines, and increases the number of extent maps we can
have per 4K page from 30 to 32. By using a u32 for the flags instead of
an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(),
but that level of atomicity is not needed as most flags are never cleared
once set (before adding an extent map to the tree), and the ones that can
be cleared or set after an extent map is added to the tree, are always
performed while holding the write lock on the extent map tree, while the
reader holds a lock on the tree or tests for a flag that never changes
once the extent map is in the tree (such as compression flags).
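The idea of folding the compression type into a u32 flags word, and the per-page arithmetic quoted above, can be sketched as follows. The EM_FLAG_* names and bit positions are hypothetical and chosen for illustration; only the sizes (136 and 128 bytes, 4K pages) come from the text.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bit assignments, for illustration only. */
enum {
	EM_FLAG_PINNED		= 1u << 0,
	EM_FLAG_COMPRESS_ZLIB	= 1u << 1,
	EM_FLAG_COMPRESS_LZO	= 1u << 2,
	EM_FLAG_COMPRESS_ZSTD	= 1u << 3,
};

#define EM_COMPRESS_MASK \
	(EM_FLAG_COMPRESS_ZLIB | EM_FLAG_COMPRESS_LZO | EM_FLAG_COMPRESS_ZSTD)

/* Plain (non-atomic) test: compression bits never change once set. */
static inline int em_is_compressed(uint32_t flags)
{
	return (flags & EM_COMPRESS_MASK) != 0;
}

/* Number of whole structures of a given size that fit in one page. */
static inline size_t maps_per_page(size_t page_size, size_t struct_size)
{
	return page_size / struct_size;
}
```

For a 4096-byte page, maps_per_page() gives 30 for a 136-byte structure and 32 for a 128-byte one, matching the figures in the text.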
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
After commit ac3c0d36a2 ("btrfs: make fiemap more efficient and accurate
reporting extent sharedness") we no longer need to create special extent
maps during fiemap that have a block start with the EXTENT_MAP_DELALLOC
value. So this block start value for extent maps is no longer used since
then, therefore remove it.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
The extent_io_tree is embedded in several structures, notably in struct
btrfs_inode. The fs_info is only used for reporting errors and for
reference in trace points. We can get to the pointer through the inode,
but not all io trees set it. However, we always know the owner and
can recognize if the inode is valid. Helpers are provided for access, with a
const variant for the trace points.
This reduces size of extent_io_tree by 8 bytes and following structures
in turn:
- btrfs_inode 1104 -> 1088
- btrfs_device 520 -> 512
- btrfs_root 1360 -> 1344
- btrfs_transaction 456 -> 440
- btrfs_fs_info 3600 -> 3592
- reloc_control 1520 -> 1512
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: David Sterba <dsterba@suse.com>
Currently we abuse the extent_map structure for two purposes:
1) To actually represent extents for inodes;
2) To represent chunk mappings.
This is odd and has several disadvantages:
1) To create a chunk map, we need to do two memory allocations: one for
an extent_map structure and another one for a map_lookup structure, so
more potential for an allocation failure and more complicated code to
manage and link two structures;
2) For a chunk map we actually only use 3 fields (24 bytes) of the
respective extent map structure: the 'start' field to have the logical
start address of the chunk, the 'len' field to have the chunk's size,
and the 'orig_block_len' field to contain the chunk's stripe size.
Besides wasting memory, it's also odd and not intuitive at all to
have the stripe size in a field named 'orig_block_len'.
We are also using 'block_len' of the extent_map structure to contain
the chunk size, so we have 2 fields for the same value, 'len' and
'block_len', which is pointless;
3) When an extent map is associated to a chunk mapping, we set the bit
EXTENT_FLAG_FS_MAPPING on its flags and then make its member named
'map_lookup' point to the associated map_lookup structure. This means
that for an extent map associated to an inode extent, we are not using
this 'map_lookup' pointer, so wasting 8 bytes (on a 64 bits platform);
4) Extent maps associated to a chunk mapping are never merged or split so
it's pointless to use the existing extent map infrastructure.
So add a dedicated data structure named 'btrfs_chunk_map' to represent
chunk mappings, this is basically the existing map_lookup structure with
some extra fields:
1) 'start' to contain the chunk logical address;
2) 'chunk_len' to contain the chunk's length;
3) 'stripe_size' for the stripe size;
4) 'rb_node' for insertion into a rb tree;
5) 'refs' for reference counting.
This way we do a single memory allocation for chunk mappings and we don't
waste memory for them with unused/unnecessary fields from an extent_map.
We also save 8 bytes from the extent_map structure by removing the
'map_lookup' pointer, so the size of struct extent_map is reduced from
144 bytes down to 136 bytes, and we can now have 30 extent maps per 4K
page instead of 28.
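The single-allocation point can be illustrated with a userspace sketch in which the per-stripe data lives in a flexible array member at the end of the structure. The field names follow the list above, but the struct layout, the stripe_sketch type, the num_stripes field and the helper functions are assumptions made for illustration; the rb_node member is omitted.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical per-stripe record. */
struct stripe_sketch {
	uint64_t physical;
};

/* Illustrative stand-in for a dedicated chunk map structure. */
struct chunk_map_sketch {
	uint64_t start;		/* chunk logical address */
	uint64_t chunk_len;	/* chunk length */
	uint64_t stripe_size;	/* stripe size */
	int refs;		/* reference count */
	int num_stripes;
	struct stripe_sketch stripes[];	/* allocated together with the map */
};

/* One allocation covers the map and all of its stripes. */
static struct chunk_map_sketch *chunk_map_alloc(int num_stripes)
{
	struct chunk_map_sketch *map;

	if (num_stripes < 0)
		return NULL;
	map = calloc(1, sizeof(*map) +
			(size_t)num_stripes * sizeof(map->stripes[0]));
	if (!map)
		return NULL;
	map->refs = 1;
	map->num_stripes = num_stripes;
	return map;
}

/* Drop a reference and free the map when the last one goes away. */
static void chunk_map_put(struct chunk_map_sketch *map)
{
	if (map && --map->refs == 0)
		free(map);
}
```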
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
An out of bounds read can occur within the tracepoint 9p_protocol_dump. In
the fast assign, there is a memcpy that uses a constant size of 32 (macro
named P9_PROTO_DUMP_SZ). When the copy is invoked, the source buffer is not
guaranteed to match this size. It was found that in some cases the source
buffer size is less than 32, resulting in a read that overruns.
The size of the source buffer seems to be known at the time of the
tracepoint being invoked. The allocations happen within p9_fcall_init(),
where the capacity field is set to the allocated size of the payload
buffer. This patch tries to fix the overrun by changing the fixed array to
a dynamically sized array and using the minimum of the capacity value or
P9_PROTO_DUMP_SZ as its length. The trace log statement is adjusted to
account for this. Note that the trace log no longer splits the payload on
the first 16 bytes. The full payload is now logged to a single line.
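The core of the fix, bounding the copy by the smaller of the buffer capacity and the dump size, can be sketched in plain C as below. This is not the tracepoint code itself: DUMP_SZ mirrors P9_PROTO_DUMP_SZ from the text, while dump_len() and copy_dump() are hypothetical helpers.

```c
#include <stddef.h>
#include <string.h>

#define DUMP_SZ 32	/* mirrors P9_PROTO_DUMP_SZ from the text */

/* Length to dump: never more than the source buffer actually holds. */
static size_t dump_len(size_t capacity)
{
	return capacity < DUMP_SZ ? capacity : DUMP_SZ;
}

/*
 * Copy at most DUMP_SZ bytes, bounded by the source capacity.
 * 'dst' must have room for DUMP_SZ bytes. Returns the bytes copied.
 */
static size_t copy_dump(unsigned char *dst, const unsigned char *src,
			size_t capacity)
{
	size_t len = dump_len(capacity);

	memcpy(dst, src, len);
	return len;
}
```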
To repro the original problem, operations to a plan 9 managed resource can
be used. The simplest approach might just be mounting a shared filesystem
(between host and guest vm) using the plan 9 protocol while the tracepoint
is enabled.
mount -t 9p -o trans=virtio <mount_tag> <mount_path>
The bpftrace program below can be used to show the out of bounds read.
Note that a recent version of bpftrace is needed for the raw tracepoint
support. The script was tested using v0.19.0.
/* from include/net/9p/9p.h */
struct p9_fcall {
    u32 size;
    u8 id;
    u16 tag;
    size_t offset;
    size_t capacity;
    struct kmem_cache *cache;
    u8 *sdata;
    bool zc;
};
tracepoint:9p:9p_protocol_dump
{
    /* out of bounds read can happen when this tracepoint is enabled */
}
rawtracepoint:9p_protocol_dump
{
    $pdu = (struct p9_fcall *)arg1;
    $dump_sz = (uint64)32;
    if ($dump_sz > $pdu->capacity) {
        printf("reading %zu bytes from src buffer of %zu bytes\n",
               $dump_sz, $pdu->capacity);
    }
}
Signed-off-by: JP Kobryn <inwardvessel@gmail.com>
Message-ID: <20231204202321.22730-1-inwardvessel@gmail.com>
Fixes: 60ece0833b ("net/9p: allocate appropriate reduced message buffers")
Cc: stable@vger.kernel.org
Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com>
Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
Fix RTT determination to be able to use any type of ACK as the response
from which RTT can be calculated provided its ack.serial is non-zero and
matches the serial number of an outgoing DATA or ACK packet. This
shouldn't be limited to REQUESTED-type ACKs as these can have other types
substituted for them for things like duplicate or out-of-order packets.
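The matching rule described above can be sketched as a lookup that ignores the ACK type entirely and considers only the serial number. The sent_probe struct and rtt_from_ack() are hypothetical names used for illustration and do not reflect the rxrpc implementation.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical record of an outgoing DATA or ACK packet awaiting a response. */
struct sent_probe {
	uint32_t serial;	/* serial number of the outgoing packet */
	uint64_t sent_at_us;	/* transmission time in microseconds */
	bool pending;		/* no RTT sample taken yet */
};

/*
 * Return an RTT sample in microseconds if ack_serial is non-zero and
 * matches a pending outgoing packet, or -1 otherwise. The ACK type is
 * deliberately not a parameter.
 */
static int64_t rtt_from_ack(struct sent_probe *probes, size_t n,
			    uint32_t ack_serial, uint64_t now_us)
{
	if (ack_serial == 0)
		return -1;
	for (size_t i = 0; i < n; i++) {
		if (probes[i].pending && probes[i].serial == ack_serial) {
			probes[i].pending = false;
			return (int64_t)(now_us - probes[i].sent_at_us);
		}
	}
	return -1;
}
```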
Fixes: 4700c4d80b ("rxrpc: Fix loss of RTT samples due to interposed ACK")
Signed-off-by: David Howells <dhowells@redhat.com>
cc: Marc Dionne <marc.dionne@auristor.com>
cc: "David S. Miller" <davem@davemloft.net>
cc: Eric Dumazet <edumazet@google.com>
cc: Jakub Kicinski <kuba@kernel.org>
cc: Paolo Abeni <pabeni@redhat.com>
cc: linux-afs@lists.infradead.org
cc: netdev@vger.kernel.org
Signed-off-by: David S. Miller <davem@davemloft.net>
Tracing the runtime delta makes sense, observer can sum over time.
Tracing the absolute vruntime makes less sense, inconsistent:
absolute-vs-delta, but also vruntime delta can be computed from
runtime delta.
Removing the vruntime thing also makes the two tracepoint sites
identical, allowing to unify the code in a later patch.
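As a reminder of why the vruntime delta is redundant, it can be derived from the runtime delta by scaling with the task's load weight. The sketch below uses plain integer division and assumes a nice-0 weight of 1024; the kernel's own fixed-point arithmetic differs, and vruntime_delta() is a hypothetical helper.

```c
#include <stdint.h>

#define NICE_0_WEIGHT 1024u	/* assumed weight of a nice-0 task */

/*
 * Scale a runtime delta by the inverse of the task's weight.
 * 'weight' must be non-zero; heavier tasks accrue vruntime more slowly.
 */
static uint64_t vruntime_delta(uint64_t delta_exec_ns, uint32_t weight)
{
	return delta_exec_ns * NICE_0_WEIGHT / weight;
}
```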
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compaction maintenance work in the
series "Fixes and cleanups to compaction".
- Joel Fernandes has a patchset ("Optimize mremap during mutual
alignment within PMD") which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested.
- More DAMON/DAMOS maintenance and feature work from SeongJae Park in the
following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series "Do not try to access unaccepted memory" Adrian Hunter
provides some fixups for the recently-added "unaccepted memory" feature,
to increase the feature's checking coverage: "Plug a few gaps where
RAM is exposed without checking if it is unaccepted memory".
- In the series "cleanups for lockless slab shrink" Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code.
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series "use refcount+RCU method to implement
lockless slab shrink".
- David Hildenbrand contributes some maintenance work for the rmap code
in the series "Anon rmap cleanups".
- Kefeng Wang does more folio conversions and some maintenance work in
the migration code. Series "mm: migrate: more folio conversion and
unification".
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series "Add and use bdev_getblk()".
- In the series "Use nth_page() in place of direct struct page
manipulation" Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames.
- In the series "mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO" has improved our handling of gigantic
pages in the hugetlb vmemmap optimization code. This provides
significant boot time improvements when significant amounts of gigantic
pages are in use.
- Matthew Wilcox has sent the series "Small hugetlb cleanups" - code
rationalization and folio conversions in the hugetlb code.
- Yin Fengwei has improved mlock()'s handling of large folios in the
series "support large folio for mlock"
- In the series "Expose swapcache stat for memcg v1" Liu Shixin has
added statistics for memcg v1 users which are available (and useful)
under memcg v2.
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named "MDWE
without inheritance".
- Kefeng Wang has provided the series "mm: convert numa balancing
functions to use a folio" which does what it says.
- In the series "mm/ksm: add fork-exec support for prctl" Stefan Roesch
makes it possible for a process to propagate KSM treatment across
exec().
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use "high
bandwidth memory" in addition to Optane Data Center Persistent Memory
Modules (DCPMM). The series is named "memory tiering: calculate
abstract distance based on ACPI HMAT"
- In the series "Smart scanning mode for KSM" Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans.
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in the
series "mm: memcg: fix tracking of pending stats updates values".
- In the series "Implement IOCTL to get and optionally clear info about
PTEs" Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits
us to atomically read-then-clear page softdirty state. This is mainly
used by CRIU.
- Hugh Dickins contributed the series "shmem,tmpfs: general maintenance"
- a bunch of relatively minor maintenance tweaks to this code.
- Matthew Wilcox has increased the use of the VMA lock over file-backed
page faults in the series "Handle more faults under the VMA lock". Some
rationalizations of the fault path became possible as a result.
- In the series "mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()" David Hildenbrand has implemented some cleanups
and folio conversions.
- In the series "various improvements to the GUP interface" Lorenzo
Stoakes has simplified and improved the GUP interface with an eye to
providing groundwork for future improvements.
- Andrey Konovalov has sent along the series "kasan: assorted fixes and
improvements" which does those things.
- Some page allocator maintenance work from Kemeng Shi in the series
"Two minor cleanups to break_down_buddy_pages".
- In the series "New selftest for mm" Breno Leitao has developed
another MM self test which tickles a race we had between madvise() and
page faults.
- In the series "Add folio_end_read" Matthew Wilcox provides cleanups
and an optimization to the core pagecache code.
- Nhat Pham has added memcg accounting for hugetlb memory in the series
"hugetlb memcg accounting".
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series "Abstract vma_merge() and split_vma()".
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which was causing some misbehaviours. In the
series "Fix page_owner's use of free timestamps".
- Lorenzo Stoakes has fixed the handling of new mappings of sealed files
in the series "permit write-sealed memfd read-only shared mappings".
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series "Batch hugetlb vmemmap modification operations".
- Some buffer_head folio conversions and cleanups from Matthew Wilcox in
the series "Finish the create_empty_buffers() transition".
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the series
"mm: PCP high auto-tuning".
- Roman Gushchin has contributed the patchset "mm: improve performance
of accounted kernel memory allocations" which improves their performance
by ~30% as measured by a micro-benchmark.
- folio conversions from Kefeng Wang in the series "mm: convert page
cpupid functions to folios".
- Some kmemleak fixups in Liu Shixin's series "Some bugfix about
kmemleak".
- Qi Zheng has improved our handling of memoryless nodes by keeping them
off the allocation fallback list. This is done in the series "handle
memoryless nodes more appropriately".
- khugepaged conversions from Vishal Moola in the series "Some
khugepaged folio conversions".
Merge tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Kemeng Shi has contributed some compaction maintenance work in the
series 'Fixes and cleanups to compaction'
- Joel Fernandes has a patchset ('Optimize mremap during mutual
alignment within PMD') which fixes an obscure issue with mremap()'s
pagetable handling during a subsequent exec(), based upon an
implementation which Linus suggested
- More DAMON/DAMOS maintenance and feature work from SeongJae Park in
the following patch series:
mm/damon: misc fixups for documents, comments and its tracepoint
mm/damon: add a tracepoint for damos apply target regions
mm/damon: provide pseudo-moving sum based access rate
mm/damon: implement DAMOS apply intervals
mm/damon/core-test: Fix memory leaks in core-test
mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval
- In the series 'Do not try to access unaccepted memory' Adrian
Hunter provides some fixups for the recently-added 'unaccepted
memory' feature, to increase the feature's checking coverage: 'Plug
a few gaps where RAM is exposed without checking if it is
unaccepted memory'
- In the series 'cleanups for lockless slab shrink' Qi Zheng has done
some maintenance work which is preparation for the lockless slab
shrinking code
- Qi Zheng has redone the earlier (and reverted) attempt to make slab
shrinking lockless in the series 'use refcount+RCU method to
implement lockless slab shrink'
- David Hildenbrand contributes some maintenance work for the rmap
code in the series 'Anon rmap cleanups'
- Kefeng Wang does more folio conversions and some maintenance work
in the migration code. Series 'mm: migrate: more folio conversion
and unification'
- Matthew Wilcox has fixed an issue in the buffer_head code which was
causing long stalls under some heavy memory/IO loads. Some cleanups
were added on the way. Series 'Add and use bdev_getblk()'
- In the series 'Use nth_page() in place of direct struct page
manipulation' Zi Yan has fixed a potential issue with the direct
manipulation of hugetlb page frames
- In the series 'mm: hugetlb: Skip initialization of gigantic tail
struct pages if freed by HVO' has improved our handling of gigantic
pages in the hugetlb vmemmap optimization code. This provides
significant boot time improvements when significant amounts of
gigantic pages are in use
- Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code
rationalization and folio conversions in the hugetlb code
- Yin Fengwei has improved mlock()'s handling of large folios in the
series 'support large folio for mlock'
- In the series 'Expose swapcache stat for memcg v1' Liu Shixin has
added statistics for memcg v1 users which are available (and
useful) under memcg v2
- Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable)
prctl so that userspace may direct the kernel to not automatically
propagate the denial to child processes. The series is named 'MDWE
without inheritance'
- Kefeng Wang has provided the series 'mm: convert numa balancing
functions to use a folio' which does what it says
- In the series 'mm/ksm: add fork-exec support for prctl' Stefan
Roesch makes it possible for a process to propagate KSM treatment
across exec()
- Huang Ying has enhanced memory tiering's calculation of memory
distances. This is used to permit the dax/kmem driver to use 'high
bandwidth memory' in addition to Optane Data Center Persistent
Memory Modules (DCPMM). The series is named 'memory tiering:
calculate abstract distance based on ACPI HMAT'
- In the series 'Smart scanning mode for KSM' Stefan Roesch has
optimized KSM by teaching it to retain and use some historical
information from previous scans
- Yosry Ahmed has fixed some inconsistencies in memcg statistics in
the series 'mm: memcg: fix tracking of pending stats updates
values'
- In the series 'Implement IOCTL to get and optionally clear info
about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap
which permits us to atomically read-then-clear page softdirty
state. This is mainly used by CRIU
- Hugh Dickins contributed the series 'shmem,tmpfs: general
maintenance', a bunch of relatively minor maintenance tweaks to
this code
- Matthew Wilcox has increased the use of the VMA lock over
file-backed page faults in the series 'Handle more faults under the
VMA lock'. Some rationalizations of the fault path became possible
as a result
- In the series 'mm/rmap: convert page_move_anon_rmap() to
folio_move_anon_rmap()' David Hildenbrand has implemented some
cleanups and folio conversions
- In the series 'various improvements to the GUP interface' Lorenzo
Stoakes has simplified and improved the GUP interface with an eye
to providing groundwork for future improvements
- Andrey Konovalov has sent along the series 'kasan: assorted fixes
and improvements' which does those things
- Some page allocator maintenance work from Kemeng Shi in the series
'Two minor cleanups to break_down_buddy_pages'
- In the series 'New selftest for mm' Breno Leitao has developed
another MM self test which tickles a race we had between madvise()
and page faults
- In the series 'Add folio_end_read' Matthew Wilcox provides cleanups
and an optimization to the core pagecache code
- Nhat Pham has added memcg accounting for hugetlb memory in the
series 'hugetlb memcg accounting'
- Cleanups and rationalizations to the pagemap code from Lorenzo
Stoakes, in the series 'Abstract vma_merge() and split_vma()'
- Audra Mitchell has fixed issues in the procfs page_owner code's new
timestamping feature which were causing some misbehaviours, in the
series 'Fix page_owner's use of free timestamps'
- Lorenzo Stoakes has fixed the handling of new mappings of sealed
files in the series 'permit write-sealed memfd read-only shared
mappings'
- Mike Kravetz has optimized the hugetlb vmemmap optimization in the
series 'Batch hugetlb vmemmap modification operations'
- Some buffer_head folio conversions and cleanups from Matthew Wilcox
in the series 'Finish the create_empty_buffers() transition'
- As a page allocator performance optimization Huang Ying has added
automatic tuning to the allocator's per-cpu-pages feature, in the
series 'mm: PCP high auto-tuning'
- Roman Gushchin has contributed the patchset 'mm: improve
performance of accounted kernel memory allocations' which improves
their performance by ~30% as measured by a micro-benchmark
- folio conversions from Kefeng Wang in the series 'mm: convert page
cpupid functions to folios'
- Some kmemleak fixups in Liu Shixin's series 'Some bugfix about
kmemleak'
- Qi Zheng has improved our handling of memoryless nodes by keeping
them off the allocation fallback list. This is done in the series
'handle memoryless nodes more appropriately'
- khugepaged conversions from Vishal Moola in the series 'Some
khugepaged folio conversions'"
[ bcachefs conflicts with the dynamically allocated shrinkers have been
resolved as per Stephen Rothwell in
https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/
with help from Qi Zheng.
The clone3 test filtering conflict was half-arsed by yours truly ]
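As an illustration of the 'MDWE without inheritance' item above, here
is a minimal userspace sketch of how a process might ask for MDWE
without having it propagate to its children. The PR_* constants are
the ones from include/uapi/linux/prctl.h (fallback definitions are
given for older headers); the helper name enable_mdwe_no_inherit() is
hypothetical and exists only for this example. On kernels without the
feature the prctl() call is expected to fail, which the helper
reports as a negative errno.

```c
#include <errno.h>
#include <sys/prctl.h>

/* Fallbacks for older userspace headers; values as in
 * include/uapi/linux/prctl.h. */
#ifndef PR_SET_MDWE
#define PR_SET_MDWE 65
#endif
#ifndef PR_GET_MDWE
#define PR_GET_MDWE 66
#endif
#ifndef PR_MDWE_REFUSE_EXEC_GAIN
#define PR_MDWE_REFUSE_EXEC_GAIN (1UL << 0)
#endif
#ifndef PR_MDWE_NO_INHERIT
#define PR_MDWE_NO_INHERIT (1UL << 1)
#endif

/*
 * Hypothetical helper (not part of the kernel series): enable MDWE for
 * the calling process only, so that children start without it.
 *
 * Returns the MDWE flags now in effect (>= 0) on success, or -errno if
 * the kernel rejects the request (e.g. the feature is not supported).
 * Note that MDWE cannot be switched off again once enabled.
 */
static int enable_mdwe_no_inherit(void)
{
	unsigned long flags = PR_MDWE_REFUSE_EXEC_GAIN | PR_MDWE_NO_INHERIT;

	/* The unused prctl() arguments must be zero. */
	if (prctl(PR_SET_MDWE, flags, 0L, 0L, 0L) != 0)
		return -errno;

	/* Read the flags back to confirm what the kernel applied. */
	return prctl(PR_GET_MDWE, 0L, 0L, 0L, 0L);
}
```

A caller would typically invoke this early, before spawning helper
processes that still need to create executable mappings.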
* tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits)
mm/damon/sysfs: update monitoring target regions for online input commit
mm/damon/sysfs: remove requested targets when online-commit inputs
selftests: add a sanity check for zswap
Documentation: maple_tree: fix word spelling error
mm/vmalloc: fix the unchecked dereference warning in vread_iter()
zswap: export compression failure stats
Documentation: ubsan: drop "the" from article title
mempolicy: migration attempt to match interleave nodes
mempolicy: mmap_lock is not needed while migrating folios
mempolicy: alloc_pages_mpol() for NUMA policy without vma
mm: add page_rmappable_folio() wrapper
mempolicy: remove confusing MPOL_MF_LAZY dead code
mempolicy: mpol_shared_policy_init() without pseudo-vma
mempolicy trivia: use pgoff_t in shared mempolicy tree
mempolicy trivia: slightly more consistent naming
mempolicy trivia: delete those ancient pr_debug()s
mempolicy: fix migrate_pages(2) syscall return nr_failed
kernfs: drop shared NUMA mempolicy hooks
hugetlbfs: drop shared NUMA mempolicy pretence
mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets()
...
Updates to the usual drivers (ufs, megaraid_sas, lpfc, target, ibmvfc,
scsi_debug) plus the usual assorted minor fixes and updates. The
major change this time around is a prep patch for rethreading of the
driver reset handler API not to take a scsi_cmd structure which starts
to reduce various drivers' dependence on scsi_cmd in error handling.
Signed-off-by: James E.J. Bottomley <jejb@linux.ibm.com>
Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi
Pull SCSI updates from James Bottomley:
"Updates to the usual drivers (ufs, megaraid_sas, lpfc, target, ibmvfc,
scsi_debug) plus the usual assorted minor fixes and updates.
The major change this time around is a prep patch for rethreading of
the driver reset handler API not to take a scsi_cmd structure which
starts to reduce various drivers' dependence on scsi_cmd in error
handling"
* tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi: (132 commits)
scsi: ufs: core: Leave space for '\0' in utf8 desc string
scsi: ufs: core: Conversion to bool not necessary
scsi: ufs: core: Fix race between force complete and ISR
scsi: megaraid: Fix up debug message in megaraid_abort_and_reset()
scsi: aic79xx: Fix up NULL command in ahd_done()
scsi: message: fusion: Initialize return value in mptfc_bus_reset()
scsi: mpt3sas: Fix loop logic
scsi: snic: Remove useless code in snic_dr_clean_pending_req()
scsi: core: Add comment to target_destroy in scsi_host_template
scsi: core: Clean up scsi_dev_queue_ready()
scsi: pmcraid: Add missing scsi_device_put() in pmcraid_eh_target_reset_handler()
scsi: target: core: Fix kernel-doc comment
scsi: pmcraid: Fix kernel-doc comment
scsi: core: Handle depopulation and restoration in progress
scsi: ufs: core: Add support for parsing OPP
scsi: ufs: core: Add OPP support for scaling clocks and regulators
scsi: ufs: dt-bindings: common: Add OPP table
scsi: scsi_debug: Add param to control sdev's allow_restart
scsi: scsi_debug: Add debugfs interface to fail target reset
scsi: scsi_debug: Add new error injection type: Reset LUN failed
...
The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will
be maintained as an LTS kernel.
The architecture specific system call tables are updated for
the added map_shadow_stack() syscall and to remove references
to the long-gone sys_lookup_dcookie() syscall.
Merge tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic
Pull ia64 removal and asm-generic updates from Arnd Bergmann:
- The ia64 architecture gets its well-earned retirement as planned,
now that there is one last (mostly) working release that will be
maintained as an LTS kernel.
- The architecture specific system call tables are updated for the
added map_shadow_stack() syscall and to remove references to the
long-gone sys_lookup_dcookie() syscall.
* tag 'asm-generic-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/arnd/asm-generic:
hexagon: Remove unusable symbols from the ptrace.h uapi
asm-generic: Fix spelling of architecture
arch: Reserve map_shadow_stack() syscall number for all architectures
syscalls: Cleanup references to sys_lookup_dcookie()
Documentation: Drop or replace remaining mentions of IA64
lib/raid6: Drop IA64 support
Documentation: Drop IA64 from feature descriptions
kernel: Drop IA64 support from sig_fault handlers
arch: Remove Itanium (IA-64) architecture
Merge tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm
Pull drm updates from Dave Airlie:
"Highlights:
- AMD adds some more upcoming HW platforms
- Intel made Meteorlake stable and started adding Lunarlake
- nouveau has a bunch of display rework in preparation for the NVIDIA
GSP firmware support
- msm adds a7xx support
- habanalabs has finished migration to accel subsystem
Detail summary:
kernel:
- add initial vmemdup-user-array
core:
- fix platform remove() to return void
- drm_file owner updated to reflect owner
- move size calcs to drm buddy allocator
- let GPUVM build as a module
- allow variable number of run-queues in scheduler
edid:
- handle bad h/v sync_end in EDIDs
panfrost:
- add Boris as maintainer
fbdev:
- use fb_ops helpers more
- only allow logo use from fbcon
- rename fb_pgproto to pgprot_framebuffer
- add HPD state to drm_connector_oob_hotplug_event
- convert to fbdev i/o mem helpers
i915:
- Enable meteorlake by default
- Early Xe2 LPD/Lunarlake display enablement
- Rework subplatforms into IP version checks
- GuC based TLB invalidation for Meteorlake
- Display rework for future Xe driver integration
- LNL FBC features
- LNL display feature capability reads
- update recommended fw versions for DG2+
- drop fastboot module parameter
- added deviceid for Arrowlake-S
- drop preproduction workarounds
- don't disable preemption for resets
- cleanup inlines in headers
- PXP firmware loading fix
- Fix sg list lengths
- DSC PPS state readout/verification
- Add more RPL P/U PCI IDs
- Add new DG2-G12 stepping
- DP enhanced framing support to state checker
- Improve shared link bandwidth management
- stop using GEM macros in display code
- refactor related code into display code
- locally enable W=1 warnings
- remove PSR watchdog timers on LNL
amdgpu:
- RAS/FRU EEPROM updates
- IP discovery updates
- GC 11.5 support
- DCN 3.5 support
- VPE 6.1 support
- NBIO 7.11 support
- DML2 support
- lots of IP updates
- use flexible arrays for bo list handling
- W=1 fixes
- Enable seamless boot in more cases
- Enable context type property for HDMI
- Rework GPUVM TLB flushing
- VCN IB start/size alignment fixes
amdkfd:
- GC 10/11 fixes
- GC 11.5 support
- use partial migration in GPU faults
radeon:
- W=1 Fixes
- fix some possible buffer overflow/NULL derefs
nouveau:
- update uapi for NO_PREFETCH
- scheduler/fence fixes
- rework suspend/resume for GSP-RM
- rework display in preparation for GSP-RM
habanalabs:
- uapi: expose tsc clock
- uapi: block access to eventfd through control device
- uapi: force dma-buf export to PAGE_SIZE alignments
- complete move to accel subsystem
- move firmware interface include files
- perform hard reset on PCIe AXI drain event
- optimise user interrupt handling
msm:
- DP: use existing helpers for DPCD
- DPU: interrupts reworked
- gpu: a7xx (a730/a740) support
- decouple msm_drv from kms for headless devices
mediatek:
- MT8188 dsi/dp/edp support
- DDP GAMMA - 12 bit LUT support
- connector dynamic selection capability
rockchip:
- rv1126 mipi-dsi/vop support
- add planar formats
ast:
- rename constants
panels:
- Mitsubishi AA084XE01
- JDI LPM102A188A
- LTK050H3148W-CTA6
ivpu:
- power management fixes
qaic:
- add detach slice bo api
komeda:
- add NV12 writeback
tegra:
- support NVSYNC/NHSYNC
- host1x suspend fixes
ili9882t:
- separate into own driver"
* tag 'drm-next-2023-10-31-1' of git://anongit.freedesktop.org/drm/drm: (1803 commits)
drm/amdgpu: Remove unused variables from amdgpu_show_fdinfo
drm/amdgpu: Remove duplicate fdinfo fields
drm/amd/amdgpu: avoid to disable gfxhub interrupt when driver is unloaded
drm/amdgpu: Add EXT_COHERENT support for APU and NUMA systems
drm/amdgpu: Retrieve CE count from ce_count_lo_chip in EccInfo table
drm/amdgpu: Identify data parity error corrected in replay mode
drm/amdgpu: Fix typo in IP discovery parsing
drm/amd/display: fix S/G display enablement
drm/amdxcp: fix amdxcp unloads incompletely
drm/amd/amdgpu: fix the GPU power print error in pm info
drm/amdgpu: Use pcie domain of xcc acpi objects
drm/amd: check num of link levels when update pcie param
drm/amdgpu: Add a read to GFX v9.4.3 ring test
drm/amd/pm: call smu_cmn_get_smc_version in is_mode1_reset_supported.
drm/amdgpu: get RAS poison status from DF v4_6_2
drm/amdgpu: Use discovery table's subrevision
drm/amd/display: 3.2.256
drm/amd/display: add interface to query SubVP status
drm/amd/display: Read before writing Backlight Mode Set Register
drm/amd/display: Disable SYMCLK32_SE RCO on DCN314
...
Highlights:
- asus-wmi: Support for screenpad and solve brightness key
press duplication
- int3472: Eliminate the last use of deprecated GPIO functions
- mlxbf-pmc: New HW support
- msi-ec: Support new EC configurations
- thinkpad_acpi: Support reading aux MAC address during passthrough
- wmi: Fixes & improvements
- x86-android-tablets: Detection fix and avoid use of GPIO private APIs
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
The following is an automated shortlog grouped by driver:
acer-wmi:
- Remove void function return
amd/hsmp:
- add support for metrics tbl
- create plat specific struct
- Fix iomem handling
- improve the error log
amd/pmc:
- Add dump_custom_stb module parameter
- Add PMFW command id to support S2D force flush
- Handle overflow cases where the num_samples range is higher
- Use flex array when calling amd_pmc_stb_debugfs_open_v2()
asus-wireless:
- Replace open coded acpi_match_acpi_device()
asus-wmi:
- add support for ASUS screenpad
- Do not report brightness up/down keys when also reported by acpi_video
gpiolib: acpi:
- Add a ignore interrupt quirk for Peaq C1010
- Check if a GPIO is listed in ignore_interrupt earlier
hp-bioscfg:
- Annotate struct bios_args with __counted_by
inspur-platform-profile:
- Add platform profile support
int3472:
- Add new skl_int3472_fill_gpiod_lookup() helper
- Add new skl_int3472_gpiod_get_from_temp_lookup() helper
- Stop using gpiod_toggle_active_low()
- Switch to devm_get_gpiod()
intel: bytcrc_pwrsrc:
- Convert to platform remove callback returning void
intel/ifs:
- Add new CPU support
- Add new error code
- ARRAY BIST for Sierra Forest
- Gen2 scan image loading
- Gen2 Scan test support
- Metadata validation for start_chunk
- Refactor image loading code
- Store IFS generation number
- Validate image size
intel_speed_select_if:
- Remove hardcoded map size
- Use devm_ioremap_resource
intel/tpmi:
- Add debugfs support for read/write blocked
- Add defines to get version information
intel-uncore-freq:
- Ignore minor version change
ISST:
- Allow level 0 to be not present
- Ignore minor version change
- Use fuse enabled mask instead of allowed levels
mellanox:
- Fix misspelling error in routine name
- Rename some init()/exit() functions for consistent naming
mlxbf-bootctl:
- Convert to platform remove callback returning void
mlxbf-pmc:
- Add support for BlueField-3
mlxbf-tmfifo:
- Convert to platform remove callback returning void
mlx-platform:
- Convert to platform remove callback returning void
mlxreg-hotplug:
- Convert to platform remove callback returning void
mlxreg-io:
- Convert to platform remove callback returning void
mlxreg-lc:
- Convert to platform remove callback returning void
msi-ec:
- Add more EC configs
- rename fn_super_swap
nvsw-sn2201:
- Convert to platform remove callback returning void
sel3350-platform:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt-apollolake:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt-elkhartlake:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt-f7188x:
- Convert to platform remove callback returning void
siemens: simatic-ipc-batt:
- Simplify simatic_ipc_batt_remove()
surface: acpi-notify:
- Convert to platform remove callback returning void
surface: aggregator:
- Annotate struct ssam_event with __counted_by
surface: aggregator-cdev:
- Convert to platform remove callback returning void
surface: aggregator-registry:
- Convert to platform remove callback returning void
surface: dtx:
- Convert to platform remove callback returning void
surface: gpe:
- Convert to platform remove callback returning void
surface: hotplug:
- Convert to platform remove callback returning void
surface: surface3-wmi:
- Convert to platform remove callback returning void
think-lmi:
- Add bulk save feature
- Replace kstrdup() + strreplace() with kstrdup_and_replace()
- Use strreplace() to replace a character by nul
thinkpad_acpi:
- Add battery quirk for Thinkpad X120e
- replace deprecated strncpy with memcpy
- sysfs interface to auxmac
tools/power/x86/intel-speed-select:
- Display error for core-power support
- Increase max CPUs in one request
- No TRL for non compute domains
- Sanitize integer arguments
- turbo-mode enable disable swapped
- Update help for TRL
- Use cgroup isolate for CPU 0
- v1.18 release
wmi:
- Decouple probe deferring from wmi_block_list
- Decouple WMI device removal from wmi_block_list
- Fix opening of char device
- Fix probe failure when failing to register WMI devices
- Fix refcounting of WMI devices in legacy functions
x86-android-tablets:
- Add a comment about x86_android_tablet_get_gpiod()
- Create a platform_device from module_init()
- Drop "linux,power-supply-name" from lenovo_yt3_bq25892_0_props[]
- Fix Lenovo Yoga Tablet 2 830F/L vs 1050F/L detection
- Remove invalid_aei_gpiochip from Peaq C1010
- Remove invalid_aei_gpiochip support
- Stop using gpiolib private APIs
- Use platform-device as gpio-keys parent
xo15-ebook:
- Replace open coded acpi_match_acpi_device()
Merges:
- Merge branch 'pdx86/platform-drivers-x86-int3472' into review-ilpo
- Merge branch 'pdx86/platform-drivers-x86-mellanox-init' into review-ilpo
- Merge remote-tracking branch 'intel-speed-select/intel-sst' into review-ilpo
- Merge remote-tracking branch 'pdx86/platform-drivers-x86-android-tablets' into review-hans
Merge tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86
Pull x86 platform driver updates from Ilpo Järvinen:
- asus-wmi: Support for screenpad and solve brightness key press
duplication
- int3472: Eliminate the last use of deprecated GPIO functions
- mlxbf-pmc: New HW support
- msi-ec: Support new EC configurations
- thinkpad_acpi: Support reading aux MAC address during passthrough
- wmi: Fixes & improvements
- x86-android-tablets: Detection fix and avoid use of GPIO private APIs
- Debug & metrics interface improvements
- Miscellaneous cleanups / fixes / improvements
* tag 'platform-drivers-x86-v6.7-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (80 commits)
platform/x86: inspur-platform-profile: Add platform profile support
platform/x86: thinkpad_acpi: Add battery quirk for Thinkpad X120e
platform/x86: wmi: Decouple WMI device removal from wmi_block_list
platform/x86: wmi: Fix opening of char device
platform/x86: wmi: Fix probe failure when failing to register WMI devices
platform/x86: wmi: Fix refcounting of WMI devices in legacy functions
platform/x86: wmi: Decouple probe deferring from wmi_block_list
platform/x86/amd/hsmp: Fix iomem handling
platform/x86: asus-wmi: Do not report brightness up/down keys when also reported by acpi_video
platform/x86: thinkpad_acpi: replace deprecated strncpy with memcpy
tools/power/x86/intel-speed-select: v1.18 release
tools/power/x86/intel-speed-select: Use cgroup isolate for CPU 0
tools/power/x86/intel-speed-select: Increase max CPUs in one request
tools/power/x86/intel-speed-select: Display error for core-power support
tools/power/x86/intel-speed-select: No TRL for non compute domains
tools/power/x86/intel-speed-select: turbo-mode enable disable swapped
tools/power/x86/intel-speed-select: Update help for TRL
tools/power/x86/intel-speed-select: Sanitize integer arguments
platform/x86: acer-wmi: Remove void function return
platform/x86/amd/pmc: Add dump_custom_stb module parameter
...
Core & protocols
----------------
- Support usec resolution of TCP timestamps, enabled selectively by
a route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were created
via the kernel cmdline at boot, rather than via configfs at runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
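To make the MPTCP rcvlowat item above more concrete, the following is
a small sketch of how an application might request a receive
low-water mark on an MPTCP socket, so that it is woken only once a
given amount of data is queued. It uses only the standard socket(),
setsockopt() and SO_RCVLOWAT interfaces; the helper name
mptcp_socket_with_rcvlowat() is hypothetical, and the IPPROTO_MPTCP
fallback value (262) is only there for older libc headers. Whether the
option is honoured depends on the running kernel, so the helper
reports any failure as a negative errno.

```c
#include <errno.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

/* Fallback for libc headers that predate MPTCP. */
#ifndef IPPROTO_MPTCP
#define IPPROTO_MPTCP 262
#endif

/*
 * Hypothetical helper: create an MPTCP socket and set SO_RCVLOWAT so
 * that readers are woken only when at least lowat_bytes are available.
 *
 * Returns the socket fd (>= 0) on success or -errno on failure, for
 * example when the kernel was built without MPTCP support.
 */
static int mptcp_socket_with_rcvlowat(int lowat_bytes)
{
	int fd = socket(AF_INET, SOCK_STREAM, IPPROTO_MPTCP);

	if (fd < 0)
		return -errno;

	if (setsockopt(fd, SOL_SOCKET, SO_RCVLOWAT,
		       &lowat_bytes, sizeof(lowat_bytes)) != 0) {
		int err = errno;

		close(fd);
		return -err;
	}

	return fd;
}
```

A caller that gets a negative value back can fall back to a plain TCP
socket.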
BPF
---
- Add new veth-like netdevice where BPF program defines the logic
of the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should
never be true but are hard for the verifier to infer.
With some extra flexibility around handling of the exit / failure.
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on
the value for the current CPU. This allows deprecating local
one-off implementations of per-CPU storage like
BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support.
One of the use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed kprobe
executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code
----------------------
- overflow: add DEFINE_FLEX() for on-stack definition of structs
with flexible array members.
- Process doc update with more guidance for reviewers.
Driver API
----------
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring
and querying state of PLL circuits used for clock syntonization,
in network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc
----
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed
-------
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers
-------
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block selection
for IPv6 multicast forwarding and ordering of blocks in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS
to make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE6jPA+I1ugmIBA4hXMUZtbf5SIrsFAmU8XsYACgkQMUZtbf5S
Irv19RAAnud/24OOF5XMEJkIcYlnfqximh4XO6PujRSYkSkOUJdZTF6iJPgf3pSP
YpwoHYbYKHYfeOf8+3bTNESiQNSnoVmvmvwiS6/7lZ3behHUrGLQzW9Htc3EZyWH
2h6QkDZ5OOjfg0bwYSfp3vXkmMH2k8WE9Y0NvCkhcohqZi13Rmp14RnyPmNb2d1V
yZRYDMSM133KqE6gnBr1Ct65IEvnKeGlCUN2mTGqOJgdn6DZMsyxvtt0y4rmN7Ab
41+CgPU5SfxfbYpW+Dl2HJpgfte3WrC57KC6AM0PAPJzPmQWgeB/m9mjz/apj6Bg
bhsEIo7FdvbCnQm3yWPhK2OgCAcSwLr8jfGMU+Q+W4VnL5SRRR3Rm0zjsze+kHNP
OfqJgxzl3DpvoJqVBy1h5FGcZt0XHwhksm4cTxWqIahsF+veY0ECBXbuBBQx9XTF
Y7INfI8ulg7wISJs+CJfIClYkgOibTw2u8taBS5ikbtgxNqp5D4QqODn7UefQap1
PR/IDYODF+zRgmMJLeBqSa6fij6BkfOEDiOWak5kggBoZdtbtmeKI6tzze06CNdW
lWv1WEhRufxnwK+IuWsEkjhiMbs2WGLvkJ5JbgQV9BfqHfIfiqBCrcWtT/WbQnGt
lmU46CXh1t/FZEqbmK9h+8vsIIfrcDl6jb5npEiKPRG00vDKRTM=
=46nS
-----END PGP SIGNATURE-----
Merge tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next
Pull networking updates from Jakub Kicinski:
"Core & protocols:
- Support usec resolution of TCP timestamps, enabled selectively by a
route attribute.
- Defer regular TCP ACK while processing socket backlog, try to send
a cumulative ACK at the end. Increase single TCP flow performance
on a 200Gbit NIC by 20% (100Gbit -> 120Gbit).
- The Fair Queuing (FQ) packet scheduler:
- add built-in 3 band prio / WRR scheduling
- support bypass if the qdisc is mostly idle (5% speed up for TCP RR)
- improve inactive flow reporting
- optimize the layout of structures for better cache locality
- Support TCP Authentication Option (RFC 5925, TCP-AO), a more modern
replacement for the old MD5 option.
- Add more retransmission timeout (RTO) related statistics to
TCP_INFO.
- Support sending fragmented skbs over vsock sockets.
- Make sure we send SIGPIPE for vsock sockets if socket was
shutdown().
- Add sysctl for ignoring lower limit on lifetime in Router
Advertisement PIO, based on an in-progress IETF draft.
- Add sysctl to control activation of TCP ping-pong mode.
- Add sysctl to make connection timeout in MPTCP configurable.
- Support rcvlowat and notsent_lowat on MPTCP sockets, to help apps
limit the number of wakeups.
- Support netlink GET for MDB (multicast forwarding), allowing user
space to request a single MDB entry instead of dumping the entire
table.
- Support selective FDB flushing in the VXLAN tunnel driver.
- Allow limiting learned FDB entries in bridges, prevent OOM attacks.
- Allow controlling via configfs netconsole targets which were
created via the kernel cmdline at boot, rather than via configfs at
runtime.
- Support multiple PTP timestamp event queue readers with different
filters.
- MCTP over I3C.
BPF:
- Add new veth-like netdevice where BPF program defines the logic of
the xmit routine. It can operate in L3 and L2 mode.
- Support exceptions - allow asserting conditions which should never
be true but are hard for the verifier to infer. With some extra
flexibility around handling of the exit / failure:
https://lwn.net/Articles/938435/
- Add support for local per-cpu kptr, allow allocating and storing
per-cpu objects in maps. Access to those objects operates on the
value for the current CPU.
This allows deprecating local one-off implementations of per-CPU
storage like BPF_MAP_TYPE_PERCPU_CGROUP_STORAGE maps.
- Extend cgroup BPF sockaddr hooks for UNIX sockets. The use case is
for systemd to re-implement the LogNamespace feature which allows
running multiple instances of systemd-journald to process the logs
of different services.
- Enable open-coded task_vma iteration, after maple tree conversion
made it hard to directly walk VMAs in tracing programs.
- Add open-coded task, css_task and css iterator support. One of the
use cases is customizable OOM victim selection via BPF.
- Allow source address selection with bpf_*_fib_lookup().
- Add ability to pin BPF timer to the current CPU.
- Prevent creation of infinite loops by combining tail calls and
fentry/fexit programs.
- Add missed stats for kprobes to retrieve the number of missed
kprobe executions and subsequent executions of BPF programs.
- Inherit system settings for CPU security mitigations.
- Add BPF v4 CPU instruction support for arm32 and s390x.
Changes to common code:
- overflow: add DEFINE_FLEX() for on-stack definition of structs with
flexible array members.
- Process doc update with more guidance for reviewers.
Driver API:
- Simplify locking in WiFi (cfg80211 and mac80211 layers), use wiphy
mutex in most places and remove a lot of smaller locks.
- Create a common DPLL configuration API. Allow configuring and
querying state of PLL circuits used for clock syntonization, in
network time distribution.
- Unify fragmented and full page allocation APIs in page pool code.
Let drivers be ignorant of PAGE_SIZE.
- Rework PHY state machine to avoid races with calls to phy_stop().
- Notify DSA drivers of MAC address changes on user ports, improve
correctness of offloads which depend on matching port MAC
addresses.
- Allow antenna control on injected WiFi frames.
- Reduce the number of variants of napi_schedule().
- Simplify error handling when composing devlink health messages.
Misc:
- A lot of KCSAN data race "fixes", from Eric.
- A lot of __counted_by() annotations, from Kees.
- A lot of strncpy -> strscpy and printf format fixes.
- Replace master/slave terminology with conduit/user in DSA drivers.
- Handful of KUnit tests for netdev and WiFi core.
Removed:
- AppleTalk COPS.
- AppleTalk ipddp.
- TI AR7 CPMAC Ethernet driver.
Drivers:
- Ethernet high-speed NICs:
- Intel (100G, ice, idpf):
- add a driver for the Intel E2000 IPUs
- make CRC/FCS stripping configurable
- cross-timestamping for E823 devices
- basic support for E830 devices
- use aux-bus for managing client drivers
- i40e: report firmware versions via devlink
- nVidia/Mellanox:
- support 4-port NICs
- increase max number of channels to 256
- optimize / parallelize SF creation flow
- Broadcom (bnxt):
- enhance NIC temperature reporting
- support PAM4 speeds and lane configuration
- Marvell OcteonTX2:
- PTP pulse-per-second output support
- enable hardware timestamping for VFs
- Solarflare/AMD:
- conntrack NAT offload and offload for tunnels
- Wangxun (ngbe/txgbe):
- expose HW statistics
- Pensando/AMD:
- support PCI level reset
- narrow down the condition under which skbs are linearized
- Netronome/Corigine (nfp):
- support CHACHA20-POLY1305 crypto in IPsec offload
- Ethernet NICs embedded, slower, virtual:
- Synopsys (stmmac):
- add Loongson-1 SoC support
- enable use of HW queues with no offload capabilities
- enable PPS input support on all 5 channels
- increase TX coalesce timer to 5ms
- RealTek USB (r8152): improve efficiency of Rx by using GRO frags
- xen: support SW packet timestamping
- add drivers for implementations based on TI's PRUSS (AM64x EVM)
- nVidia/Mellanox Ethernet datacenter switches:
- avoid poor HW resource use on Spectrum-4 by better block
selection for IPv6 multicast forwarding and ordering of blocks
in ACL region
- Ethernet embedded switches:
- Microchip:
- support configuring the drive strength for EMI compliance
- ksz9477: partial ACL support
- ksz9477: HSR offload
- ksz9477: Wake on LAN
- Realtek:
- rtl8366rb: respect device tree config of the CPU port
- Ethernet PHYs:
- support Broadcom BCM5221 PHYs
- TI dp83867: support hardware LED blinking
- CAN:
- add support for Linux-PHY based CAN transceivers
- at91_can: clean up and use rx-offload helpers
- WiFi:
- MediaTek (mt76):
- new sub-driver for mt7925 USB/PCIe devices
- HW wireless <> Ethernet bridging in MT7988 chips
- mt7603/mt7628 stability improvements
- Qualcomm (ath12k):
- WCN7850:
- enable 320 MHz channels in 6 GHz band
- hardware rfkill support
- enable IEEE80211_HW_SINGLE_SCAN_ON_ALL_BANDS to
make scan faster
- read board data variant name from SMBIOS
- QCN9274: mesh support
- RealTek (rtw89):
- TDMA-based multi-channel concurrency (MCC)
- Silicon Labs (wfx):
- Remain-On-Channel (ROC) support
- Bluetooth:
- ISO: many improvements for broadcast support
- mark BCM4378/BCM4387 as BROKEN_LE_CODED
- add support for QCA2066
- btmtksdio: enable Bluetooth wakeup from suspend"
* tag 'net-next-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1816 commits)
net: pcs: xpcs: Add 2500BASE-X case in get state for XPCS drivers
net: bpf: Use sockopt_lock_sock() in ip_sock_set_tos()
net: mana: Use xdp_set_features_flag instead of direct assignment
vxlan: Cleanup IFLA_VXLAN_PORT_RANGE entry in vxlan_get_size()
iavf: delete the iavf client interface
iavf: add a common function for undoing the interrupt scheme
iavf: use unregister_netdev
iavf: rely on netdev's own registered state
iavf: fix the waiting time for initial reset
iavf: in iavf_down, don't queue watchdog_task if comms failed
iavf: simplify mutex_trylock+sleep loops
iavf: fix comments about old bit locks
doc/netlink: Update schema to support cmd-cnt-name and cmd-max-name
tools: ynl: introduce option to process unknown attributes or types
ipvlan: properly track tx_errors
netdevsim: Block until all devices are released
nfp: using napi_build_skb() to replace build_skb()
net: dsa: microchip: ksz9477: Fix spelling mistake "Enery" -> "Energy"
net: dsa: microchip: Ensure Stable PME Pin State for Wake-on-LAN
net: dsa: microchip: Refactor switch shutdown routine for WoL preparation
...
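The strncpy -> strscpy conversions listed under Misc rely on strscpy()
always NUL-terminating the destination and reporting truncation. The
user-space sketch below imitates that contract for illustration only;
demo_strscpy() is a hypothetical name and is not the kernel
implementation.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h> /* ssize_t */

/*
 * Hypothetical user-space imitation of the strscpy() contract:
 * copy at most size - 1 bytes, always NUL-terminate, and return the
 * number of bytes copied, or -E2BIG if src did not fit (or size is 0).
 */
static ssize_t demo_strscpy(char *dst, const char *src, size_t size)
{
	size_t i;

	if (size == 0)
		return -E2BIG;

	/* Copy until the buffer is nearly full or src ends. */
	for (i = 0; i < size - 1 && src[i] != '\0'; i++)
		dst[i] = src[i];
	dst[i] = '\0';

	/* If src has more bytes left, the copy was truncated. */
	return src[i] == '\0' ? (ssize_t)i : -E2BIG;
}
```

Unlike strncpy(), the destination is terminated even when the source is
too long, and the caller can detect truncation from the return value.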
- Switch the smp_call_function*() @csd argument to call_single_data_t
type, which is a cache-line aligned typedef of the underlying struct
__call_single_data.
This ensures that the call data does not cross a cache-line boundary,
which avoids bouncing an extra cache line for the SMP function call.
- Prevent offlining of the last housekeeping CPU when CPU isolation is
active.
Offlining the last housekeeping CPU makes no sense in general, but also
caused the scheduler to panic due to the empty CPU mask when rebuilding
the scheduler domains.
- Remove an unused CPU hotplug state
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmU+vdYTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYocb3EAChhdVZCBm3UoDcrWtGuS7mMkTuLuUK
rheS9OtUt/uDEI0YZT5bD2R7KqdDVVNqbG1RLyICIWpQMvuMojZyu6fMCUjkONzS
iioun/2lZP4Q9TyAn0rdr9/GrPxb1/glYnLuz8ZJcY+jC63skNIDVfzJdhJ81/sx
t4BSswVsi75qZtBWWYFIzMQaJ1nUGJ5SZPYJV/WuQLf+pronoeWu+2VZHnaDqr3h
1N3oTQRbg0syPBg6trRuLEnn3384LYtdq7CHjeREX2jn2sU8yr+xzjKup5ShtSCR
7Amka/IlCTe2+FNS0F+6e3RGCH9Man1W593DqjUeIQT/Z0O2u4l9vNuVTv5GjX6C
fqDVd4hwVRd7/OGmaSPPY+pn9QK6B1WYU3BaAxACcGE6GaY69PU2jREnuCpt/pu9
Pg4xYDqClVwzvq3YYoU7YISya2TXjyJticxg3FtPUzrpVu0LIIq3IAcO7Nej+AzS
uSwhIRkqyT20CO/fRXhn5KQ2h66G6QNLPnEMtK/35K24Am7MGqwJd7wnGxKPu3RO
zAcRkQofouS0UcVbNY4UbV4vD6lpEAvy1RdxNPWt5DOIk5f83E176Yyc+vB8jAjG
YEM8ZnS3gFd7jvNC37rk9FfjlAIL9Z9QcrhtHJJ/h5y9sgCqzsV96B8c2KR3Ggs0
BQbaSJhdB89BwQ==
=tP7K
-----END PGP SIGNATURE-----
Merge tag 'smp-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull SMP and CPU hotplug updates from Thomas Gleixner:
- Switch the smp_call_function*() @csd argument to call_single_data_t
type, which is a cache-line aligned typedef of the underlying struct
__call_single_data.
This ensures that the call data does not cross a cache-line boundary,
which avoids bouncing an extra cache line for the SMP function call.
- Prevent offlining of the last housekeeping CPU when CPU isolation is
active.
Offlining the last housekeeping CPU makes no sense in general, but
also caused the scheduler to panic due to the empty CPU mask when
rebuilding the scheduler domains.
- Remove an unused CPU hotplug state
* tag 'smp-core-2023-10-29-v2' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
cpu/hotplug: Don't offline the last non-isolated CPU
cpu/hotplug: Remove unused cpuhp_state CPUHP_AP_X86_VDSO_VMA_ONLINE
smp: Change function signatures to use call_single_data_t
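The effect of a cache-line aligned typedef can be sketched in user
space. The example below is an illustration under stated assumptions:
struct demo_csd, demo_csd_t and demo_crosses_line() are hypothetical
names, the cache line is assumed to be 64 bytes, and the typedef is
assumed to be aligned to the size of the struct (a GCC/Clang
attribute). It is not the kernel's definition.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

#define DEMO_CACHE_LINE 64 /* assumed line size; real hardware varies */

/* Hypothetical stand-in for struct __call_single_data (four pointers). */
struct demo_csd {
	void *node;
	void (*func)(void *info);
	void *info;
	void *flags;
};

/* Align the typedef to the struct's own size (GCC/Clang attribute). */
typedef struct demo_csd demo_csd_t
	__attribute__((aligned(sizeof(struct demo_csd))));

/* Return 1 if [addr, addr + size) spans two cache lines, else 0. */
static int demo_crosses_line(uintptr_t addr, size_t size)
{
	return addr / DEMO_CACHE_LINE != (addr + size - 1) / DEMO_CACHE_LINE;
}

/* An object declared through the typedef picks up the alignment. */
static demo_csd_t demo_call_data;
```

When the object size is a power of two no larger than the cache line,
aligning it to its own size means it can never straddle two lines, so
declaring the argument with the aligned typedef carries that guarantee
in the type.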
- Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
- NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
- Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
- RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
- Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
- Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock guards
- Fix sched_setaffinity() vs. CPU hotplug race
- Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
- Misc cleanups & fixes
Signed-off-by: Ingo Molnar <mingo@kernel.org>
-----BEGIN PGP SIGNATURE-----
iQJFBAABCgAvFiEEBpT5eoXrXCwVQwEKEnMQ0APhK1gFAmU8/NoRHG1pbmdvQGtl
cm5lbC5vcmcACgkQEnMQ0APhK1gN+xAAvKGYNZBCBG4jowxccgqAbCx81KOhhsy/
KUaOmdLPg9WaXuqjZ5sggXQCMT0wUqBYAmqV7ts53VhWcma2I1ap4dCM6Jj+RLrc
vNwkeNetsikiZtarMoCJs5NahL8ULh3liBaoAkkToPjQ5r43aZ/eKwDovEdIKc+g
+Vgn7jUY8ssIrAOKT1midSwY1y8kAU2AzWOSFDTgedkJP4PgOu9/lBl9jSJ2sYaX
N4XqONYPXTwOHUtvmzkYILxLz0k0GgJ7hmt78E8Xy2rC4taGCRwCfCMBYxREuwiP
huo3O1P/iIe5svm4/EBUvcpvf44eAWTV+CD0dnJPwOc9IvFhpSzqSZZAsyy/JQKt
Lnzmc/xmyc1PnXCYJfHuXrw2/m+MyUHaegPzh5iLJFrlqa79GavOElj0jNTAMzbZ
39fybzPtuFP+64faRfu0BBlQZfORPBNc/oWMpPKqgP58YGuveKTWaUF5rl5lM7Ne
nm07uOmq02JVR8YzPl/FcfhU2dPMawWuMwUjEr2eU+lAunY3PF88vu0FALj7iOBd
66F8qrtpDHJanOxrdEUwSJ7hgw79qY1iw66Db7cQYjMazFKZONxArQPqFUZ0ngLI
n9hVa7brg1bAQKrQflqjcIAIbpVu3SjPEl15cKpAJTB/gn5H66TQgw8uQ6HfG+h2
GtOsn1nlvuk=
=GDqb
-----END PGP SIGNATURE-----
Merge tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Ingo Molnar:
"Fair scheduler (SCHED_OTHER) improvements:
- Remove the old and now unused SIS_PROP code & option
- Scan cluster before LLC in the wake-up path
- Use candidate prev/recent_used CPU if scanning failed for cluster
wakeup
NUMA scheduling improvements:
- Improve the VMA access-PID code to better skip/scan VMAs
- Extend tracing to cover VMA-skipping decisions
- Improve/fix the recently introduced sched_numa_find_nth_cpu() code
- Generalize numa_map_to_online_node()
Energy scheduling improvements:
- Remove the EM_MAX_COMPLEXITY limit
- Add tracepoints to track energy computation
- Make the behavior of the 'sched_energy_aware' sysctl more
consistent
- Consolidate and clean up access to a CPU's max compute capacity
- Fix uclamp code corner cases
RT scheduling improvements:
- Drive dl_rq->overloaded with dl_rq->pushable_dl_tasks updates
- Drive the ->rto_mask with rt_rq->pushable_tasks updates
Scheduler scalability improvements:
- Rate-limit updates to tg->load_avg
- On x86 disable IBRS when CPU is offline to improve single-threaded
performance
- Micro-optimize in_task() and in_interrupt()
- Micro-optimize the PSI code
- Avoid updating PSI triggers and ->rtpoll_total when there are no
state changes
Core scheduler infrastructure improvements:
- Use saved_state to reduce some spurious freezer wakeups
- Bring in a handful of fast-headers improvements to scheduler
headers
- Make the scheduler UAPI headers more widely usable by user-space
- Simplify the control flow of scheduler syscalls by using lock
guards
- Fix sched_setaffinity() vs. CPU hotplug race
Scheduler debuggability improvements:
- Disallow writing invalid values to sched_rt_period_us
- Fix a race in the rq-clock debugging code triggering warnings
- Fix a warning in the bandwidth distribution code
- Micro-optimize in_atomic_preempt_off() checks
- Enforce that the tasklist_lock is held in for_each_thread()
- Print the TGID in sched_show_task()
- Remove the /proc/sys/kernel/sched_child_runs_first sysctl
... and misc cleanups & fixes"
* tag 'sched-core-2023-10-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (82 commits)
sched/fair: Remove SIS_PROP
sched/fair: Use candidate prev/recent_used CPU if scanning failed for cluster wakeup
sched/fair: Scan cluster before scanning LLC in wake-up path
sched: Add cpus_share_resources API
sched/core: Fix RQCF_ACT_SKIP leak
sched/fair: Remove unused 'curr' argument from pick_next_entity()
sched/nohz: Update comments about NEWILB_KICK
sched/fair: Remove duplicate #include
sched/psi: Update poll => rtpoll in relevant comments
sched: Make PELT acronym definition searchable
sched: Fix stop_one_cpu_nowait() vs hotplug
sched/psi: Bail out early from irq time accounting
sched/topology: Rename 'DIE' domain to 'PKG'
sched/psi: Delete the 'update_total' function parameter from update_triggers()
sched/psi: Avoid updating PSI triggers and ->rtpoll_total when there are no state changes
sched/headers: Remove comment referring to rq::cpu_load, since this has been removed
sched/numa: Complete scanning of inactive VMAs when there is no alternative
sched/numa: Complete scanning of partial VMAs regardless of PID activity
sched/numa: Move up the access pid reset logic
sched/numa: Trace decisions related to skipping VMAs
...
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEE8rQSAMVO+zA4DBdWxWXV+ddtWDsFAmU/xAEACgkQxWXV+ddt
WDvYKg//SjTimA5Nins9mb4jdz8n+dDeZnQhKzy3FqInU41EzDRc4WwnEODmDlTa
AyU9rGB3k0JNSUc075jZFCyLqq/ARiOqRi4x33Gk0ckIlc4X5OgBoqP2XkPh0VlP
txskLCrmhc3pwyR4ErlFDX2jebIUXfkv39bJuE40grGvUatRe+WNq0ERIrgO8RAr
Rc3hBotMH8AIqfD1L6j1ZiZIAyrOkT1BJMuqeoq27/gJZn/MRhM9TCrMTzfWGaoW
SxPrQiCDEN3KECsOY/caroMn3AekDijg/ley1Nf7Z0N6oEV+n4VWWPBFE9HhRz83
9fIdvSbGjSJF6ekzTjcVXPAbcuKZFzeqOdBRMIW3TIUo7mZQyJTVkMsc1y/NL2Z3
9DhlRLIzvWJJjt1CEK0u18n5IU+dGngdktbhWWIuIlo8r+G/iKR/7zqU92VfWLHL
Z7/eh6HgH5zr2bm+yKORbrUjkv4IVhGVarW8D4aM+MCG0lFN2GaPcJCCUrp4n7rZ
PzpQbxXa38ANBk6hsp4ndS8TJSBL9moY8tumzLcKg97nzNMV6KpBdV/G6/QfRLCN
3kM6UbwTAkMwGcQS86Mqx6s04ORLnQeD6f7N6X4Ppx0Mi/zkjI2HkRuvQGp12B0v
iZjCCZAYY2Iu+/TU0GrCXSss/grzIAUPzM9msyV3XGO/VBpwdec=
=9TVx
-----END PGP SIGNATURE-----
Merge tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs updates from David Sterba:
"New features:
- raid-stripe-tree
New tree for logical file extent mapping where the physical mapping
may not match on multiple devices. This is now used in zoned mode
to implement RAID0/RAID1* profiles, but can be used in non-zoned
mode as well. The support for RAID56 is in development and will
eventually fix the problems with the current implementation. This
is a backward incompatible feature and has to be enabled at mkfs
time.
- simple quota accounting (squota)
A simplified mode of qgroup that accounts all space to the initial
extent owner (a subvolume), so snapshots are cheap to create and
delete. The deletion of snapshots in fully accounting qgroups is a
known CPU/IO performance bottleneck.
The squota is not suitable for the general use case but works well
for containers where the original subvolume exists for the whole
time. This is a backward incompatible feature as it needs extending
some structures, but can be enabled on an existing filesystem.
- temporary filesystem fsid (temp_fsid)
The fsid identifies a filesystem and is hard coded in the
structures, which disallows mounting the same fsid found on
different devices.
For a single device filesystem this is not strictly necessary; a
new temporary fsid can be generated on mount, e.g. after a device is
cloned. This will be used by Steam Deck for root partition A/B
testing, or can be used for VM root images.
Other user visible changes:
- filesystems with partially finished metadata_uuid conversion cannot
be mounted anymore and the uuid fixup has to be done by btrfs-progs
(btrfstune).
Performance improvements:
- reduce reservations for checksum deletions (by a factor of 4 with
the free space tree enabled); on a sample workload on a file with
many extents the deletion time decreased by 12%
- make extent state merges more efficient during insertions, reduce
rb-tree iterations (run time of critical functions reduced by 5%)
Core changes:
- the integrity check functionality has been removed; this was a
debugging feature and its removal does not affect other integrity
checks like checksums or tree-checker
- space reservation changes:
- more efficient delayed ref reservations, this avoids building up
too much work or overusing or exhausting the global block
reserve in some situations
- move delayed refs reservation to the transaction start time,
this prevents some ENOSPC corner cases related to exhaustion of
global reserve
- improvements in reducing excessive reservations for block group
items
- adjust overcommit logic in near full situations, account for one
more chunk to eventually allocate metadata chunk, this is mostly
relevant for small filesystems (<10GiB)
- single device filesystems are scanned but not registered (except
seed devices), this allows temp_fsid to work
- qgroup iterations do not need GFP_ATOMIC allocations anymore
- cleanups, refactoring, reduced data structure size, function
parameter simplifications, error handling fixes"
* tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits)
btrfs: open code timespec64 in struct btrfs_inode
btrfs: remove redundant log root tree index assignment during log sync
btrfs: remove redundant initialization of variable dirty in btrfs_update_time()
btrfs: sysfs: show temp_fsid feature
btrfs: disable the device add feature for temp-fsid
btrfs: disable the seed feature for temp-fsid
btrfs: update comment for temp-fsid, fsid, and metadata_uuid
btrfs: remove pointless empty log context list check when syncing log
btrfs: update comment for struct btrfs_inode::lock
btrfs: remove pointless barrier from btrfs_sync_file()
btrfs: add and use helpers for reading and writing last_trans_committed
btrfs: add and use helpers for reading and writing fs_info->generation
btrfs: add and use helpers for reading and writing log_transid
btrfs: add and use helpers for reading and writing last_log_commit
btrfs: support cloned-device mount capability
btrfs: add helper function find_fsid_by_disk
btrfs: stop reserving excessive space for block group item insertions
btrfs: stop reserving excessive space for block group item updates
btrfs: reorder btrfs_inode to fill gaps
btrfs: open code btrfs_ordered_inode_tree in btrfs_inode
...
This release completes the SunRPC thread scheduler work that was
begun in v6.6. The scheduler can now find an svc thread to wake in
constant time and without a list walk. Thanks again to Neil Brown
for this overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based
NFSD control plane. The long-term plan is to provide the same
functionality as found in /proc/fs/nfsd, plus some interesting
additions, and then migrate the NFSD user space utilities to
netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was
applied in this release. The goals are to bring this family of
encoding functions in line with the matching NFSv4 decoding
functions and with the NFSv2 and NFSv3 XDR functions, preparing
the way for better memory safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback,
enabling the server to retrieve cached size and mtime data from
clients holding write delegations. If the server can retrieve
this information, it does not have to recall the delegation in
some cases.
The usual panoply of bug fixes and minor improvements round out
this release. As always I am grateful to all contributors,
reviewers, and testers.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEEKLLlsBKG3yQ88j7+M2qzM29mf5cFAmU5IuoACgkQM2qzM29m
f5eVsg//bVp8S93ci/oDlKfzOwH2fO5e5rna91wrDpJxkd51h6KTx55dSRG5sjAZ
EywIVOann6xCtsixAPyff5Cweg2dWvzQRsy1ZnvWQ1qZBzD5KAJY5LPkeSFUCKBo
Zani/qTOYbxzgFMjZx+yDSXDPKG68WYZBQK59SI7mURu4SYdk8aRyNY8mjHfr0Vh
Aqrcny4oVtXV4sL5P5G/2FUW7WKT3olA3jSYlRRNMhbs2qpEemRCCrspOEMMad+b
t1+ZCg+U27PMranvOJnof4RU7peZbaxDWA0gyiUbivVXVtZn9uOs0ffhktkvechL
ePc33dqdp2ITdKIPA6JlaRv5WflKXQw0YYM9Kv5mcR4A2el7owL4f/pMlPhtbYwJ
IOJv15KdKVN979G2e6WMYiKK+iHfaUUguhMEXnfnGoAajHOZNQiUEo3iFQAD7LDc
DvMF8d9QqYmB9IW8FOYaRRfZGJOQHf3TL79Nd08z/bn5swvlvfj77leux9Sb+0/m
Luk2Xvz2AJVSXE31wzabaGHkizN+BtH+e4MMbXUHBPW5jE9v7XOnEUFr4UdZyr9P
Gl87A7NcrzNjJWT5TrnzM4sOslNsx46Aeg+VuNt2fSRn2dm6iBu2B8s0N4imx6dV
PX1y9VSLq5WRhjrFZ1qeiZdsuTaQtrEiNDoRIQR6nCJPAV80iFk=
=B4wJ
-----END PGP SIGNATURE-----
Merge tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux
Pull nfsd updates from Chuck Lever:
"This release completes the SunRPC thread scheduler work that was begun
in v6.6. The scheduler can now find an svc thread to wake in constant
time and without a list walk. Thanks again to Neil Brown for this
overhaul.
Lorenzo Bianconi contributed infrastructure for a netlink-based NFSD
control plane. The long-term plan is to provide the same functionality
as found in /proc/fs/nfsd, plus some interesting additions, and then
migrate the NFSD user space utilities to netlink.
A long series to overhaul NFSD's NFSv4 operation encoding was applied
in this release. The goals are to bring this family of encoding
functions in line with the matching NFSv4 decoding functions and with
the NFSv2 and NFSv3 XDR functions, preparing the way for better memory
safety and maintainability.
A further improvement to NFSD's write delegation support was
contributed by Dai Ngo. This adds a CB_GETATTR callback, enabling the
server to retrieve cached size and mtime data from clients holding
write delegations. If the server can retrieve this information, it
does not have to recall the delegation in some cases.
The usual panoply of bug fixes and minor improvements round out this
release. As always I am grateful to all contributors, reviewers, and
testers"
* tag 'nfsd-6.7' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (127 commits)
svcrdma: Fix tracepoint printk format
svcrdma: Drop connection after an RDMA Read error
NFSD: clean up alloc_init_deleg()
NFSD: Fix frame size warning in svc_export_parse()
NFSD: Rewrite synopsis of nfsd_percpu_counters_init()
nfsd: Clean up errors in nfs3proc.c
nfsd: Clean up errors in nfs4state.c
NFSD: Clean up errors in stats.c
NFSD: simplify error paths in nfsd_svc()
NFSD: Clean up nfsd4_encode_seek()
NFSD: Clean up nfsd4_encode_offset_status()
NFSD: Clean up nfsd4_encode_copy_notify()
NFSD: Clean up nfsd4_encode_copy()
NFSD: Clean up nfsd4_encode_test_stateid()
NFSD: Clean up nfsd4_encode_exchange_id()
NFSD: Clean up nfsd4_do_encode_secinfo()
NFSD: Clean up nfsd4_encode_access()
NFSD: Clean up nfsd4_encode_readdir()
NFSD: Clean up nfsd4_encode_entry4()
NFSD: Add an nfsd4_encode_nfs_cookie4() helper
...
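One general way to find a thread to wake in constant time, without
walking a list of all threads, is to keep an explicit list of idle
threads and take from its head. The single-threaded sketch below shows
only that idea; the demo_* names are hypothetical, no locking or
lockless primitives are shown, and it is not the SunRPC code.

```c
#include <stddef.h>

/* Hypothetical worker thread record. */
struct demo_thread {
	int id;
	struct demo_thread *next_idle; /* link used only while idle */
};

/* Hypothetical pool that tracks idle threads on a LIFO list. */
struct demo_pool {
	struct demo_thread *idle_head;
};

/* A thread with no work pushes itself onto the idle list: O(1). */
static void demo_mark_idle(struct demo_pool *pool, struct demo_thread *t)
{
	t->next_idle = pool->idle_head;
	pool->idle_head = t;
}

/* Pick a thread to wake by popping the list head: O(1), no walk. */
static struct demo_thread *demo_take_idle(struct demo_pool *pool)
{
	struct demo_thread *t = pool->idle_head;

	if (t) {
		pool->idle_head = t->next_idle;
		t->next_idle = NULL;
	}
	return t; /* NULL means every thread is busy */
}
```

The cost of choosing a thread does not depend on how many threads the
pool has, in contrast to scanning every thread to look for an idle one.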
Add nr_split to trace_mm_migrate_pages for large folio (including THP)
split events.
[akpm@linux-foundation.org: cleanup per Huang, Ying]
Link: https://lkml.kernel.org/r/20231017163129.2025214-2-zi.yan@sent.com
Signed-off-by: Zi Yan <ziy@nvidia.com>
Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Huang Ying <ying.huang@intel.com>
Cc: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
-----BEGIN PGP SIGNATURE-----
iQFSBAABCAA8FiEEq68RxlopcLEwq+PEeb4+QwBBGIYFAmU1ngkeHHRvcnZhbGRz
QGxpbnV4LWZvdW5kYXRpb24ub3JnAAoJEHm+PkMAQRiGrsIH/0k/+gdBBYFFdEym
foRhKir9WV3ZX4oIozJjA1f7T+qVYclKs6kaYm3gNepRBb6AoG8pdgv4MMAqhYsf
QMe2XHi0MrO/qKBgfNfivxEa9jq+0QK5uvTbqCRqCAB8LfwVyDqapCmg3EuiZcPW
UbMITmnwLIfXgPxvp9rabmCsTqO6FLbf0GDOVIkNSAIDBXMpcO1iffjrWUbhRa7n
oIoiJmWJLcXLxPWDsRKbpJwzw2cIG08YhfQYAiQnC3YaeRm1FKLDIICRBsmfYzja
rWv9r4dn4TDfV4/AnjggQnsZvz2yPCxNaFSQIT88nIeiLvyuUTJ9j8aidsSfMZQf
xZAbzbA=
=NoQv
-----END PGP SIGNATURE-----
BackMerge tag 'v6.6-rc7' into drm-next
This is needed to add the msm PR, which is based on a higher base.
Signed-off-by: Dave Airlie <airlied@redhat.com>