Ideally, read the present DDR memory controller count first and then
allocate the skx_dev list using this count. However, this approach
requires adding a significant amount of code similar to
skx_get_all_bus_mappings() to obtain the PCI bus mappings for the first
socket and use these mappings along with the related PCI register offset
to read the memory controller count.
Given that the Granite Rapids CPU is the only one that can detect the
count of memory controllers at runtime (other CPUs use the count in the
configuration data), to reduce code complexity, reallocate the skx_dev
list only if the preconfigured count of DDR memory controllers differs
from the count read at runtime for Granite Rapids CPU.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-7-qiuxu.zhuo@intel.com
The following upper bound check for the memory controller physical index
decoded by ADXL is the only place where use the macro 'NUM_IMC' is used:
res->imc > NUM_IMC - 1
Since this check is already covered by skx_get_mc_mapping(), meaning no
memory controller logical index exists for an invalid memory controller
physical index decoded by ADXL, remove the redundant upper bound check
so that the definition for 'NUM_IMC' can be cleaned up (in another patch).
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-6-qiuxu.zhuo@intel.com
The current skx->imc[NUM_IMC] array of memory controller instances is
sized using the macro NUM_IMC. Each time EDAC support is added for a
new CPU, NUM_IMC needs to be updated to ensure it is greater than or
equal to the number of memory controllers for the new CPU. This approach
is inconvenient and results in memory waste for older CPUs with fewer
memory controllers.
To address this, make skx->imc[] a flexible array and determine its size
from configuration data or at runtime.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-5-qiuxu.zhuo@intel.com
The current mapping of memory controller indices is from physical index [1]
to logical index [2], as show below:
skx_dev->imc[pmc].mc_mapping = lmc
Since skx_dev->imc[] is an array of present memory controller instances,
mapping memory controller indices from logical index to physical index,
as show below, is more reasonable. This is also a preparatory step for
making skx_dev->imc[] a flexible array.
skx_dev->imc[lmc].mc_mapping = pmc
Both mappings are equivalent. No functional changes intended.
[1] Indices for memory controllers include both those present to the
OS and those disabled by BIOS.
[2] Indices for memory controllers present to the OS.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-4-qiuxu.zhuo@intel.com
The mc_mapping and imc fields of struct skx_dev have the same size,
NUM_IMC. Move mc_mapping to be a field inside struct skx_imc to prepare
for making the imc array of memory controller instances a flexible array.
No functional changes intended.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-3-qiuxu.zhuo@intel.com
Use model-specific configuration data for the number of memory controllers
per socket, channels per memory controller, and DIMMs per channel as
intended, instead of relying on global macros for maximum values.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250731145534.2759334-2-qiuxu.zhuo@intel.com
When loading the i10nm_edac driver on some Intel Granite Rapids servers,
a call trace may appear as follows:
UBSAN: shift-out-of-bounds in drivers/edac/skx_common.c:453:16
shift exponent -66 is negative
...
__ubsan_handle_shift_out_of_bounds+0x1e3/0x390
skx_get_dimm_info.cold+0x47/0xd40 [skx_edac_common]
i10nm_get_dimm_config+0x23e/0x390 [i10nm_edac]
skx_register_mci+0x159/0x220 [skx_edac_common]
i10nm_init+0xcb0/0x1ff0 [i10nm_edac]
...
This occurs because some BIOS may disable a memory controller if there
aren't any memory DIMMs populated on this memory controller. The DIMMMTR
register of this disabled memory controller contains the invalid value
~0, resulting in the call trace above.
Fix this call trace by skipping DIMM enumeration on a disabled memory
controller.
Fixes: ba987eaaab ("EDAC/i10nm: Add Intel Granite Rapids server support")
Reported-by: Jose Jesus Ambriz Meza <jose.jesus.ambriz.meza@intel.com>
Reported-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Closes: https://lore.kernel.org/all/20250730063155.2612379-1-acelan.kao@canonical.com/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Chia-Lin Kao (AceLan) <acelan.kao@canonical.com>
Link: https://lore.kernel.org/r/20250806065707.3533345-1-qiuxu.zhuo@intel.com
The driver is designed to support error detection and reporting for
Cortex A72 cores, specifically within their L1 and L2 cache systems.
The errors are detected by reading CPU/L2 memory error syndrome
registers.
Unfortunately there is no robust way to inject errors into the caches,
so this driver doesn't contain any code to actually test it. It has
been tested though with code taken from an older version [1] of this
driver. For reasons stated in thread [1], the error injection code is
not suitable for mainline, so it is removed from the driver.
[1] https://lore.kernel.org/all/1521073067-24348-1-git-send-email-york.sun@nxp.com/#t
[ bp: minor touchups. ]
Signed-off-by: Sascha Hauer <s.hauer@pengutronix.de>
Co-developed-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Vijay Balakrishna <vijayb@linux.microsoft.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/1752714390-27389-2-git-send-email-vijayb@linux.microsoft.com
- switch to use scnprintf()
- Add Granite Rapids-D support
- synopsys: Make sure ECC error and counter registers are cleared during
init/probing to avoid reporting stale errors
- igen6: Add Wildcat Lake SoCs support
- Make sure scrub features sysfs attributes are initialized properly
- Allocate memory repair sysfs attributes statically to reduce stack
usage
- Fix DIMM module size computation for DIMMs with total capacity which
is a non power-of-two number, in amd64_edac
- Do not be too dramatic when reporting disabled memory controllers in
igen6_edac
- Add support to ie31200_edac for the following SoCs:
- Core i5-14[67]00
- Bartless Lake-S SoCs
- Raptor Lake-HX
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmiHeDcACgkQEsHwGGHe
VUrnzhAAryFKu8xWuwOE3eGaMW6oJhjKF8wPxLiCxxi6ZdQ/1uudFVnzwgozmkXo
l10h41A3yc1ZdJqdqn54gF8PxbQ0E1MvbXfmBqZ/U+V+dv6zMwu9TygoPRIJ60ST
aIxTBq2zoSii7ucGCBjbqClMTF3ZcH/Q2FzZoFbZyZd84snWSz0B9+S+937mtMhl
9Y55sAgQuigQDQ71YZymAGyWi9E9J20wFk76vIHEboRIa5sS0iCU88Wb4PT+5iKf
Qc/1gyqnd+6FO9O9ddrYpeDcaIicLShuGVNZNlJalD/JyTIOcP6XdEDa5J7TYp27
7IcmfHSYmZ5eL0vrJfrIwbauEpRL9ZjWXS+uQjj8/K/gkPUsH/Sdldgldkd50GHV
6L79XSzpy4yhlAr3BXU0o917qRVWOpbxr9E7l6VAFGBpLl5ewtZiV3W7/Su4rPd2
zpUGBZvjxO8jmNQn49IPs/XotVQ2L+mT+KSxUMZAO2pV+dztSJELMFQQC0uAXiZc
ApcrSkQxa4fsxU2Ukc1dLOJNkwxEC1ECcPsl2I9EE1cFoix7NP2E+G92D/V52VoZ
QeVkxM7LHZCTH9tH1nrCZ+WJr8S2vZ+uY8jRl42P12xU4kcd3RWEtna18bX5oe++
RlgchnXwutEPSgHYZVPocuaDD7C6eIvYzpaVezVl9dgbRLLx8u4=
=PBTf
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- i10nm:
- switch to using scnprintf()
- Add Granite Rapids-D support
- synopsys: Make sure ECC error and counter registers are cleared
during init/probing to avoid reporting stale errors
- igen6: Add Wildcat Lake SoCs support
- Make sure scrub features sysfs attributes are initialized properly
- Allocate memory repair sysfs attributes statically to reduce stack
usage
- Fix DIMM module size computation for DIMMs with total capacity which
is a non power-of-two number, in amd64_edac
- Do not be too dramatic when reporting disabled memory controllers in
igen6_edac
- Add support to ie31200_edac for the following SoCs:
- Core i5-14[67]00
- Bartless Lake-S SoCs
- Raptor Lake-HX
* tag 'edac_updates_for_v6.17_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/{skx_common,i10nm}: Use scnprintf() for safer buffer handling
EDAC/synopsys: Clear the ECC counters on init
EDAC/ie31200: Add Intel Raptor Lake-HX SoCs support
EDAC/igen6: Add Intel Wildcat Lake SoCs support
EDAC/i10nm: Add Intel Granite Rapids-D support
EDAC/mem_repair: Reduce stack usage in edac_mem_repair_get_desc()
EDAC/igen6: Reduce log level to debug for absent memory controllers
EDAC/ie31200: Document which CPUs correspond to each Raptor Lake-S device ID
EDAC/ie31200: Enable support for Core i5-14600 and i7-14700
ie31200/EDAC: Add Intel Bartlett Lake-S SoCs support
snprintf() is fragile when its return value will be used to append
additional data to a buffer. Use scnprintf() instead.
Signed-off-by: Wang Haoran <haoranwangsec@gmail.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250715131700.1092720-1-haoranwangsec@gmail.com
Clear the ECC error and counter registers during initialization/probe to avoid
reporting stale errors that may have occurred before EDAC registration.
For that, unify the Zynq and ZynqMP ECC state reading paths and simplify the
code.
[ bp: Massage commit message.
Fix an -Wsometimes-uninitialized warning as reported by
Reported-by: kernel test robot <lkp@intel.com>
Closes: https://lore.kernel.org/oe-kbuild-all/202507141048.obUv3ZUm-lkp@intel.com ]
Signed-off-by: Shubhrajyoti Datta <shubhrajyoti.datta@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250713050753.7042-1-shubhrajyoti.datta@amd.com
Intel Raptor Lake-HX SoC shares the same memory controller registers
as Raptor Lake-S SoC. Add a compute die ID for Raptor Lake-HX SoCs with
Out-of-Band ECC capability for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Laurens SEGHERS <laurens@rale.com>
Link: https://lore.kernel.org/r/20250704151609.7833-4-qiuxu.zhuo@intel.com
Intel Wildcat Lake is a mobile derivative of Panther Lake with one
memory controller. Wildcat Lake SoCs share the same IBECC registers
with Meteor Lake-P SoCs.
Add a compute die ID and a new configuration structure for Wildcat
Lake SoCs with In-Band ECC capability for EDAC support.
Signed-off-by: Lili Li <lili.li@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-3-qiuxu.zhuo@intel.com
The Granite Rapids-D CPU model uses memory controller registers similar
to those of the Granite Rapids server CPU but with a different memory
controller MMIO base.
Add the Granite Rapids-D CPU model ID and use the new memory controller
MMIO base for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: VikasX Chougule <vikasx.chougule@intel.com>
Link: https://lore.kernel.org/r/20250704151609.7833-2-qiuxu.zhuo@intel.com
Constructing an array on the stack adds complexity and can exceed the
warning limit for per-function stack usage:
drivers/edac/mem_repair.c:361:5: error: stack frame size (1296) exceeds
limit (1280) in 'edac_mem_repair_get_desc' [-Werror,-Wframe-larger-than]
Change this to have the actual attribute array allocated statically and then
just add the instance number on the per-instance copy.
Fixes: 699ea5219c ("EDAC: Add a memory repair control feature")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/20250620114135.4017183-1-arnd@kernel.org
Each Chip-Select (CS) of a Unified Memory Controller (UMC) on AMD Zen-based
SOCs has an Address Mask and a Secondary Address Mask register associated with
it. The amd64_edac module logs DIMM sizes on a per-UMC per-CS granularity
during init using these two registers.
Currently, the module primarily considers only the Address Mask register for
computing DIMM sizes. The Secondary Address Mask register is only considered
for odd CS. Additionally, if it has been considered, the Address Mask register
is ignored altogether for that CS. For power-of-two DIMMs i.e. DIMMs whose
total capacity is a power of two (32GB, 64GB, etc), this is not an issue
since only the Address Mask register is used.
For non-power-of-two DIMMs i.e., DIMMs whose total capacity is not a power of
two (48GB, 96GB, etc), however, the Secondary Address Mask register is used
in conjunction with the Address Mask register. However, since the module only
considers either of the two registers for a CS, the size computed by the
module is incorrect. The Secondary Address Mask register is not considered for
even CS, and the Address Mask register is not considered for odd CS.
Introduce a new helper function so that both Address Mask and Secondary
Address Mask registers are considered, when valid, for computing DIMM sizes.
Furthermore, also rename some variables for greater clarity.
Fixes: 81f5090db8 ("EDAC/amd64: Support asymmetric dual-rank DIMMs")
Closes: https://lore.kernel.org/dbec22b6-00f2-498b-b70d-ab6f8a5ec87e@natrix.lt
Reported-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Tested-by: Žilvinas Žaltiena <zilvinas@natrix.lt>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/20250529205013.403450-1-avadhut.naik@amd.com
The current KERN_WARNING level message for detecting absent memory
controllers is overly dramatic. The BIOS likely had valid reasons to
disable the memory controller (e.g. it isn't connected to any DIMM
slots on the motherboard for this system). So there's nothing actually
wrong that needs to be fixed.
Reduce the log level to KERN_DEBUG to eliminate the false warning.
Suggested-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250618162307.1523736-2-qiuxu.zhuo@intel.com
Device ID '0xa740' is shared by i7-14700, i7-14700K, and i7-14700T.
Device ID '0xa704' is shared by i5-14600, i5-14600K, and i5-14600T.
Tested locally on my i7-14700K.
Signed-off-by: George Gaidarov <gdgaidarov+lkml@gmail.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250529162933.1228735-1-gdgaidarov+lkml@gmail.com
Bartlett Lake-S is a derivative of Raptor Lake-S and is optimized for
IoT/Edge applications. It shares the same memory controller registers
as Raptor Lake-S. Add compute die IDs of Bartlett Lake-S and reuse the
configuration data of Raptor Lake-S for Bartlett Lake-S EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250502013900.343430-1-qiuxu.zhuo@intel.com
A kernel panic was reported with the following kernel log:
EDAC igen6: Expected 2 mcs, but only 1 detected.
BUG: unable to handle page fault for address: 000000000000d570
...
Hardware name: Notebook V54x_6x_TU/V54x_6x_TU, BIOS Dasharo (coreboot+UEFI) v0.9.0 07/17/2024
RIP: e030:ecclog_handler+0x7e/0xf0 [igen6_edac]
...
igen6_probe+0x2a0/0x343 [igen6_edac]
...
igen6_init+0xc5/0xff0 [igen6_edac]
...
This issue occurred because one memory controller was disabled by
the BIOS but the igen6_edac driver still checked all the memory
controllers, including this absent one, to identify the source of
the error. Accessing the null MMIO for the absent memory controller
resulted in the oops above.
Fix this issue by reverting the configuration structure to non-const
and updating the field 'res_cfg->num_imc' to reflect the number of
detected memory controllers.
Fixes: 20e190b1c1 ("EDAC/igen6: Skip absent memory controllers")
Reported-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Closes: https://lore.kernel.org/all/aFFN7RlXkaK_loQb@mail-itl/
Suggested-by: Borislav Petkov <bp@alien8.de>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Marek Marczykowski-Górecki <marmarek@invisiblethingslab.com>
Link: https://lore.kernel.org/r/20250618162307.1523736-1-qiuxu.zhuo@intel.com
AMD's Family 19h-based Models 70h-7fh support 4 unified memory controllers
(UMC) per processor die.
The amd64_edac driver, however, assumes only 2 UMCs are supported since
max_mcs variable for the models has not been explicitly set to 4. The same
results in incomplete or incorrect memory information being logged to dmesg by
the module during initialization in some instances.
Fixes: 6c79e42169 ("EDAC/amd64: Add support for ECC on family 19h model 60h-7Fh")
Closes: https://lore.kernel.org/all/27dc093f-ce27-4c71-9e81-786150a040b6@reox.at/
Reported-by: reox <mailinglist@reox.at>
Signed-off-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250613005233.2330627-1-avadhut.naik@amd.com
- Remove always true condition in cxl features code.
- Add verification of CHBS length for CXL 2.0
- Ignore interleave granularity when interleave ways is 1
- Add update addressing mising MODULE_DESCRIPTION for cxl_test
- A series of cleanups/refactor to prep for AMD Zen5 translate code
- Clean %pa debug printk in core/hdm.c
- Documentation updates
- Update to CXL Maturity Map
- Fixes to source linking in CXL documentation
- CXL documentation fixes, spelling corrections
- A large collection of CXL documentation for the entire CXL subsystem, including
documentation on CXL related platform and firmware notes
- Remove redundant code of cxlctl_get_supported_features()
- Series to support CXL RAS Features
- Including "Patrol Scrub Control", "Error Check Scrub", "Performance Maitenance"
and "Memory Sparing". The series connects CXL to EDAC.
-----BEGIN PGP SIGNATURE-----
iQIzBAABCAAdFiEE5DAy15EJMCV1R6v9YGjFFmlTOEoFAmg/EX0ACgkQYGjFFmlT
OErkxA/+MuvYH6PjdSwbJiJUcLCrq0gdNczX7t+CAo3v4rNs5CrOJSKuXMxGVIpf
lCjwS/J7j4XOa9DigDO1Dxl4PWG/R0K3HjbHQJLVjy3jmsVr5GgJVt4s5EqS78Or
QoM9d/2dq8Q6dqk89Z4rFY2JlAmXcibe+lz2m9k5vy8KPQvrZI1KruZMG0qN1rWC
SBa+eUWW49MP3Ab6pBDRRCI7EPcJ44QF+49SWXrkkiJjll/OTtYu3V1JymaPV4zT
/UM/CwHLnmb5odUfOx5EZJcZIZzqasBD28Xu6Y6Vjs2pgTNPr9VNGCs+8lLwQCg7
1O8cyPjPa1p5HwKo2INJfM1Xdpo1Nqar1qGcSPVJKk0+a538i07YuXR3uWqnJ+mO
uplJwvtL1Rvg9h0C0fHwfB86Tl5poFIDn0zeZQ4tqMWH6y2wged5PcS5RSVuVYdX
CHEzAvp1RreQvXd9KcSo6ITKbIvv5PplfcyTiftG0R71ewXYOpB386D7ihh8ZvvO
Y6TJN1nXUv8w3ve7rbs3T2ncX+BbhHRKSnXTp3rbkXTQLF5t+2gjzVuwSSEH71Ps
4bD+7EeE0JGSz76qLbfQPNt6l3HnG6ctycpAydsUs/YmIOnciKpT5R9OGwncKHrT
/Ccx+9uN5+CizqsKCi+rxadoOw7Pk4bKmo0a4wCZ5sDPGix0g18=
=evKj
-----END PGP SIGNATURE-----
Merge tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl
Pull Compute Express Link (CXL) updates from Dave Jiang:
- Remove always true condition in cxl features code
- Add verification of CHBS length for CXL 2.0
- Ignore interleave granularity when interleave ways is 1
- Add update addressing mising MODULE_DESCRIPTION for cxl_test
- A series of cleanups/refactor to prep for AMD Zen5 translate code
- Clean %pa debug printk in core/hdm.c
- Documentation updates:
- Update to CXL Maturity Map
- Fixes to source linking in CXL documentation
- CXL documentation fixes, spelling corrections
- A large collection of CXL documentation for the entire CXL
subsystem, including documentation on CXL related platform and
firmware notes
- Remove redundant code of cxlctl_get_supported_features()
- Series to support CXL RAS Features
- Including "Patrol Scrub Control", "Error Check Scrub",
"Performance Maitenance" and "Memory Sparing". The series
connects CXL to EDAC.
* tag 'cxl-for-6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (53 commits)
cxl/edac: Add CXL memory device soft PPR control feature
cxl/edac: Add CXL memory device memory sparing control feature
cxl/edac: Support for finding memory operation attributes from the current boot
cxl/edac: Add support for PERFORM_MAINTENANCE command
cxl/edac: Add CXL memory device ECS control feature
cxl/edac: Add CXL memory device patrol scrub control feature
cxl: Update prototype of function get_support_feature_info()
EDAC: Update documentation for the CXL memory patrol scrub control feature
cxl/features: Remove the inline specifier from to_cxlfs()
cxl/feature: Remove redundant code of get supported features
docs: ABI: Fix "firwmare" to "firmware"
cxl/Documentation: Fix typo in sysfs write_bandwidth attribute path
cxl: doc/linux/access-coordinates Update access coordinates calculation methods
cxl: docs/platform/acpi/srat Add generic target documentation
cxl: docs/platform/cdat reference documentation
Documentation: Update the CXL Maturity Map
cxl: Sync up the driver-api/cxl documentation
cxl: docs - add self-referencing cross-links
cxl: docs/allocation/hugepages
cxl: docs/allocation/reclaim
...
On the SoCFPGA platform, the INTTEST register supports only 16-bit writes.
A 32-bit write triggers an SError to the CPU so do 16-bit accesses only.
[ bp: AI-massage the commit message. ]
Fixes: c7b4be8db8 ("EDAC, altera: Add Arria10 OCRAM ECC support")
Signed-off-by: Niravkumar L Rabara <niravkumar.l.rabara@intel.com>
Signed-off-by: Matthew Gerlach <matthew.gerlach@altera.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Dinh Nguyen <dinguyen@kernel.org>
Cc: stable@kernel.org
Link: https://lore.kernel.org/20250527145707.25458-1-matthew.gerlach@altera.com
- Rework how RRL registers per channel tracking is done in order to
support newer hardware with different RRL configurations and refactor
that code. Add support for Granite Rapids server
- i10nm: explicitly set RRL modes to fix any wrong BIOS programming
- Properly save and restore Retry Read error Log channel configuration
info on Intel drivers
- igen6: Handle correctly the case of fused off memory controllers on
Arizona Beach and Amston Lake SoCs before adding support for them
- the usual set of fixes and cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmgzfuAACgkQEsHwGGHe
VUrYuw/+KE1MD+zGPdTTsY5ZCMli+eKclGKqnr1rMy/5RUWPRvHoBBVc7fGiVeMU
b0EsD5UEGW4Pzecm4w1rQ+FCFkOG2wHzA2+pbWI0dTE9ZxXjAFtMYSkKCqBOzR0C
AwdvvYYK6IZlykh/+h5v7h8++4WPbnQQKfsiCu5w3JlHrgpvj/NQnUFCCNCFzavO
aoQd11UuAwm4tUeshoRLkz+sD6g3+uo9vBiOpMITdBl4ESxA16gJF+7uWUpJB1B7
M7G2P0c5lhRv704M1AhUSVr9i46F7IYEeoexTTdc1Obb39IYKSl/vRdhatM3V1dy
dbFMnuXBRiBuHtt6WkoU7oj6OtQQTU/uJ3GWGl1pVoKNVVw46qM7gsrX9sWk/JGW
Sd5pGpiWVWe0RXdDchWCywlWB3xC9NrGjP+ivgCtyO9z0kLAjtwOPFdBGK2byg4t
aKkyHCsAjOayCT2uN/7nuPOjjr8Zw2KTgM+g2oDp9oK7TSb70keqDqShgDeB/PKH
cYbWJPu9m97j1HICMLWBzHHFn4KxByIZbRfWqbcYV8Ufno7fuali6CuV7QCmYyZU
mqMi4xQ0hfIHgmkamHBWWegyvFtBV7/r6nJjvEz631aME9tPveG/fl8wjABLaKjq
Q8klRRmC1sdnJssl4KJY7n0DedzrRLfmThRg6gzvF7hVevzK6Bg=
=GCi0
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- ie31200: Add support for Raptor Lake-S and Alder Lake-S compute dies
- Rework how RRL registers per channel tracking is done in order to
support newer hardware with different RRL configurations and refactor
that code. Add support for Granite Rapids server
- i10nm: explicitly set RRL modes to fix any wrong BIOS programming
- Properly save and restore Retry Read error Log channel configuration
info on Intel drivers
- igen6: Handle correctly the case of fused off memory controllers on
Arizona Beach and Amston Lake SoCs before adding support for them
- the usual set of fixes and cleanups
* tag 'edac_updates_for_v6.16' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/bluefield: Don't use bluefield_edac_readl() result on error
EDAC/i10nm: Fix the bitwise operation between variables of different sizes
EDAC/ie31200: Add two Intel SoCs for EDAC support
EDAC/{skx_common,i10nm}: Add RRL support for Intel Granite Rapids server
EDAC/{skx_common,i10nm}: Refactor show_retry_rd_err_log()
EDAC/{skx_common,i10nm}: Refactor enable_retry_rd_err_log()
EDAC/{skx_common,i10nm}: Structure the per-channel RRL registers
EDAC/i10nm: Explicitly set the modes of the RRL register sets
EDAC/{skx_common,i10nm}: Fix the loss of saved RRL for HBM pseudo channel 0
EDAC/skx_common: Fix general protection fault
EDAC/igen6: Add Intel Amston Lake SoCs support
EDAC/igen6: Add Intel Arizona Beach SoCs support
EDAC/igen6: Skip absent memory controllers
- Consolidate on one set of functions for the interrupt domain code to
get rid of pointlessly duplicated code with only marginal different
semantics.
- Update the documentation accordingly and consolidate the coding style
of the irqdomain header.
-----BEGIN PGP SIGNATURE-----
iQJHBAABCgAxFiEEQp8+kY+LLUocC4bMphj1TA10mKEFAmgzd+MTHHRnbHhAbGlu
dXRyb25peC5kZQAKCRCmGPVMDXSYodTRD/0RmG5tngCbEJmTw6lPDQzRZH4OO3ja
yRYlyBipemoRmvJRGjV4uHqN2QPrdOuoqMuyBO1aWcMdkpww5bAHcbgSFrlGM1lW
kqtaxVMbufPiLQSGYe7OQf478CE1ykoBd5Va8whFKrtA73qEUdEMfWT0stspg780
7BlmQOemL91p7Ytf03FbDdo8tZ5Xu9uXGAulwY9FZsFtsCNyvhl7nOv5Sk8ZQtGO
xHRCeunjZLWR+IaK59hdakvQybXwSnjT6jODp96nlyKABEKSPShGSPFDWd3g9px7
4911QwgnvTbcrsk6YmQEmPIOgXZzypjbnjpJr8tFpTbkVIy+6chi5cBJzXoqsUaM
ylTwFcUQNvcP8yF447qb+nyPFKM5xsC07W0UpZMuJUDmhhPRtDm5pK0jpsif96GP
l4aMsWe65PUmXHQqLdE89RJXAa8XQ2qspKVtNKq9DmEVgTviQ09Z9SSQIx4U0yIx
w+YPde8kH2+O+YtMUn/MmfHhUP4MKya7j5zd8Bnv8wLBi7XGPPA5EKKh9I0dz9m+
X94lweNXyH+Q8U9mt2cQf8VG8Yzgk0eeC0sliJIlybwRgEgRcQbVWw0VvZUA1ySa
VBlaj3SinO90FEQ0CctT51ss2mUJ/XsGCnxpiGZXfqIZzFbyD1YfZQnXJH0H67DI
CqdHw22I27Mu/A==
=9nLp
-----END PGP SIGNATURE-----
Merge tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull irq cleanups from Thomas Gleixner:
"A set of cleanups for the generic interrupt subsystem:
- Consolidate on one set of functions for the interrupt domain code
to get rid of pointlessly duplicated code with only marginal
different semantics.
- Update the documentation accordingly and consolidate the coding
style of the irqdomain header"
* tag 'irq-cleanups-2025-05-25' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (46 commits)
irqdomain: Consolidate coding style
irqdomain: Fix kernel-doc and add it to Documentation
Documentation: irqdomain: Update it
Documentation: irq-domain.rst: Simple improvements
Documentation: irq/concepts: Minor improvements
Documentation: irq/concepts: Add commas and reflow
irqdomain: Improve kernel-docs of functions
irqdomain: Make struct irq_domain_info variables const
irqdomain: Use irq_domain_instantiate()'s return value as initializers
irqdomain: Drop irq_linear_revmap()
pinctrl: keembay: Switch to irq_find_mapping()
irqchip/armada-370-xp: Switch to irq_find_mapping()
gpu: ipu-v3: Switch to irq_find_mapping()
gpio: idt3243x: Switch to irq_find_mapping()
sh: Switch to irq_find_mapping()
powerpc: Switch to irq_find_mapping()
irqdomain: Drop irq_domain_add_*() functions
powerpc: Switch irq_domain_add_nomap() to use fwnode
thermal: Switch to irq_domain_create_linear()
soc: Switch to irq_domain_create_*()
...
Memory sparing is defined as a repair function that replaces a portion of
memory with a portion of functional memory at that same DPA. The subclasses
for this operation vary in terms of the scope of the sparing being
performed. The cacheline sparing subclass refers to a sparing action that
can replace a full cacheline. Row sparing is provided as an alternative to
PPR sparing functions and its scope is that of a single DDR row.
As per CXL r3.2 Table 8-125 foot note 1. Memory sparing is preferred over
PPR when possible.
Bank sparing allows an entire bank to be replaced. Rank sparing is defined
as an operation in which an entire DDR rank is replaced.
Memory sparing maintenance operations may be supported by CXL devices
that implement CXL.mem protocol. A sparing maintenance operation requests
the CXL device to perform a repair operation on its media.
For example, a CXL device with DRAM components that support memory sparing
features may implement sparing maintenance operations.
The host may issue a query command by setting query resources flag in the
input payload (CXL spec 3.2 Table 8-120) to determine availability of
sparing resources for a given address. In response to a query request,
the device shall report the resource availability by producing the memory
sparing event record (CXL spec 3.2 Table 8-60) in which the Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields are a copy
of the values specified in the request.
During the execution of a sparing maintenance operation, a CXL memory
device:
- may not retain data
- may not be able to process CXL.mem requests correctly.
These CXL memory device capabilities are specified by restriction flags
in the memory sparing feature readable attributes.
When a CXL device identifies error on a memory component, the device
may inform the host about the need for a memory sparing maintenance
operation by using DRAM event record, where the 'maintenance needed' flag
may set. The event record contains some of the DPA, Channel, Rank,
Nibble Mask, Bank Group, Bank, Row, Column, Sub-Channel fields that
should be repaired. The userspace tool requests for maintenance operation
if the 'maintenance needed' flag set in the CXL DRAM error record.
CXL spec 3.2 section 8.2.10.7.1.4 describes the device's memory sparing
maintenance operation feature.
CXL spec 3.2 section 8.2.10.7.2.3 describes the memory sparing feature
discovery and configuration.
Add support for controlling CXL memory device memory sparing feature.
Register with EDAC driver, which gets the memory repair attr descriptors
from the EDAC memory repair driver and exposes sysfs repair control
attributes for memory sparing to the userspace. For example CXL memory
sparing control for the CXL mem0 device is exposed in
/sys/bus/edac/devices/cxl_mem0/mem_repairX/
Use case
========
1. CXL device identifies a failure in a memory component, report to
userspace in a CXL DRAM trace event with DPA and other attributes of
memory to repair such as channel, rank, nibble mask, bank Group,
bank, row, column, sub-channel.
2. Rasdaemon process the trace event and may issue query request in sysfs
check resources available for memory sparing if either of the following
conditions met.
- 'maintenance needed' flag set in the event record.
- 'threshold event' flag set for CVME threshold feature.
- When the number of corrected error reported on a CXL.mem media to the
userspace exceeds the threshold value for corrected error count defined
by the userspace policy.
3. Rasdaemon process the memory sparing trace event and issue repair
request for memory sparing.
Kernel CXL driver shall report memory sparing event record to the userspace
with the resource availability in order rasdaemon to process the event
record and issue a repair request in sysfs for the memory sparing operation
in the CXL device.
Note: Based on the feedbacks from the community 'query' sysfs attribute is
removed and reporting memory sparing error record to the userspace are not
supported. Instead userspace issues sparing operation and kernel does the
same to the CXL memory device, when 'maintenance needed' flag set in the
DRAM event record.
Add checks to ensure the memory to be repaired is offline and if online,
then originates from a CXL DRAM error record reported in the current boot
before requesting a memory sparing operation on the device.
Note: Tested memory sparing feature control with QEMU patch
"hw/cxl: Add emulation for memory sparing control feature"
https://lore.kernel.org/linux-cxl/20250509172229.726-1-shiju.jose@huawei.com/T/#m5f38512a95670d75739f9dad3ee91b95c7f5c8d6
Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Reviewed-by: Dave Jiang <dave.jiang@intel.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Reviewed-by: Alison Schofield <alison.schofield@intel.com>
Acked-by: Dan Williams <dan.j.williams@intel.com>
Link: https://patch.msgid.link/20250521124749.817-8-shiju.jose@huawei.com
Signed-off-by: Dave Jiang <dave.jiang@intel.com>
The bluefield_edac_readl() routine returns an uninitialized result on error
paths. In those cases the calling routine should not use the uninitialized
result. The driver should simply log the error, and then return early.
Fixes: e419675754 ("EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2")
Signed-off-by: David Thompson <davthompson@nvidia.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Shravan Kumar Ramani <shravankr@nvidia.com>
Link: https://lore.kernel.org/20250318214747.12271-1-davthompson@nvidia.com
irq_domain_add_linear() is going away as being obsolete now. Switch to
the preferred irq_domain_create_linear(). That differs in the first
parameter: It takes more generic struct fwnode_handle instead of struct
device_node. Therefore, of_fwnode_handle() is added around the
parameter.
Note some of the users can likely use dev->fwnode directly instead of
indirect of_fwnode_handle(dev->of_node). But dev->fwnode is not
guaranteed to be set for all, so this has to be investigated on case to
case basis (by people who can actually test with the HW).
[ tglx: Fix up subject prefix ]
Signed-off-by: Jiri Slaby (SUSE) <jirislaby@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250319092951.37667-17-jirislaby@kernel.org
For historic reasons there are some TSC-related functions in the
<asm/msr.h> header, even though there's an <asm/tsc.h> header.
To facilitate the relocation of rdtsc{,_ordered}() from <asm/msr.h>
to <asm/tsc.h> and to eventually eliminate the inclusion of
<asm/msr.h> in <asm/tsc.h>, add an explicit <asm/msr.h> dependency
to the source files that reference definitions from <asm/msr.h>.
[ mingo: Clarified the changelog. ]
Signed-off-by: Xin Li (Intel) <xin@zytor.com>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com>
Cc: Andy Lutomirski <luto@kernel.org>
Cc: Brian Gerst <brgerst@gmail.com>
Cc: Juergen Gross <jgross@suse.com>
Cc: H. Peter Anvin <hpa@zytor.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Kees Cook <keescook@chromium.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Josh Poimboeuf <jpoimboe@redhat.com>
Cc: Uros Bizjak <ubizjak@gmail.com>
Link: https://lore.kernel.org/r/20250501054241.1245648-1-xin@zytor.com
The tool of Smatch static checker reported the following warning:
drivers/edac/i10nm_base.c:364 show_retry_rd_err_log()
warn: should bitwise negate be 'ullong'?
This warning was due to the bitwise NOT/AND operations between
'status_mask' (a u32 type) and 'log' (a u64 type), which resulted in
the high 32 bits of 'log' were cleared.
This was a false positive warning, as only the low 32 bits of 'log' was
written to the first RRL memory controller register (a u32 type).
To improve code sanity, fix this warning by changing 'status_mask' to
a u64 type, ensuring it matches the size of 'log' for bitwise operations.
Reported-by: Dan Carpenter <dan.carpenter@linaro.org>
Closes: https://lore.kernel.org/all/aAih0KmEVq7ch6v2@stanley.mountain/
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250424081454.2952632-1-qiuxu.zhuo@intel.com
Add two compute die IDs for Raptor Lake-S and Alder Lake-S for EDAC
support. Note that because Alder Lake-S shares the same memory controller
registers as Raptor Lake-S, it can reuse the configuration data of Raptor
Lake-S for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: James Jernigan <jameswestonjernigan@gmail.com>
Link: https://lore.kernel.org/r/20250422134450.1880648-1-qiuxu.zhuo@intel.com
Compared to previous generations, Granite Rapids defines the RRL control
bits {en_patspr, noover, en} in different positions, adds an extra RRL set
for the new mode of the first patrol-scrub read error, and extends the
number of CORRERRCNT registers from 4 to 8, encoding one counter per
CORRERRCNT register.
Add a Granite Rapids reg_rrl configuration table and adjust the code to
accommodate the differences mentioned above for RRL support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-8-qiuxu.zhuo@intel.com
Make the {valid bit, overwritten status, number} of RRL registers and the
{number, offsets, widths} of per-channel CORRERRCNT registers configurable.
Refactor show_retry_rd_err_log() to use the configurable fields of struct
reg_rrl, making the code more scalable and simpler.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-7-qiuxu.zhuo@intel.com
Refactor enable_retry_rd_err_log() using helper functions for both
DDR and HBM, making the RRL control bits configurable instead of
hard-coded. Additionally, explicitly define the four RRL modes for
better readability.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-6-qiuxu.zhuo@intel.com
As the number of RRL (retry_rd_err_log) registers per memory channel
increases, the positions of the RRL control bits and the widths of the
RRL registers vary across different CPU generations. Adding RRL support
for a new CPU requires handling these differences throughout the
RRL-related code.
Structure the offsets, widths, control bit positions, set numbers, modes,
etc., of the per-channel RRL registers and make them configurable to
facilitate easier RRL support for new CPUs.
No functional changes are intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-5-qiuxu.zhuo@intel.com
The i10nm_edac driver uses the default modes (either patrol scrub read
or on-demand read) of the RRL register sets configured by the BIOS.
Explicitly set the modes during the loading of the i10nm_edac driver with
the module parameter retry_rd_err_log=2.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-4-qiuxu.zhuo@intel.com
When enabling the retry_rd_err_log (RRL) feature during the loading of the
i10nm_edac driver with the module parameter retry_rd_err_log=2 (Linux RRL
control mode), the default values of the control bits of RRL are saved so
that they can be restored during the unloading of the driver.
In the current code, the RRL of pseudo channel 1 of HBM overwrites pseudo
channel 0 during the loading of the driver, resulting in the loss of saved
RRL for pseudo channel 0. This causes the RRL of pseudo channel 0 of HBM to
be wrongly restored with the values from pseudo channel 1 when unloading
the driver.
Fix this issue by creating two separate groups of RRL control registers
per channel to save default RRL settings of two {sub-,pseudo-}channels.
Fixes: acd4cf68fe ("EDAC/i10nm: Retrieve and print retry_rd_err_log registers for HBM")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-3-qiuxu.zhuo@intel.com
After loading i10nm_edac (which automatically loads skx_edac_common), if
unload only i10nm_edac, then reload it and perform error injection testing,
a general protection fault may occur:
mce: [Hardware Error]: Machine check events logged
Oops: general protection fault ...
...
Workqueue: events mce_gen_pool_process
RIP: 0010:string+0x53/0xe0
...
Call Trace:
<TASK>
? die_addr+0x37/0x90
? exc_general_protection+0x1e7/0x3f0
? asm_exc_general_protection+0x26/0x30
? string+0x53/0xe0
vsnprintf+0x23e/0x4c0
snprintf+0x4d/0x70
skx_adxl_decode+0x16a/0x330 [skx_edac_common]
skx_mce_check_error.part.0+0xf8/0x220 [skx_edac_common]
skx_mce_check_error+0x17/0x20 [skx_edac_common]
...
The issue arose was because the variable 'adxl_component_count' (inside
skx_edac_common), which counts the ADXL components, was not reset. During
the reloading of i10nm_edac, the count was incremented by the actual number
of ADXL components again, resulting in a count that was double the real
number of ADXL components. This led to an out-of-bounds reference to the
ADXL component array, causing the general protection fault above.
Fix this issue by resetting the 'adxl_component_count' in adxl_put(),
which is called during the unloading of {skx,i10nm}_edac.
Fixes: 123b158635 ("EDAC, i10nm: make skx_common.o a separate module")
Reported-by: Feng Xu <feng.f.xu@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Feng Xu <feng.f.xu@intel.com>
Link: https://lore.kernel.org/r/20250417150724.1170168-2-qiuxu.zhuo@intel.com
Intel Amston Lake is a series of SoCs tailored for edge computing needs.
The Amston Lake SoCs, equipped with IBECC(In-Band ECC) capability, share
the same IBECC registers with Alder Lake-N SoCs. Add the Intel Amston Lake
SoC compute die ID for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-4-qiuxu.zhuo@intel.com
The Intel Arizona Beach SoC series is oriented toward network computing.
Some types of these SoCs are equipped with IBECC(In-Band ECC) and share
the same IBECC registers with Alder Lake-N SoCs. Add a die ID for Arizona
Beach SoC for EDAC support.
[Tony: s/Arizona Lake/Arizona Beach/ in commit message]
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-3-qiuxu.zhuo@intel.com
Some BIOS versions may fuse off certain memory controllers and set the
registers of these absent memory controllers to ~0. The current igen6_edac
mistakenly enumerates these absent memory controllers and registers them
with the EDAC core.
Skip the absent memory controllers to avoid mistakenly enumerating them.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250408132455.489046-2-qiuxu.zhuo@intel.com
scrubbing RAS functionality with the kernel and expose sysfs nodes to
control such scrubbing functionality. The main use case is CXL devices which
provide different scrubbers for their built-in memories so that tools like
rasdaemon can configure and control memory scrubbing and other, more
advanced RAS functionality. (Shiju Jose and Jonathan Cameron)
- Add support to ie31200_edac for client SoCs like Raptor Lake-S which have
multiple memory controllers and out-of-band ECC capability. (Qiuxu Zhuo)
- The usual round of cleanups, simplifications and fixlets
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmfitgYACgkQEsHwGGHe
VUp/Tg/7BJeE9QunMlB2EcCbYM+3eelp+Sg899S/6iNdC66sevFPoVXTpv9qz7Q6
+ZD+V5vKIuKmGlV9dtn8nK5o8VvA2EvZsYSp6kk85qC+GNoqlc9E50I1yB3+otl8
/3qD7PH0Ww5a4csjg+ioTRTphXp5DaK5J1m+Gze4h9n2ADs/aDb6vWr2AobomYOT
h8pIb5PBdX9ehjWqUP/d+G+/ZN7244+FtMt1p3/xhBMjRJcwUxeAkw1u59EC5Hpb
poP60Sl4pjr6uUI6QXrGEvLqvX3kq+fqveRosX1L+SlgAXesGXSg/tbdY5T78zGS
aTebwmej00tvqQIYfsPpFKqk4W2wxUfnG6a2K0U3fYINQqSjI8kPrq9kMLpPejAG
Lb0rZmHwLTPMM+G0BZVc4QSClhO9GXnD1wIsH8YGcqEkjDo0wkDL3KVm8aFhcx7b
BDHn7b9Zx9zIvPhlcRupsUUNiqrNAV3R+zfUWzH9JF/GeCrT148vs8cZX16QGk18
bnA5SY/mv8EExbeaEltKRagdToqxW6WEq5vv6KuLpco4kfCWJYWUHe5QaAMzZUBZ
hXW0vvzUbkBLZo2NsdbXJk0+iT/lmjjOaqHZGcuwtepLIsJoRySkabbsMe6TTTtY
O4abNt5yJkgcCYJwMwKmS9RogROO9yZjhK+4QfGex2lYgjlMTds=
=OpTE
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add infrastructure support to EDAC in order to be able to register
memory scrubbing RAS functionality with the kernel and expose sysfs
nodes to control such scrubbing functionality.
The main use case is CXL devices which provide different scrubbers
for their built-in memories so that tools like rasdaemon can
configure and control memory scrubbing and other, more advanced RAS
functionality (Shiju Jose and Jonathan Cameron)
- Add support to ie31200_edac for client SoCs like Raptor Lake-S which
have multiple memory controllers and out-of-band ECC capability
(Qiuxu Zhuo)
- The usual round of cleanups, simplifications and fixlets
* tag 'edac_updates_for_v6.15' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras: (25 commits)
MAINTAINERS: Add a secondary maintainer for bluefield_edac
EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
EDAC/ie31200: Break up ie31200_probe1()
EDAC/ie31200: Fold the two channel loops into one loop
EDAC/ie31200: Make struct dimm_data contain decoded information
EDAC/ie31200: Make the memory controller resources configurable
EDAC/ie31200: Simplify the pci_device_id table
EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
EDAC/ie31200: Fix the error path order of ie31200_init()
EDAC/ie31200: Fix the DIMM size mask for several SoCs
EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
EDAC/device: Fix dev_set_name() format string
EDAC/pnd2: Make read-only const array intlv static
EDAC/igen6: Constify struct res_config
EDAC/amd64: Simplify return statement in dct_ecc_enabled()
EDAC: Update memory repair control interface for memory sparing feature
EDAC: Add a memory repair control feature
EDAC: Use string choice helper functions
EDAC: Add a Error Check Scrub control feature
...
* ras/edac-cxl:
EDAC/device: Fix dev_set_name() format string
EDAC: Update memory repair control interface for memory sparing feature
EDAC: Add a memory repair control feature
EDAC: Add a Error Check Scrub control feature
EDAC: Add scrub control feature
EDAC: Add support for EDAC device features control
* ras/edac-drivers:
EDAC/ie31200: Switch Raptor Lake-S to interrupt mode
EDAC/ie31200: Add Intel Raptor Lake-S SoCs support
EDAC/ie31200: Break up ie31200_probe1()
EDAC/ie31200: Fold the two channel loops into one loop
EDAC/ie31200: Make struct dimm_data contain decoded information
EDAC/ie31200: Make the memory controller resources configurable
EDAC/ie31200: Simplify the pci_device_id table
EDAC/ie31200: Fix the 3rd parameter name of *populate_dimm_info()
EDAC/ie31200: Fix the error path order of ie31200_init()
EDAC/ie31200: Fix the DIMM size mask for several SoCs
EDAC/ie31200: Fix the size of EDAC_MC_LAYER_CHIP_SELECT layer
EDAC/{skx_common,i10nm}: Fix some missing error reports on Emerald Rapids
EDAC/igen6: Fix the flood of invalid error reports
EDAC/ie31200: work around false positive build warning
* ras/edac-misc:
MAINTAINERS: Add a secondary maintainer for bluefield_edac
EDAC/pnd2: Make read-only const array intlv static
EDAC/igen6: Constify struct res_config
EDAC/amd64: Simplify return statement in dct_ecc_enabled()
EDAC: Use string choice helper functions
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Raptor Lake-S SoCs notify correctable memory errors via CMCI (Corrected
Machine Check Interrupt). Switch Raptor Lake-S EDAC support from polling
to interrupt mode by registering the callback to the MCE decode notifier
chain.
Note that as Raptor Lake-S SoCs may not recover from uncorrectable memory
errors, the system will hang as soon as this type of error occurs, and the
registered callback on the MCE decode chain will not be executed. This is
the expected behavior.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-12-qiuxu.zhuo@intel.com
The Intel Raptor Lake-S SoC contains two memory controllers with DDR5
memory type and out-of-band ECC capability. The resource definitions of
the memory controller are different from previous generations. One notable
difference is that the PCI ERRSTS register is deprecated and is not used
to indicate the presence of errors or to clear the MMIO-mapped ECC error
log regsiters.
Extend the ie31200_edac driver to support multiple memory controllers,
add a resource configuration table and use an MSR register to clear the
ECC error log registers to provide EDAC support for Raptor Lake-S SoCs.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-11-qiuxu.zhuo@intel.com
Split ie31200_probe1() into two helper functions to easily extend support
for multiple memory controllers.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-10-qiuxu.zhuo@intel.com
Fold the two channel loops to simplify the code and improve readability.
Also, delete the comments related to the DRB register, as this register
is not used here.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-9-qiuxu.zhuo@intel.com
The current dimm_data structure contains encoded DIMM information,
which needs to be decoded for a given SoC when it is used. Make it
contain decoded information when it's initialized so that the places
where it is used do not need to decode it again, thereby simplifying
the code.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-8-qiuxu.zhuo@intel.com
The resources such as MMIO, register offset, register mask, memory DIMM
information, ECC error log location, etc., of the memory controller, and
the number of memory controllers can be device-ID-specific. It requires
adding numerous 'if (device_id == new_id)' special handling cases to the
code to support a new SoC.
Make these kinds of resources configurable and separate them from the code
to facilitate the addition of new SoC support.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-7-qiuxu.zhuo@intel.com
Use PCI_VDEVICE() to simplify the pci_device_id table.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-6-qiuxu.zhuo@intel.com
The 3rd parameter of *populate_dimm_info() pertains to the DIMM index
within a channel, not the channel index. Fix the parameter name to dimm
to reflect its actual purpose.
No functional changes intended.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-5-qiuxu.zhuo@intel.com
The error path order of ie31200_init() is incorrect, fix it.
Fixes: 709ed1bcef ("EDAC/ie31200: Fallback if host bridge device is already initialized")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-4-qiuxu.zhuo@intel.com
The DIMM size mask for {Sky, Kaby, Coffee} Lake is not bits{7:0},
but bits{5:0}. Fix it.
Fixes: 953dee9bbd ("EDAC, ie31200_edac: Add Skylake support")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-3-qiuxu.zhuo@intel.com
The EDAC_MC_LAYER_CHIP_SELECT layer pertains to the rank, not the DIMM.
Fix its size to reflect the number of ranks instead of the number of DIMMs.
Also delete the unused macros IE31200_{DIMMS,RANKS}.
Fixes: 7ee40b897d ("ie31200_edac: Introduce the driver")
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Gary Wang <gary.c.wang@intel.com>
Link: https://lore.kernel.org/r/20250310011411.31685-2-qiuxu.zhuo@intel.com
Passing a variable string as the format to dev_set_name() causes a W=1 warning:
drivers/edac/edac_device.c:736:9: error: format not a string literal and no format arguments [-Werror=format-security]
736 | ret = dev_set_name(&ctx->dev, name);
| ^~~
Use a literal "%s" instead so the name can be the argument.
Fixes: db99ea5f2c ("EDAC: Add support for EDAC device features control")
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250304143603.995820-1-arnd@kernel.org
Don't populate the const read-only array intlv on the stack at run time,
instead make it static. This also shrinks the object size:
$ size pnd2_edac.o.*
text data bss dec hex filename
15632 264 1384 17280 4380 pnd2_edac.o.new
15644 264 1384 17292 438c pnd2_edac.o.old
Signed-off-by: Colin Ian King <colin.i.king@gmail.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20240919170427.497429-1-colin.i.king@gmail.com
The res_config structs are not modified in this driver.
Constifying these structures moves some data to a read-only section, so
increase overall security, especially when the structure holds some function
pointers.
On a x86_64, with allmodconfig, as an example:
Before:
======
text data bss dec hex filename
36777 2479 4304 43560 aa28 drivers/edac/igen6_edac.o
After:
=====
text data bss dec hex filename
37297 1959 4304 43560 aa28 drivers/edac/igen6_edac.o
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/a06153870951a64b438e76adf97d440e02c1a1fc.1738355198.git.christophe.jaillet@wanadoo.fr
Update memory repair control interface for memory sparing feature.
CXL memory devices can support soft and hard memory sparing at cacheline,
row, bank and rank granularities. Memory sparing is defined as a repair
function that replaces a portion of memory with a portion of functional
memory at that same granularity.
When a CXL device detects an error in memory, it will report to the host
that there's need for a repair maintenance operation by using an event
record where the "maintenance needed" flag is set.
The event records contain the device physical address (DPA) and other
attributes of the memory to repair such as bank group, bank, rank, row,
column, channel etc.
The kernel will report the corresponding CXL general media or DRAM trace
event to userspace, and userspace tools (e.g. rasdaemon) will initiate
a repair operation in response to the device request via the sysfs
repair control.
[ bp: Massage. ]
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-15-shiju.jose@huawei.com
Add a generic EDAC memory repair control driver to manage memory repairs in
the system, such as CXL Post Package Repair (PPR) and other soft and hard PPR
features.
For example, a CXL device with DRAM components that support PPR features may
implement PPR maintenance operations. DRAM components may support two types of
PPR:
- hard PPR, for a permanent row repair, and
- soft PPR, for a temporary row repair.
Soft PPR is much faster than hard PPR, but the repair is lost with a power
cycle.
When a CXL device detects an error in a memory, it may report the need for
a repair maintenance operation by using an event record where the "maintenance
needed" flag is set. The event records contain the device physical
address (DPA) and other optional attributes of the memory to repair.
The kernel will report the corresponding CXL general media or DRAM trace event
to userspace, and userspace tools (e.g. rasdaemon) will initiate a repair
operation in response to the device request via the sysfs repair control.
Device with memory repair features registers with EDAC device driver, which
retrieves a memory repair descriptor from EDAC memory repair driver and exposes
the sysfs repair control attributes to userspace in
/sys/bus/edac/devices/<dev-name>/mem_repairX/.
The common memory repair control interface abstracts the control of arbitrary
memory repair functionality into a standardized set of functions. The sysfs
memory repair attribute nodes are only available if the client driver has
implemented the corresponding attribute callback function and provided
operations to the EDAC device driver during registration.
[ bp: Massage, fixup edac_dev_register() retvals, merge
write_overflow fix to mem_repair_create_desc() ]
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20250212143654.1893-5-shiju.jose@huawei.com
Remove hard-coded strings by using the str_enabled_disabled(), str_yes_no(),
str_write_read(), and str_plural() helper functions.
Add a space in "All DIMMs support ECC: yes/no" to improve readability.
Signed-off-by: Thorsten Blum <thorsten.blum@linux.dev>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Yazen Ghannam <yazen.ghannam@amd.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/r/20250223212429.3466-2-thorsten.blum@linux.dev
Add an Error Check Scrub (ECS) control to manage a memory device's ECS
feature.
The ECS is a feature defined in JEDEC DDR5 SDRAM Specification (JESD79-5) and
allows the DRAM to internally read, correct single-bit errors, and write back
corrected data bits to the DRAM array while providing transparency to error
counts.
The DDR5 device contains a number of memory media Field Replaceable Units
(FRU) per device. The DDR5 ECS feature and thus the ECS control driver
supports configuring the ECS parameters per FRU.
Memory devices support the ECS feature register with the EDAC device driver,
which retrieves the ECS descriptor from the EDAC ECS driver. This driver
exposes sysfs ECS control attributes to userspace via
/sys/bus/edac/devices/<dev-name>/ecs_fruX/.
The common sysfs ECS control interface abstracts the control of an arbitrary
ECS functionality to a common set of functions.
Support for the ECS feature is added separately because the control attributes
of the DDR5 ECS feature differ from those of the scrub feature.
The sysfs ECS attribute nodes are only present if the client driver has
implemented the corresponding attribute callback function and passed the
necessary operations to the EDAC RAS feature driver during registration.
[ bp: Massage, fixup edac_dev_register() retvals. ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-4-shiju.jose@huawei.com
Add a scrub control to manage memory scrubbers in the system.
Devices with a scrub feature register with the EDAC device driver which
retrieves the scrub descriptor from the scrub driver and exposes the
control attributes for a instance to userspace at
/sys/bus/edac/devices/<dev-name>/scrubX/.
The common sysfs scrub control interface abstracts the control of
arbitrary scrubbing functionality into a common set of functions. The
attribute nodes are only present if the client driver has implemented
the corresponding attribute callback function and passed the operations
to the device driver during registration.
[ bp: Massage commit message, docs and code, simplify text a bit.
Integrate fixup for: https://lore.kernel.org/r/202502251009.0sGkolEJ-lkp@intel.com
Reported-by: kernel test robot <lkp@intel.com>
Reported-by: Dan Carpenter <dan.carpenter@linaro.org> ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-3-shiju.jose@huawei.com
Add generic EDAC device feature controls supporting the registration of RAS
features available in the system. The driver exposes control attributes for
these features to userspace in
/sys/bus/edac/devices/<dev-name>/<ras-feature>
[ bp: Touch-up documentation, simplify, make edac_dev_type static,
fixup edac_dev_register() retvals. ]
Co-developed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Jonathan Cameron <Jonathan.Cameron@huawei.com>
Signed-off-by: Shiju Jose <shiju.jose@huawei.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Fan Ni <fan.ni@samsung.com>
Tested-by: Daniel Ferguson <danielf@os.amperecomputing.com>
Tested-by: Fan Ni <fan.ni@samsung.com>
Link: https://lore.kernel.org/r/20250212143654.1893-2-shiju.jose@huawei.com
When doing error injection to some memory DIMMs on certain Intel Emerald
Rapids servers, the i10nm_edac missed error reports for some memory DIMMs.
Certain BIOS configurations may hide some memory controllers, and the
i10nm_edac doesn't enumerate these hidden memory controllers. However, the
ADXL decodes memory errors using memory controller physical indices even
if there are hidden memory controllers. Therefore, the memory controller
physical indices reported by the ADXL may mismatch the logical indices
enumerated by the i10nm_edac, resulting in missed error reports for some
memory DIMMs.
Fix this issue by creating a mapping table from memory controller physical
indices (used by the ADXL) to logical indices (used by the i10nm_edac) and
using it to convert the physical indices to the logical indices during the
error handling process.
Fixes: c545f5e412 ("EDAC/i10nm: Skip the absent memory controllers")
Reported-by: Kevin Chang <kevin1.chang@intel.com>
Tested-by: Kevin Chang <kevin1.chang@intel.com>
Reported-by: Thomas Chen <Thomas.Chen@intel.com>
Tested-by: Thomas Chen <Thomas.Chen@intel.com>
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20250214002728.6287-1-qiuxu.zhuo@intel.com
gcc-14 produces a bogus warning in some configurations:
drivers/edac/ie31200_edac.c: In function 'ie31200_probe1.isra':
drivers/edac/ie31200_edac.c:412:26: error: 'dimm_info' is used uninitialized [-Werror=uninitialized]
412 | struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL];
| ^~~~~~~~~
drivers/edac/ie31200_edac.c:412:26: note: 'dimm_info' declared here
412 | struct dimm_data dimm_info[IE31200_CHANNELS][IE31200_DIMMS_PER_CHANNEL];
| ^~~~~~~~~
I don't see any way the unintialized access could really happen here,
but I can see why the compiler gets confused by the two loops.
Instead, rework the two nested loops to only read the addr_decode
registers and then keep only one instance of the dimm info structure.
[Tony: Qiuxu pointed out that the "populate DIMM info" comment was left
behind in the refactor and suggested moving it. I deleted the comment
as unnecessry in front os a call to populate_dimm_info(). That seems
pretty self-describing.]
Signed-off-by: Arnd Bergmann <arnd@arndb.de>
Acked-by: Jason Baron <jbaron@akamai.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20250122065031.1321015-1-arnd@kernel.org
The previous implementation incorrectly configured the cmn_interrupt_2_enable
register for interrupt handling. Using cmn_interrupt_2_enable to configure
Tag, Data RAM ECC interrupts would lead to issues like double handling of the
interrupts (EL1 and EL3) as cmn_interrupt_2_enable is meant to be configured
for interrupts which needs to be handled by EL3.
EL1 LLCC EDAC driver needs to use cmn_interrupt_0_enable register to configure
Tag, Data RAM ECC interrupts instead of cmn_interrupt_2_enable.
Fixes: 27450653f1 ("drivers: edac: Add EDAC driver support for QCOM SoCs")
Signed-off-by: Komal Bajaj <quic_kbajaj@quicinc.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241119064608.12326-1-quic_kbajaj@quicinc.com
which is legacy now, and the creation of the new AMD node concept which
represents the Zen architecture of having a collection of I/O devices within
an SoC. Those nodes comprise the so-called data fabric on Zen. This has
at least one practical advantage of not having to add a PCI ID each time
a new data fabric PCI device releases. Eventually, the lot more uniform
provider of data fabric functionality amd_node.c will be used by all the
drivers which need it
- Smaller cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePuPIACgkQEsHwGGHe
VUpU6Q//S9j9+YC9EpredFoJ5W0BfERR5XOum7YjlLxq2mVTStrf9Q1ecrwmS4Q6
4mAydIDfhqNlouUjMBgNNFJcvm8lat+/pjY78oT8ZdjumslMbMxo81VmQ3fX+6fE
izMrL81DG4j8zeleUyz5ecJEK/KPw1s3SkY736511PeJSalOU4hLYmU819imfAk/
5c9os2GNhszIROE1YUYZQ3zXne1t2PNXKvctzVrJYjyKpIDgFNzTj6gXhePzXBNO
iFdApqSgKdnnsD6VsfxYVnOKP+cSIl27Tbge6dm7DHQbSs00aVL64JPcX8/hWtp6
ExrwBYiFk6yafwsNUu7/PmqbZNKYxDgvXFq8jSOFfioh6Km/QZYs8y1/qXN3qmSU
78Ah5jyO+U+++FsSa2o9eRpU2l84UIQqvp84PeSLylzh7iLFyFCWsMfreNeIsF9v
Jsost58JQOCufRK3qfMiDO88QUZRKyCfFymDAVcvPoBwp5nK9R1ohlbxgXrCPsE7
Bd7J6jrlpcoRyYc8vhshkrnK2Sk6pP77OZOh5AZ9AybnALH0afUNLzk6sBtaObkZ
xIJcSIBkKz3P4zWFKsXmqGYHWp1IsKsYRsNjCt5FExWOF+uKKKBjynHmlKeS0l/b
J6bwDUPVW/gfkBqDV8bILultj9Gm8L5Z8SwvD1ww69OYN+c7oVk=
=ZAjD
-----END PGP SIGNATURE-----
Merge tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull misc x86 updates from Borislav Petkov:
- The first part of a restructuring of AMD's representation of a
northbridge which is legacy now, and the creation of the new AMD node
concept which represents the Zen architecture of having a collection
of I/O devices within an SoC. Those nodes comprise the so-called data
fabric on Zen.
This has at least one practical advantage of not having to add a PCI
ID each time a new data fabric PCI device releases. Eventually, the
lot more uniform provider of data fabric functionality amd_node.c
will be used by all the drivers which need it
- Smaller cleanups
* tag 'x86_misc_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/amd_node: Use defines for SMN register offsets
x86/amd_node: Remove dependency on AMD_NB
x86/amd_node: Update __amd_smn_rw() error paths
x86/amd_nb: Move SMN access code to a new amd_node driver
x86/amd_nb, hwmon: (k10temp): Simplify amd_pci_dev_to_node_id()
x86/amd_nb: Simplify function 3 search
x86/amd_nb: Use topology info to get AMD node count
x86/amd_nb: Simplify root device search
x86/amd_nb: Simplify function 4 search
x86: Start moving AMD node functionality out of AMD_NB
x86/amd_nb: Clean up early_is_amd_nb()
x86/amd_nb: Restrict init function to AMD-based systems
x86/mtrr: Rename mtrr_overwrite_state() to guest_force_mtrr_state()
use the generic struct x86_cpu_id thing
- Remove magic naked numbers for CPUID functions and use proper defines of the
prefix CPUID_LEAF_*. Consolidate some of the crazy use around the tree
- Smaller cleanups and improvements
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmePjeIACgkQEsHwGGHe
VUqRBA//TinKFcWagaQB3lsnoBRwqyg6JJZIBNMF9sBMDD9HnvEZ/JduC+3+g1rx
iztuCmRSgQsi/QvRaEFNuDMOgk6gACyXxi7Uf6eXsQkSlsZFViaqbXsy9kqslRbl
7QP1NS1sfdSd42JPp2UZT/lg9kluuVnn5b40zZIwy2AAzwrNFfZAS4Yg7Qe4XQDF
xBcHi8MAF+LTm5Tv0hLmx2UcfZLhi7hXy8mTAIFS0Liww+Y5qaam33xw9KxNU5lZ
tVepzY5my43pRs4MB1CvaQCiZ84GxvAVqz3JYsg5YhVp45xh7P2WtjBeeOqLljaW
MkWnDLOmlaD4Y0kL4QA3ReyBVux54RbDGKC0E/t5fwYlk3dQ7gYwSEvh5358R+0z
kwxw3NdnNngoLRXAX45EonSxj36jb6KCBHAGqXSfL73OOt30RWCqknEnixcOp/BP
chNxCiIx7qko+rAYOD62QkguEEPFdb8roeayhIKtiKL5zUwQAr+jt/pKVx2htWLi
xxqSaVoCFu4edWpsEJnanqhS0Es0v7YiBU3jDC37rZJ+dtzf0C2ewD7Nb1g+wUTn
NzDkmt58hQW4jBxoxHBIclLfhEETISTEGAAObTa5I5r8IDb7Dv+ZnSv7RfjoR9fL
RWMz1bJ1Scem+Fx7fc/IRJFSElC41giSwFlhThHdAzI1m95zJN8=
=9Hdg
-----END PGP SIGNATURE-----
Merge tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull x86 cpuid updates from Borislav Petkov:
- Remove the less generic CPU matching infra around struct x86_cpu_desc
and use the generic struct x86_cpu_id thing
- Remove magic naked numbers for CPUID functions and use proper defines
of the prefix CPUID_LEAF_*. Consolidate some of the crazy use around
the tree
- Smaller cleanups and improvements
* tag 'x86_cpu_for_v6.14_rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
x86/cpu: Make all all CPUID leaf names consistent
x86/fpu: Remove unnecessary CPUID level check
x86/fpu: Move CPUID leaf definitions to common code
x86/tsc: Remove CPUID "frequency" leaf magic numbers.
x86/tsc: Move away from TSC leaf magic numbers
x86/cpu: Move TSC CPUID leaf definition
x86/cpu: Refresh DCA leaf reading code
x86/cpu: Remove unnecessary MwAIT leaf checks
x86/cpu: Use MWAIT leaf definition
x86/cpu: Move MWAIT leaf definition to common header
x86/cpu: Remove 'x86_cpu_desc' infrastructure
x86/cpu: Move AMD erratum 1386 table over to 'x86_cpu_id'
x86/cpu: Replace PEBS use of 'x86_cpu_desc' use with 'x86_cpu_id'
x86/cpu: Expose only stepping min/max interface
x86/cpu: Introduce new microcode matching helper
x86/cpufeature: Document cpu_feature_enabled() as the default to use
x86/paravirt: Remove the WBINVD callback
x86/cpufeatures: Free up unused feature bits
blades support
- Add a new EDAC driver for Loongson SoCs which reports single-bit correctable
errors
- Extend the SKX and i10NM EDAC drivers to support UV systems which can have
more than 8 nodes
- Add Intel Clearwater Forest server support to i10nm_edac
- Minor fix
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmeLi5QACgkQEsHwGGHe
VUpZEhAAsbvNhKhq84FUCoQP+OWOdQBv0s3WNGherWPAxC3bGY7xDKcjjdf6ebYG
4Rk0+nTEK5kefA6PsyiWtxWQZiYUzERDrNpdjWq5RxaO6WtREDiczgJm4NaCCTBr
F59eHImjW/ajBsU8FAcWnVZLo7KqZvtF19vQFL1TXAKJO6Zpb0ybLIts9BVSwwyV
c1xYyjMFh1p1H8n7MOsF11u2QUpCcc/SMDesCGWSVAJ2QnB7Ox5NfUaI97lQSiow
gnS8vTWTIM6e6rZk2HtkSage0Wt7UHCkIsza5DEdW3xQG10eZUE6o33kerxNMezd
lMWNzavR26CYhkO6/McvhsClOHwcAZZVd4PUTXnNvlNTSV+EEbEbD5JWryQvmvkV
gazOlPHwd0pj1MkAZBUTdCnR6/DCpqsu68sGMjAiPvR7pb3sBLyRF/DGg9p5Rz0a
s3Q6SuTAg/ZQWNqgRNcnjh2SZbzqZ+GGD4blDTv1pNjWemTDYpxj73Wl5JxKTdtr
6n7/ariQTaiMTpQ8ZhDkDP5eHWx5fyTY7P5MzPEuOmNzW/gLz0V5goSqXUUYhQjm
YAKy5PDS1QwTBKzEOQ8dE3RcOhZtd1X2Vj04CcDgHrjhZBDderPkGs42R3B74cF0
JB4k5QJFJ97Cxe9xYIFHju7Es6+z+j8kLzWGTydfHGMZfHMHCHM=
=nwgH
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Remove the EDAC PowerPC Cell driver due to the removal of the IBM
Cell blades support
- Add a new EDAC driver for Loongson SoCs which reports single-bit
correctable errors
- Extend the SKX and i10NM EDAC drivers to support UV systems which can
have more than 8 nodes
- Add Intel Clearwater Forest server support to i10nm_edac
- Minor fix
* tag 'edac_updates_for_v6.14' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/cell: Remove powerpc Cell driver
EDAC: Add an EDAC driver for the Loongson memory controller
EDAC: Fix typos in comments
EDAC/{i10nm,skx,skx_common}: Support UV systems
EDAC/i10nm: Add Intel Clearwater Forest server support
* ras/edac-drivers:
EDAC/cell: Remove powerpc Cell driver
EDAC: Add an EDAC driver for the Loongson memory controller
EDAC/{i10nm,skx,skx_common}: Support UV systems
EDAC/i10nm: Add Intel Clearwater Forest server support
* ras/edac-misc:
EDAC: Fix typos in comments
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
This driver can no longer be built since support for IBM Cell Blades was
removed, in particular PPC_CELL_COMMON.
Remove the driver.
[ bp: Remove EDAC_CELL from Cell's defconfig too. ]
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241218105523.416573-23-mpe@ellerman.id.au
SMN access was bolted into amd_nb mostly as convenience. This has
limitations though that require incurring tech debt to keep it working.
Move SMN access to the newly introduced AMD Node driver.
Signed-off-by: Mario Limonciello <mario.limonciello@amd.com>
Signed-off-by: Yazen Ghannam <yazen.ghannam@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Acked-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> # pdx86
Acked-by: Shyam Sundar S K <Shyam-sundar.S-k@amd.com> # PMF, PMC
Link: https://lore.kernel.org/r/20241206161210.163701-11-yazen.ghannam@amd.com
The x86_match_cpu() infrastructure can match CPU steppings. Since
there are only 16 possible steppings, the matching infrastructure goes
all out and stores the stepping match as a bitmap. That means it can
match any possible steppings in a single list entry. Fun.
But it exposes this bitmap to each of the X86_MATCH_*() helpers when
none of them really need a bitmap. It makes up for this by exporting a
helper (X86_STEPPINGS()) which converts a contiguous stepping range
into the bitmap which every single user leverages.
Instead of a bitmap, have the main helper for this sort of thing
(X86_MATCH_VFM_STEPS()) just take a stepping range. This ends up
actually being even more compact than before.
Leave the helper in place (renamed to __X86_STEPPINGS()) to make it
more clear what is going on instead of just having a random GENMASK()
in the middle of an already complicated macro.
One oddity that I hit was this macro:
X86_MATCH_VFM_STEPS(vfm, X86_STEPPING_MIN, max_stepping, issues)
It *could* have been converted over to take a min/max stepping value
for each entry. But that would have been a bit too verbose and would
prevent the one oddball in the list (INTEL_COMETLAKE_L stepping 0)
from sticking out.
Instead, just have it take a *maximum* stepping and imply that the match
is from 0=>max_stepping. This is functional for all the cases now and
also retains the nice property of having INTEL_COMETLAKE_L stepping 0
stick out like a sore thumb.
skx_cpuids[] is goofy. It uses the stepping match but encodes all
possible steppings. Just use a normal, non-stepping match helper.
Suggested-by: Ingo Molnar <mingo@kernel.org>
Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
Link: https://lore.kernel.org/all/20241213185129.65527B2A%40davehans-spike.ostc.intel.com
The 3-bit source IDs in PCI configuration space registers, used to map
devices to sockets, are limited to 8 unique IDs, and each ID is local to
a UPI/QPI domain.
Source IDs cannot be used to map devices to sockets on UV systems
because they can exceed 8 sockets and have multiple UPI/QPI domains with
identical, repeating source IDs.
Use NUMA information to get package IDs instead of source IDs on UV
systems, and use package/source IDs to name IMC information structures.
Signed-off-by: Kyle Meyer <kyle.meyer@hpe.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241213012549.43099-1-kyle.meyer@hpe.com/
The intent of the check is to see whether at least one UMC has ECC
enabled. So do that instead of tracking which ones are enabled in masks
which are too small in size anyway and lead to not loading the driver on
Zen4 machines with UMCs enabled over UMC8.
Fixes: e2be5955a8 ("EDAC/amd64: Add support for AMD Family 19h Models 10h-1Fh and A0h-AFh")
Reported-by: Avadhut Naik <avadhut.naik@amd.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Tested-by: Avadhut Naik <avadhut.naik@amd.com>
Reviewed-by: Avadhut Naik <avadhut.naik@amd.com>
Cc: <stable@kernel.org>
Link: https://lore.kernel.org/r/20241210212054.3895697-1-avadhut.naik@amd.com
Clearwater Forest is the successor to Sierra Forest. Add Clearwater
Forest CPU model ID for EDAC support.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Tested-by: Yi Lai <yi1.lai@intel.com>
Link: https://lore.kernel.org/r/20241203022038.72873-1-qiuxu.zhuo@intel.com
The continual trickle of small conversion patches is grating on me, and
is really not helping. Just get rid of the 'remove_new' member
function, which is just an alias for the plain 'remove', and had a
comment to that effect:
/*
* .remove_new() is a relic from a prototype conversion of .remove().
* New drivers are supposed to implement .remove(). Once all drivers are
* converted to not use .remove_new any more, it will be dropped.
*/
This was just a tree-wide 'sed' script that replaced '.remove_new' with
'.remove', with some care taken to turn a subsequent tab into two tabs
to make things line up.
I did do some minimal manual whitespace adjustment for places that used
spaces to line things up.
Then I just removed the old (sic) .remove_new member function, and this
is the end result. No more unnecessary conversion noise.
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
- Rework kfence support for the HPT MMU to work on systems with >= 16TB of RAM.
- Remove the powerpc "maple" platform, used by the "Yellow Dog Powerstation".
- Add support for DYNAMIC_FTRACE_WITH_CALL_OPS,
DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines.
- Add support for running KVM nested guests on Power11.
- Other small features, cleanups and fixes.
Thanks to: Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa Shulyupin,
David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert Uytterhoeven,
Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard, Lukas Bulwahn, Madhavan
Srinivasan, Markus Elfring, Michal Suchanek, Ming Lei, Mukesh Kumar Chaurasiya,
Nathan Chancellor, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, Paulo Miguel
Almeida, Pavithra Prakash, Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P
Bappalige, Shen Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten
Blum, Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun,
zhang jiao.
-----BEGIN PGP SIGNATURE-----
iHUEABYKAB0WIQRjvi15rv0TSTaE+SIF0oADX8seIQUCZ0Fi5AAKCRAF0oADX8se
IeI0AQCAkNWRYzGNzPM6aMwDpq5qdeZzvp0rZxuNsRSnIKJlxAD+PAOxOietgjbQ
Lxt3oizg+UcH/304Y/iyT8IrwI4n+gE=
=xNtu
-----END PGP SIGNATURE-----
Merge tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux
Pull powerpc updates from Michael Ellerman:
- Rework kfence support for the HPT MMU to work on systems with >= 16TB
of RAM.
- Remove the powerpc "maple" platform, used by the "Yellow Dog
Powerstation".
- Add support for DYNAMIC_FTRACE_WITH_CALL_OPS,
DYNAMIC_FTRACE_WITH_DIRECT_CALLS & BPF Trampolines.
- Add support for running KVM nested guests on Power11.
- Other small features, cleanups and fixes.
Thanks to Amit Machhiwal, Arnd Bergmann, Christophe Leroy, Costa
Shulyupin, David Hunter, David Wang, Disha Goel, Gautam Menghani, Geert
Uytterhoeven, Hari Bathini, Julia Lawall, Kajol Jain, Keith Packard,
Lukas Bulwahn, Madhavan Srinivasan, Markus Elfring, Michal Suchanek,
Ming Lei, Mukesh Kumar Chaurasiya, Nathan Chancellor, Naveen N Rao,
Nicholas Piggin, Nysal Jan K.A, Paulo Miguel Almeida, Pavithra Prakash,
Ritesh Harjani (IBM), Rob Herring (Arm), Sachin P Bappalige, Shen
Lichuan, Simon Horman, Sourabh Jain, Thomas Weißschuh, Thorsten Blum,
Thorsten Leemhuis, Venkat Rao Bagalkote, Zhang Zekun, and zhang jiao.
* tag 'powerpc-6.13-1' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: (89 commits)
EDAC/powerpc: Remove PPC_MAPLE drivers
powerpc/perf: Add per-task/process monitoring to vpa_pmu driver
powerpc/kvm: Add vpa latency counters to kvm_vcpu_arch
docs: ABI: sysfs-bus-event_source-devices-vpa-pmu: Document sysfs event format entries for vpa_pmu
powerpc/perf: Add perf interface to expose vpa counters
MAINTAINERS: powerpc: Mark Maddy as "M"
powerpc/Makefile: Allow overriding CPP
powerpc-km82xx.c: replace of_node_put() with __free
ps3: Correct some typos in comments
powerpc/kexec: Fix return of uninitialized variable
macintosh: Use common error handling code in via_pmu_led_init()
powerpc/powermac: Use of_property_match_string() in pmac_has_backlight_type()
powerpc: remove dead config options for MPC85xx platform support
powerpc/xive: Use cpumask_intersects()
selftests/powerpc: Remove the path after initialization.
powerpc/xmon: symbol lookup length fixed
powerpc/ep8248e: Use %pa to format resource_size_t
powerpc/ps3: Reorganize kerneldoc parameter names
KVM: PPC: Book3S HV: Fix kmv -> kvm typo
powerpc/sstep: make emulate_vsx_load and emulate_vsx_store static
...
report the Field Replaceable Unit text info reported through them
- Add support for handling variable-sized SMCA BERT records
- Add the capability for reporting vendor-specific RAS error info without
adding vendor-specific fields to struct mce
- Cleanups
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmc7OlEACgkQEsHwGGHe
VUpXihAAgVdZExo/1Rmbh6s/259BH38GP6fL+ePaT1SlUzNi770TY2b7I4OYlms4
xa9t8LAIVMrrIMIg6w6q8JN4YHAQoVdcbRBvHQYB1a24xtoyxaEJxLKQNLA1soUQ
Jc9asWMHBuXnLfR/4S8Y2vWrzByOSwxqDBzQCu0Ryqvbg7vdRicNt+Hk9oHHIAYy
cquZpoDGL3W6BA8sXONbEW/6rcQ33JsEQ+Ub4qr1q2g+kNwXrrFuXZlojmz2MxIs
xgqeYKyrxK6heX0l8dSiipCATA+sOXXWWzbZtdPjFtDGzwIlV3p4yXN3fucrmHm1
4Fg1gW5a1V82Qosn0FbGiZPojsahhOE2k1bz+yEMDM3Sg2qeRWcK+V3jiS5zKzPd
WWqUbRtcaxayoEsAXnWrxrp3vxhlUUf1Ivtgk8mlMjhHPLijV5iranrRj+XHEikR
H0D3Vm0T1LHCPf9AUsbmo0GAfAOeO9DTAB9LJdKv+OJ4ESVgSPJW/9NKWLXKq41p
hhs7seJTYNw8sp67cL23TnkSp3S+9kd2U7Od3T1kubtd4fVxVnlowu8Fc6kjqd8v
n+GbdLxhX7GbOgnT0z2OG5Xmc1pNW1JtRbuxSK59NFNia7r6ZkR7BE/OCtL82Rfm
u7i76z1O0lV91y93GMCyP9DYn8K1ceU7gVCveY6mx/AHgzc87d8=
=djpG
-----END PGP SIGNATURE-----
Merge tag 'ras_core_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull RAS updates from Borislav Petkov:
- Log and handle twp new AMD-specific MCA registers: SYND1 and SYND2
and report the Field Replaceable Unit text info reported through them
- Add support for handling variable-sized SMCA BERT records
- Add the capability for reporting vendor-specific RAS error info
without adding vendor-specific fields to struct mce
- Cleanups
* tag 'ras_core_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
EDAC/mce_amd: Add support for FRU text in MCA
x86/mce/apei: Handle variable SMCA BERT record size
x86/MCE/AMD: Add support for new MCA_SYND{1,2} registers
tracing: Add __print_dynamic_array() helper
x86/mce: Add wrapper for struct mce to export vendor specific info
x86/mce/intel: Use MCG_BANKCNT_MASK instead of 0xff
x86/mce/mcelog: Use xchg() to get and clear the flags
- Add support for Intel Panther Lake-H to igen6_edac
- Add polling support to igen6_edac as some Intel M100 chips have trouble with
error interrupts
- Add Kaby Lake-S support to ie31200_edac
- Fix memory source detection in the SKX common module which is used by
a couple of Intel EDAC drivers
- Add support for the NXP i.MX9 memory controller to fsl_edac
- The usual fixes and cleanups all over the place
-----BEGIN PGP SIGNATURE-----
iQIzBAABCgAdFiEEzv7L6UO9uDPlPSfHEsHwGGHeVUoFAmc7KAcACgkQEsHwGGHe
VUpzxQ/6Ahr49jXu58M69UQSW3DdzEU+5NNxmUrZdRrdW/oCJXGpuRdmdFWzvWTj
HtfCS7GmaSIUPjLaNisyKdCaZxWysBqyLe0Vaexw5nuyybF5TzdYWETqFef1ij9z
Wqq1j5LPrz+9BiqFqkpbgzo6Y6Ubsv2RKuZu+1GkMT2zRrgEJuJgHi6RlJ8vqj//
7FePl3CFQ3HDdTom0/L/gsMqSObj7HEq9cbalIjIYw/GRVkZol21vDwKrUkM7rpF
tfrN1qq3NuJyqM7Du2jw2VtXDomrQ/ZkABNXCbtbczf8trLYUHR5QqIQjxy2ZFts
jMKIbdCNAfgiqai6bpmm4QHWAIAV3L5DX7OuPmbpQeAzSmOqSEqNbnLbvA1e472f
5upQH4OLOsHgbnnFTQJ7vcU5jHf41DSauMCFp60h2hyn5RIiVY5ASxRfQ3xdh/+a
hp2N+hB/y46AjXAidsGhAuUw8nt44MN2x1gtiUfbtMIx6gTewtuu0SbwOb85JW16
glhD8vxRGTUWoQit+Nh3u/P/rLSGkUJK87mfPr6O/95lleYy5hOizK2jGDbDWkA+
zOnNXnSWKK/WM+B9qnJnU1sCC7vT3j7cTaDXB1XS2MtcJbArkNC0FOd6xD81PoGh
MhfWBAKpirXQEomFqpVziDa2wlaUnZrv7/4GGmaBRO401O9iaE4=
=C3dY
-----END PGP SIGNATURE-----
Merge tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras
Pull EDAC updates from Borislav Petkov:
- Add support for Bluefield-2 SOCs to bluefield_edac
- Add support for Intel Panther Lake-H to igen6_edac
- Add polling support to igen6_edac as some Intel M100 chips have
trouble with error interrupts
- Add Kaby Lake-S support to ie31200_edac
- Fix memory source detection in the SKX common module which is used by
a couple of Intel EDAC drivers
- Add support for the NXP i.MX9 memory controller to fsl_edac
- The usual fixes and cleanups all over the place
* tag 'edac_updates_for_v6.13' of git://git.kernel.org/pub/scm/linux/kernel/git/ras/ras:
EDAC/igen6: Add polling support
EDAC/igen6: Initialize edac_op_state according to the configuration data
EDAC/igen6: Avoid segmentation fault on module unload
EDAC/ie31200: Add Kaby Lake-S dual-core host bridge ID
MAINTAINERS: Change FSL DDR EDAC maintainership
EDAC/{skx_common,i10nm}: Fix incorrect far-memory error source indicator
EDAC/skx_common: Differentiate memory error sources
EDAC/fsl_ddr: Add support for i.MX9 DDR controller
dt-bindings: memory: fsl: Add compatible string nxp,imx9-memory-controller
EDAC/fsl_ddr: Fix bad bit shift operations
EDAC/fsl_ddr: Move global variables into struct fsl_mc_pdata
EDAC/fsl_ddr: Pass down fsl_mc_pdata in ddr_in32() and ddr_out32()
RAS/AMD/ATL: Add debug prints for DF register reads
EDAC/bluefield: Use Arm SMC for EMI access on BlueField-2
EDAC/bluefield: Fix potential integer overflow
EDAC/igen6: Add Intel Panther Lake-H SoCs support
These two drivers are only buildable for the powerpc "maple" platform
(CONFIG_PPC_MAPLE), which has now been removed, see
commit 62f8f307c8 ("powerpc/64: Remove maple platform").
Remove the drivers.
Acked-by: Borislav Petkov (AMD) <bp@alien8.de>
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://patch.msgid.link/20241112084134.411964-1-mpe@ellerman.id.au
Some PCs with Intel N100 (with PCI device 8086:461c, DID_ADL_N_SKU4)
experienced issues with error interrupts not working, even with the
following configuration in the BIOS.
In-Band ECC Support: Enabled
In-Band ECC Operation Mode: 2 (make all requests protected and
ignore range checks)
IBECC Error Injection Control: Inject Correctable Error on insertion
counter
Error Injection Insertion Count: 251658240 (0xf000000)
Add polling mode support for these machines to ensure that memory error
events are handled.
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Reviewed-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-3-orange@aiven.io
Currently, igen6_edac sets edac_op_state to EDAC_OPSTATE_NMI, while the
driver also supports memory errors reported from Machine Check. Initialize
edac_op_state to the correct value according to the configuration data
that the driver probed.
Signed-off-by: Qiuxu Zhuo <qiuxu.zhuo@intel.com>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/all/20241106114024.941659-2-orange@aiven.io
The segmentation fault happens because:
During modprobe:
1. In igen6_probe(), igen6_pvt will be allocated with kzalloc()
2. In igen6_register_mci(), mci->pvt_info will point to
&igen6_pvt->imc[mc]
During rmmod:
1. In mci_release() in edac_mc.c, it will kfree(mci->pvt_info)
2. In igen6_remove(), it will kfree(igen6_pvt);
Fix this issue by setting mci->pvt_info to NULL to avoid the double
kfree.
Fixes: 10590a9d4f ("EDAC/igen6: Add EDAC driver for Intel client SoCs using IBECC")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219360
Signed-off-by: Orange Kao <orange@aiven.io>
Signed-off-by: Tony Luck <tony.luck@intel.com>
Link: https://lore.kernel.org/r/20241104124237.124109-2-orange@aiven.io