path: root/arch
Commit message / Author / Date / Files / Lines
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 2024-10-06, 9 files, -42/+82)

    Pull kvm fixes from Paolo Bonzini:
    "ARM64:

     - Fix pKVM error path on init, making sure we do not change critical
       system registers as we're about to fail

     - Make sure that the host's vector length is capped by a value
       common to all CPUs

     - Fix kvm_has_feat*() handling of "negative" features, as the
       current code is pretty broken

     - Promote Joey to the status of official reviewer, while James steps
       down -- hopefully only temporarily

    x86:

     - Fix compilation with KVM_INTEL=KVM_AMD=n

     - Fix disabling KVM_X86_QUIRK_SLOT_ZAP_ALL when the shadow MMU is in
       use

    Selftests:

     - Fix compilation on non-x86 architectures"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm:
      x86/reboot: emergency callbacks are now registered by common KVM code
      KVM: x86: leave kvm.ko out of the build if no vendor module is requested
      KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU
      KVM: arm64: Fix kvm_has_feat*() handling of negative features
      KVM: selftests: Fix build on architectures other than x86_64
      KVM: arm64: Another reviewer reshuffle
      KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM
      KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path
| * Merge tag 'kvmarm-fixes-6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into HEAD  (Paolo Bonzini, 2024-10-06, 1400 files, -19326/+55927)

    KVM/arm64 fixes for 6.12, take #1

    - Fix pKVM error path on init, making sure we do not change critical
      system registers as we're about to fail

    - Make sure that the host's vector length is capped by a value common
      to all CPUs

    - Fix kvm_has_feat*() handling of "negative" features, as the current
      code is pretty broken

    - Promote Joey to the status of official reviewer, while James steps
      down -- hopefully only temporarily
| | * KVM: arm64: Fix kvm_has_feat*() handling of negative features  (Marc Zyngier, 2024-10-03, 1 file, -12/+13)

    Oliver reports that the kvm_has_feat() helper is not behaving as
    expected for negative features. On investigation, the main issue
    seems to be caused by the following construct:

      #define get_idreg_field(kvm, id, fld)                   \
              (id##_##fld##_SIGNED ?                          \
               get_idreg_field_signed(kvm, id, fld) :         \
               get_idreg_field_unsigned(kvm, id, fld))

    where one side of the expression evaluates as something signed, and
    the other as something unsigned. In retrospect, this is totally
    braindead, as the compiler converts this into an unsigned expression.
    When compared to something that is 0, the test is simply elided. Epic
    fail. A similar issue exists in the expand_field_sign() macro.

    The correct way to handle this is to choose between signed and
    unsigned comparisons, so that both sides of the ternary expression
    are of the same type (bool).

    In order to keep the code readable (sort of), we introduce new
    comparison primitives taking an operator as a parameter, and rewrite
    the kvm_has_feat*() helpers in terms of these primitives.

    Fixes: c62d7a23b947 ("KVM: arm64: Add feature checking helpers")
    Reported-by: Oliver Upton <oliver.upton@linux.dev>
    Tested-by: Oliver Upton <oliver.upton@linux.dev>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/r/20241002204239.2051637-1-maz@kernel.org
    Signed-off-by: Marc Zyngier <maz@kernel.org>
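    A standalone illustration of the C pitfall this commit describes
    (host-side demo, not kernel code): when one arm of a ?: expression is
    signed and the other unsigned, the usual arithmetic conversions make
    the whole expression unsigned, so a "negative" field value can never
    test as below zero.

      #include <stdio.h>

      int main(void)
      {
              int field_is_signed = 1;
              long long sval = -1;          /* signed ID register field   */
              unsigned long long uval = 1;  /* unsigned ID register field */

              /*
               * Mixed signed/unsigned arms: the result type is unsigned
               * long long, so -1 becomes ULLONG_MAX and the < 0 test is
               * always false (the compiler may elide it entirely).
               */
              if ((field_is_signed ? sval : uval) < 0)
                      printf("negative\n");
              else
                      printf("non-negative\n");  /* this is what prints */
              return 0;
      }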
| | * KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVM  (Mark Brown, 2024-10-01, 2 files, -6/+8)

    When pKVM saves and restores the host floating point state on an SVE
    system, it programs the vector length in ZCR_EL2.LEN to be whatever
    the maximum VL for the PE is. But it uses a buffer allocated with
    kvm_host_sve_max_vl, the maximum VL shared by all PEs in the system.
    This means that if we run on a system where the maximum VLs are not
    consistent, we will overflow the buffer on PEs which support larger
    VLs.

    Since the host will not currently attempt to make use of non-shared
    VLs, fix this by explicitly setting the EL2 VL to be the maximum
    shared VL when we save and restore. This will enforce the limit on
    host VL usage. Should we wish to support asymmetric VLs, this code
    will need to be updated along with the required changes for the host:

      https://lore.kernel.org/r/20240730-kvm-arm64-fix-pkvm-sve-vl-v6-0-cae8a2e0bd66@kernel.org

    Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM")
    Signed-off-by: Mark Brown <broonie@kernel.org>
    Tested-by: Fuad Tabba <tabba@google.com>
    Reviewed-by: Fuad Tabba <tabba@google.com>
    Link: https://lore.kernel.org/r/20240912-kvm-arm64-limit-guest-vl-v2-1-dd2c29cb2ac9@kernel.org
    [maz: added punctuation to the commit message]
    Signed-off-by: Marc Zyngier <maz@kernel.org>
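    A minimal sketch of the sizing invariant at issue, with invented names
    (SHARED_MAX_VL and vl_to_program() are not kernel symbols; the real
    buffer is sized with kvm_host_sve_max_vl in the pKVM hyp code): the VL
    programmed into ZCR_EL2.LEN before a save/restore must never exceed
    the VL the buffer was sized for.

      #define SHARED_MAX_VL 256  /* maximum VL shared by all PEs, in bytes */

      struct sve_save_buf {
              unsigned char z_regs[32][SHARED_MAX_VL];  /* Z0..Z31 */
      };

      static unsigned int vl_to_program(unsigned int this_pe_max_vl)
      {
              /* Cap at the shared maximum rather than this PE's maximum,
               * otherwise PEs with larger VLs overflow the buffer. */
              return this_pe_max_vl < SHARED_MAX_VL ? this_pe_max_vl
                                                    : SHARED_MAX_VL;
      }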
| | * KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error path  (Vincent Donnefort, 2024-10-01, 1 file, -2/+4)

    On an error, hyp_vcpu will be accessed while this memory has already
    been relinquished to the host and unmapped from the hypervisor.
    Protect the CPTR assignment with an early return.

    Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM")
    Reviewed-by: Oliver Upton <oliver.upton@linux.dev>
    Signed-off-by: Vincent Donnefort <vdonnefort@google.com>
    Link: https://lore.kernel.org/r/20240919110500.2345927-1-vdonnefort@google.com
    Signed-off-by: Marc Zyngier <maz@kernel.org>
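    A compilable sketch of the error-path shape described above (every
    name here is invented; this is not the actual __pkvm_init_vcpu()
    code): once the memory has been relinquished to the host and unmapped
    at EL2, nothing may touch it, so return early instead of falling
    through to the assignment.

      struct hyp_vcpu_sketch { unsigned long cptr_el2; };

      static int do_vcpu_init(struct hyp_vcpu_sketch *v) { (void)v; return -1; }
      static void relinquish_to_host(struct hyp_vcpu_sketch *v) { (void)v; }
      static unsigned long compute_cptr(void) { return 0; }

      static int init_vcpu_sketch(struct hyp_vcpu_sketch *vcpu)
      {
              int ret = do_vcpu_init(vcpu);

              if (ret) {
                      /* Memory is donated back to the host and unmapped
                       * from the hypervisor here... */
                      relinquish_to_host(vcpu);
                      return ret;  /* ...so never dereference vcpu again */
              }

              vcpu->cptr_el2 = compute_cptr();  /* safe: still mapped */
              return 0;
      }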
| * | x86/reboot: emergency callbacks are now registered by common KVM code  (Paolo Bonzini, 2024-10-06, 2 files, -4/+4)

    Guard them with CONFIG_KVM_X86_COMMON rather than the two vendor
    modules. In practice this has no functional change, because
    CONFIG_KVM_X86_COMMON is set if and only if at least one
    vendor-specific module is being built. However, it is cleaner to
    specify CONFIG_KVM_X86_COMMON for functions that are used in kvm.ko.

    Reported-by: Linus Torvalds <torvalds@linux-foundation.org>
    Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
    Fixes: 6d55a94222db ("x86/reboot: Unconditionally define cpu_emergency_virt_cb typedef")
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: leave kvm.ko out of the build if no vendor module is requested  (Paolo Bonzini, 2024-10-06, 2 files, -4/+7)

    kvm.ko is nothing but library code shared by kvm-intel.ko and
    kvm-amd.ko. It provides no functionality on its own and it is
    unnecessary unless one of the vendor-specific modules is compiled. In
    particular, /dev/kvm is not created until one of kvm-intel.ko or
    kvm-amd.ko is loaded.

    Use CONFIG_KVM to decide if it is built-in or a module, but use the
    vendor-specific modules for the actual decision on whether to build
    it.

    This also fixes a build failure when CONFIG_KVM_INTEL and
    CONFIG_KVM_AMD are both disabled. The
    cpu_emergency_register_virt_callback() function is called from
    kvm.ko, but it is only defined if at least one of CONFIG_KVM_INTEL
    and CONFIG_KVM_AMD is provided.

    Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86/mmu: fix KVM_X86_QUIRK_SLOT_ZAP_ALL for shadow MMU  (Paolo Bonzini, 2024-10-04, 1 file, -14/+46)

    As was tried in commit 4e103134b862 ("KVM: x86/mmu: Zap only the
    relevant pages when removing a memslot"), all shadow pages, i.e.
    non-leaf SPTEs, need to be zapped. All of the accounting for a shadow
    page is tied to the memslot, i.e. the shadow page holds a reference
    to the memslot, for all intents and purposes. Deleting the memslot
    without removing all relevant shadow pages, as is done when
    KVM_X86_QUIRK_SLOT_ZAP_ALL is disabled, results in NULL pointer
    derefs when tearing down the VM.

    Reintroduce from that commit the code that walks the whole memslot
    when there are active shadow MMU pages.

    Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge tag 'powerpc-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux  (Linus Torvalds, 2024-10-06, 1 file, -1/+1)

    Pull powerpc fix from Michael Ellerman:

    - Allow r30 to be used in vDSO code generation of getrandom

    Thanks to Jason A. Donenfeld.

    * tag 'powerpc-6.12-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux:
      powerpc/vdso: allow r30 in vDSO code generation of getrandom
| * | | powerpc/vdso: allow r30 in vDSO code generation of getrandom  (Jason A. Donenfeld, 2024-09-30, 1 file, -1/+1)

    For gettimeofday, -ffixed-r30 was passed to work around a bug in Go
    code, where the vDSO trampoline forgot to save and restore this
    register across function calls. But Go requires a different
    trampoline for every call, and there's no reason for new Go code to
    be broken or to add more bugs. So remove -ffixed-r30 for getrandom.

    Fixes: 8072b39c3a75 ("powerpc/vdso: Wire up getrandom() vDSO implementation on VDSO64")
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
    Link: https://msgid.link/20240925175021.1526936-2-Jason@zx2c4.com
* | | Merge tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux  (Linus Torvalds, 2024-10-04, 4 files, -4/+10)

    Pull arm64 fixes from Catalin Marinas:
    "A couple of build/config issues and expanding the speculative SSBS
    workaround to more CPUs:

     - Expand the speculative SSBS workaround to cover Cortex-A715,
       Neoverse-N3 and Microsoft Azure Cobalt 100

     - Force position-independent veneers - in some kernel
       configurations, the LLD linker generates position-dependent
       veneers for otherwise position-independent code, resulting in
       early boot-time failures

     - Fix Kconfig selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS so that it
       is not enabled when not supported by the combination of clang and
       GNU ld"

    * tag 'arm64-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/arm64/linux:
      arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386
      arm64: fix selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS
      arm64: errata: Expand speculative SSBS workaround once more
      arm64: cputype: Add Neoverse-N3 definitions
      arm64: Force position-independent veneers
| * | | arm64: Subscribe Microsoft Azure Cobalt 100 to erratum 3194386  (Easwar Hariharan, 2024-10-04, 1 file, -0/+1)

    Add the Microsoft Azure Cobalt 100 CPU to the list of CPUs suffering
    from erratum 3194386 added in commit 75b3c43eab59 ("arm64: errata:
    Expand speculative SSBS workaround").

    CC: Mark Rutland <mark.rutland@arm.com>
    CC: James Morse <james.morse@arm.com>
    CC: Will Deacon <will@kernel.org>
    CC: stable@vger.kernel.org # 6.6+
    Signed-off-by: Easwar Hariharan <eahariha@linux.microsoft.com>
    Link: https://lore.kernel.org/r/20241003225239.321774-1-eahariha@linux.microsoft.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
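    The shape of such a one-line subscription, as a hedged illustration
    (the list name erratum_spec_ssbs_list comes from the companion errata
    commit below; the surrounding entries are elided and this is not the
    verbatim hunk):

      static const struct midr_range erratum_spec_ssbs_list[] = {
              /* ... existing Arm Ltd entries ... */
              MIDR_ALL_VERSIONS(MIDR_MICROSOFT_AZURE_COBALT_100),
              {},
      };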
| * | | arm64: fix selection of HAVE_DYNAMIC_FTRACE_WITH_ARGS  (Mark Rutland, 2024-10-01, 1 file, -3/+2)

    The Kconfig logic to select HAVE_DYNAMIC_FTRACE_WITH_ARGS is
    incorrect, and HAVE_DYNAMIC_FTRACE_WITH_ARGS may be selected when it
    is not supported by the combination of clang and GNU LD, resulting in
    link-time errors:

      aarch64-linux-gnu-ld: .init.data has both ordered
      [`__patchable_function_entries' in init/main.o] and unordered
      [`.meminit.data' in mm/sparse.o] sections
      aarch64-linux-gnu-ld: final link failed: bad value

    ... which can be seen when building with CC=clang using a binutils
    version older than 2.36.

    We originally fixed that in commit:

      45bd8951806eb5e8 ("arm64: Improve HAVE_DYNAMIC_FTRACE_WITH_REGS selection for clang")

    ... by splitting the "select HAVE_DYNAMIC_FTRACE_WITH_ARGS" statement
    into separate CLANG_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS and
    GCC_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS options which individually
    select HAVE_DYNAMIC_FTRACE_WITH_ARGS.

    Subsequently we accidentally re-introduced the common "select
    HAVE_DYNAMIC_FTRACE_WITH_ARGS" statement in commit:

      26299b3f6ba26bfc ("ftrace: arm64: move from REGS to ARGS")

    ... then we removed it again in commit:

      68a63a412d18bd2e ("arm64: Fix build with CC=clang, CONFIG_FTRACE=y and CONFIG_STACK_TRACER=y")

    ... then we accidentally re-introduced it again in commit:

      2aa6ac03516d078c ("arm64: ftrace: Add direct call support")

    Fix this for the third time by keeping the unified select statement
    and making it depend on either GCC_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS
    or CLANG_SUPPORTS_DYNAMIC_FTRACE_WITH_ARGS. This is more consistent
    with usual style and less likely to go wrong in future.

    Fixes: 2aa6ac03516d ("arm64: ftrace: Add direct call support")
    Cc: <stable@vger.kernel.org> # 6.4.x
    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20240930120448.3352564-1-mark.rutland@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | | arm64: errata: Expand speculative SSBS workaround once more  (Mark Rutland, 2024-10-01, 2 files, -0/+4)

    A number of Arm Ltd CPUs suffer from errata whereby an MSR to the
    SSBS special-purpose register does not affect subsequent speculative
    instructions, permitting speculative store bypassing for a window of
    time.

    We worked around this for a number of CPUs in commits:

    * 7187bb7d0b5c7dfa ("arm64: errata: Add workaround for Arm errata 3194386 and 3312417")
    * 75b3c43eab594bfb ("arm64: errata: Expand speculative SSBS workaround")
    * 145502cac7ea70b5 ("arm64: errata: Expand speculative SSBS workaround (again)")

    Since then, a (hopefully final) batch of updates has been published,
    with two more affected CPUs. For the affected CPUs the existing
    mitigation is sufficient, as described in their respective Software
    Developer Errata Notice (SDEN) documents:

    * Cortex-A715 (MP148) SDEN v15.0, erratum 3456084
      https://developer.arm.com/documentation/SDEN-2148827/1500/

    * Neoverse-N3 (MP195) SDEN v5.0, erratum 3456111
      https://developer.arm.com/documentation/SDEN-3050973/0500/

    Enable the existing mitigation by adding the relevant MIDRs to
    erratum_spec_ssbs_list, and update silicon-errata.rst and the Kconfig
    text accordingly.

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20240930111705.3352047-3-mark.rutland@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | | arm64: cputype: Add Neoverse-N3 definitions  (Mark Rutland, 2024-10-01, 1 file, -0/+2)

    Add cputype definitions for Neoverse-N3. These will be used for
    errata detection in subsequent patches.

    These values can be found in Table A-261 ("MIDR_EL1 bit
    descriptions") in issue 02 of the Neoverse-N3 TRM, which can be found
    at:

      https://developer.arm.com/documentation/107997/0000/?lang=en

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Cc: James Morse <james.morse@arm.com>
    Cc: Will Deacon <will@kernel.org>
    Link: https://lore.kernel.org/r/20240930111705.3352047-2-mark.rutland@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
| * | | arm64: Force position-independent veneers  (Mark Rutland, 2024-10-01, 1 file, -1/+1)

    Certain portions of code always need to be position-independent
    regardless of CONFIG_RELOCATABLE, including code which is executed in
    an idmap or which is executed before relocations are applied. In some
    kernel configurations the LLD linker generates position-dependent
    veneers for such code, and when executed these result in early
    boot-time failures.

    Marc Zyngier encountered a boot failure resulting from this when
    building a (particularly cursed) configuration with LLVM, as he
    reported to the list:

      https://lore.kernel.org/linux-arm-kernel/86wmjwvatn.wl-maz@kernel.org/

    In Marc's kernel configuration, the .head.text and .rodata.text
    sections end up more than 128MiB apart, requiring a veneer to branch
    between the two:

    | [mark@lakrids:~/src/linux]% usekorg 14.1.0 aarch64-linux-objdump -t vmlinux | grep -w _text
    | ffff800080000000 g       .head.text   0000000000000000 _text
    | [mark@lakrids:~/src/linux]% usekorg 14.1.0 aarch64-linux-objdump -t vmlinux | grep -w primary_entry
    | ffff8000889df0e0 g       .rodata.text 000000000000006c primary_entry

    ... consequently, LLD inserts a position-dependent veneer for the
    branch from _stext (in .head.text) to primary_entry (in
    .rodata.text):

    | ffff800080000000 <_text>:
    | ffff800080000000: fa405a4d  ccmp x18, #0x0, #0xd, pl  // pl = nfrst
    | ffff800080000004: 14003fff  b    ffff800080010000 <__AArch64AbsLongThunk_primary_entry>
    ...
    | ffff800080010000 <__AArch64AbsLongThunk_primary_entry>:
    | ffff800080010000: 58000050  ldr  x16, ffff800080010008 <__AArch64AbsLongThunk_primary_entry+0x8>
    | ffff800080010004: d61f0200  br   x16
    | ffff800080010008: 889df0e0  .word 0x889df0e0
    | ffff80008001000c: ffff8000  .word 0xffff8000

    ... and as this is executed early in boot before the kernel is mapped
    in TTBR1 this results in a silent boot failure.

    Fix this by passing '--pic-veneer' to the linker, which will cause
    the linker to use position-independent veneers, e.g.

    | ffff800080000000 <_text>:
    | ffff800080000000: fa405a4d  ccmp x18, #0x0, #0xd, pl  // pl = nfrst
    | ffff800080000004: 14003fff  b    ffff800080010000 <__AArch64ADRPThunk_primary_entry>
    ...
    | ffff800080010000 <__AArch64ADRPThunk_primary_entry>:
    | ffff800080010000: f004e3f0  adrp x16, ffff800089c8f000 <__idmap_text_start>
    | ffff800080010004: 91038210  add  x16, x16, #0xe0
    | ffff800080010008: d61f0200  br   x16

    I've opted to pass '--pic-veneer' unconditionally, as:

    * In addition to solving the boot failure, these sequences are
      generally nicer as they require fewer instructions and don't need
      to perform data accesses.

    * While the position-independent veneer sequences have a limited
      +/-2GiB range, this is not a new restriction. Even kernels built
      with CONFIG_RELOCATABLE=n are limited to 2GiB in size as we have
      several structures using 32-bit relative offsets and PPREL32
      relocations, which are similarly limited to +/-2GiB in range. These
      include extable entries, jump table entries, and alt_instr entries.

    * GNU LD defaults to using position-independent veneers, and supports
      the same '--pic-veneer' option, so this change is not expected to
      adversely affect GNU LD.

    I've tested with GNU LD 2.30 to 2.42 inclusive and LLVM 13.0.1 to
    19.1.0 inclusive, using the kernel.org binaries from:

    * https://mirrors.edge.kernel.org/pub/tools/crosstool/
    * https://mirrors.edge.kernel.org/pub/tools/llvm/

    Signed-off-by: Mark Rutland <mark.rutland@arm.com>
    Reported-by: Marc Zyngier <maz@kernel.org>
    Cc: Ard Biesheuvel <ardb@kernel.org>
    Cc: Nathan Chancellor <nathan@kernel.org>
    Cc: Nick Desaulniers <ndesaulniers@google.com>
    Cc: Will Deacon <will@kernel.org>
    Acked-by: Ard Biesheuvel <ardb@kernel.org>
    Reviewed-by: Nathan Chancellor <nathan@kernel.org>
    Link: https://lore.kernel.org/r/20240927101838.3061054-1-mark.rutland@arm.com
    Signed-off-by: Catalin Marinas <catalin.marinas@arm.com>
* | | Merge tag 'riscv-for-linus-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux  (Linus Torvalds, 2024-10-04, 2 files, -3/+7)

    Pull RISC-V fixes from Palmer Dabbelt:

    - PERF_TYPE_BREAKPOINT now returns -EOPNOTSUPP instead of -ENOENT,
      which aligns with other ports and is a saner value

    - The KASAN-related stack size increasing logic has been moved to a C
      header, to avoid dependency issues

    * tag 'riscv-for-linus-6.12-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux:
      riscv: Fix kernel stack size when KASAN is enabled
      drivers/perf: riscv: Align errno for unsupported perf event
| * | | riscv: Fix kernel stack size when KASAN is enabled  (Alexandre Ghiti, 2024-10-01, 2 files, -3/+7)

    We use Kconfig to select the kernel stack size, doubling the default
    size if KASAN is enabled. But that actually only works if KASAN is
    selected from the beginning, meaning that if the KASAN config is
    added later (for example using menuconfig), CONFIG_THREAD_SIZE_ORDER
    won't be updated, keeping the default size, which is not enough for
    KASAN as reported in [1].

    So fix this by moving the logic to compute the right kernel stack
    size into a header.

    Fixes: a7555f6b62e7 ("riscv: stack: Add config of thread stack size")
    Reported-by: syzbot+ba9eac24453387a9d502@syzkaller.appspotmail.com
    Closes: https://lore.kernel.org/all/000000000000eb301906222aadc2@google.com/ [1]
    Cc: stable@vger.kernel.org
    Signed-off-by: Alexandre Ghiti <alexghiti@rivosinc.com>
    Link: https://lore.kernel.org/r/20240917150328.59831-1-alexghiti@rivosinc.com
    Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
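    A minimal sketch of what "compute the stack size in a header" means,
    with invented base values (the macro names mirror common kernel
    conventions, not the exact riscv change): preprocessor logic in a
    header always tracks the final CONFIG_KASAN setting, unlike a Kconfig
    default computed when the option was first written out.

      #define PAGE_SIZE 4096  /* illustrative */

      #ifdef CONFIG_KASAN
      #define KASAN_STACK_ORDER 1  /* double the stack for KASAN overhead */
      #else
      #define KASAN_STACK_ORDER 0
      #endif

      #define THREAD_SIZE_ORDER (2 + KASAN_STACK_ORDER)  /* base order invented */
      #define THREAD_SIZE (PAGE_SIZE << THREAD_SIZE_ORDER)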
* | | Merge tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace  (Linus Torvalds, 2024-10-04, 1 file, -0/+2)

    Pull tracing fixes from Steven Rostedt:

    - Fix tp_printk command line option crashing the kernel

      With the code that can handle a buffer from a previous boot,
      trace_check_vprintf() needed access to the delta of the address
      space used by the old buffer and the current buffer. To do so, the
      trace_array (tr) parameter was used. But when tp_printk is enabled
      on the kernel command line, no trace buffer is used and the trace
      event is sent directly to printk(). That meant the tr field of the
      iterator descriptor was NULL, and since tp_printk still uses
      trace_check_vprintf() it caused a NULL dereference.

    - Add ptrace.h include to x86 ftrace file for completeness

    - Fix rtla installation when done with out-of-tree build

    - Fix the help messages in rtla that were incorrect

    - Several fixes for races in the timerlat and hwlat code

      Several locking issues were discovered with the coordination
      between timerlat kthread creation and hotplug, as timerlat has
      callbacks from hotplug code to start kthreads when CPUs come
      online. There are also locking issues with grabbing the
      cpu_read_lock() and the locks within timerlat.

    * tag 'trace-v6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace:
      tracing/hwlat: Fix a race during cpuhp processing
      tracing/timerlat: Fix a race during cpuhp processing
      tracing/timerlat: Drop interface_lock in stop_kthread()
      tracing/timerlat: Fix duplicated kthread creation due to CPU online/offline
      x86/ftrace: Include <asm/ptrace.h>
      rtla: Fix the help text in osnoise and timerlat top tools
      tools/rtla: Fix installation from out-of-tree build
      tracing: Fix trace_check_vprintf() when tp_printk is used
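    A compilable sketch of the NULL-tolerance pattern the first fix
    implies (struct and field names simplified, not the actual tracing
    code): when tp_printk sends events straight to printk(), the iterator
    has no trace_array, so any delta lookup must tolerate a NULL tr.

      struct trace_array_sketch { long text_delta; };
      struct trace_iterator_sketch { struct trace_array_sketch *tr; };

      static long text_delta_sketch(struct trace_iterator_sketch *iter)
      {
              /* No trace buffer (tp_printk path): no address-space delta. */
              return iter->tr ? iter->tr->text_delta : 0;
      }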
| * | | x86/ftrace: Include <asm/ptrace.h>  (Sami Tolvanen, 2024-10-03, 1 file, -0/+2)

    <asm/ftrace.h> uses struct pt_regs in several places. Include
    <asm/ptrace.h> to ensure it's visible. This is needed to make sure
    object files that only include <asm/asm-prototypes.h> compile.

    Cc: Mark Rutland <mark.rutland@arm.com>
    Cc: Thomas Gleixner <tglx@linutronix.de>
    Cc: Ingo Molnar <mingo@redhat.com>
    Cc: Borislav Petkov <bp@alien8.de>
    Cc: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: "H. Peter Anvin" <hpa@zytor.com>
    Link: https://lore.kernel.org/20240916221557.846853-2-samitolvanen@google.com
    Suggested-by: Masahiro Yamada <masahiroy@kernel.org>
    Signed-off-by: Sami Tolvanen <samitolvanen@google.com>
    Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org>
    Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
* | | Merge tag 'rust-fixes-6.12' of https://github.com/Rust-for-Linux/linux  (Linus Torvalds, 2024-10-04, 1 file, -1/+17)

    Pull Rust fixes from Miguel Ojeda:

    "Toolchain and infrastructure:

     - Fix/improve a couple of 'depends on' conditions on the newly added
       CFI/KASAN support to avoid build errors/warnings

     - Fix ARCH_SLAB_MINALIGN multiple definition error for RISC-V under
       !CONFIG_MMU

     - Clean upcoming (Rust 1.83.0) Clippy warnings

    'kernel' crate:

     - 'sync' module: fix soundness issue by requiring 'T: Sync' for
       'LockedBy::access'; and fix helpers build error under PREEMPT_RT

     - Fix trivial sorting issue ('rustfmtcheck') on the v6.12 Rust
       merge"

    * tag 'rust-fixes-6.12' of https://github.com/Rust-for-Linux/linux:
      rust: kunit: use C-string literals to clean warning
      cfi: encode cfi normalized integers + kasan/gcov bug in Kconfig
      rust: KASAN+RETHUNK requires rustc 1.83.0
      rust: cfi: fix `patchable-function-entry` starting version
      rust: mutex: fix __mutex_init() usage in case of PREEMPT_RT
      rust: fix `ARCH_SLAB_MINALIGN` multiple definition error
      rust: sync: require `T: Sync` for `LockedBy::access`
      rust: kernel: sort Rust modules
| * | | cfi: encode cfi normalized integers + kasan/gcov bug in Kconfig  (Alice Ryhl, 2024-09-26, 1 file, -1/+17)

    There is a bug in the LLVM implementation of KASAN and GCOV that
    makes these options incompatible with the
    CFI_ICALL_NORMALIZE_INTEGERS option. The bug has already been fixed
    in llvm/clang [1] and rustc [2]. However, Kconfig currently has no
    way to gate features on the LLVM version inside rustc, so we cannot
    write down a precise 'depends on' clause in this case. Instead, a
    'def_bool' option is defined for whether
    CFI_ICALL_NORMALIZE_INTEGERS is available, and its default value is
    set to false when GCOV or KASAN are turned on. End users using a
    patched clang/rustc can turn on the
    HAVE_CFI_ICALL_NORMALIZE_INTEGERS option directly to override this.

    An alternative solution is to inspect a binary created by clang or
    rustc to see whether the faulty CFI tags are in the binary. This
    would be a precise check, but it would involve hard-coding the
    *hashed* version of the CFI tag. This is because there's no way to
    get clang or rustc to output the unhashed version of the CFI tag.
    Relying on the precise hashing algorithm used by CFI seems too
    fragile, so I have not pursued this option. Besides, this kind of
    hack is exactly what led to the LLVM bug in the first place.

    If the CFI_ICALL_NORMALIZE_INTEGERS option is used without
    CONFIG_RUST, then we actually can perform a precise check today: just
    compare the clang version number. This works since clang and llvm are
    always updated in lockstep. However, encoding this in Kconfig would
    give the HAVE_CFI_ICALL_NORMALIZE_INTEGERS option a dependency on
    CONFIG_RUST, which is not possible as the reverse dependency already
    exists.

    HAVE_CFI_ICALL_NORMALIZE_INTEGERS is defined to be a 'def_bool'
    instead of 'bool' to avoid asking end users whether they want to turn
    on the option. Turning it on explicitly is something only experts
    should do, so making it hard to do so is not an issue.

    I added a 'depends on CFI_CLANG' clause to the new Kconfig option.
    I'm not sure whether that makes sense or not, but it doesn't seem to
    make a big difference.

    In a future kernel release, I would like to add a Kconfig option
    similar to CLANG_VERSION/RUSTC_VERSION for inspecting the version of
    the LLVM inside rustc. Once that feature lands, this logic will be
    replaced with a precise version check. This check is not being
    introduced here to avoid introducing a new _VERSION constant in a
    fix.

    Link: https://github.com/llvm/llvm-project/pull/104826 [1]
    Link: https://github.com/rust-lang/rust/pull/129373 [2]
    Fixes: ce4a2620985c ("cfi: add CONFIG_CFI_ICALL_NORMALIZE_INTEGERS")
    Reported-by: kernel test robot <oliver.sang@intel.com>
    Closes: https://lore.kernel.org/oe-lkp/202409231044.4f064459-oliver.sang@intel.com
    Signed-off-by: Alice Ryhl <aliceryhl@google.com>
    Reviewed-by: Sami Tolvanen <samitolvanen@google.com>
    Link: https://lore.kernel.org/r/20240925-cfi-norm-kasan-fix-v1-1-0328985cdf33@google.com
    Signed-off-by: Miguel Ojeda <ojeda@kernel.org>
* | | | move asm/unaligned.h to linux/unaligned.h  (Al Viro, 2024-10-02, 43 files, -43/+43)

    asm/unaligned.h is always an include of asm-generic/unaligned.h;
    might as well move that thing to linux/unaligned.h and include that -
    there's nothing arch-specific in that header.

    auto-generated by the following:

      for i in `git grep -l -w asm/unaligned.h`; do
              sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i
      done
      for i in `git grep -l -w asm-generic/unaligned.h`; do
              sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i
      done
      git mv include/asm-generic/unaligned.h include/linux/unaligned.h
      git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h
      sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild
      sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
* | | | arc: get rid of private asm/unaligned.h  (Al Viro, 2024-10-02, 5 files, -27/+19)

    Declarations local to arch/*/kernel/*.c are better off *not* in a
    public header - arch/arc/kernel/unaligned.h is just fine for those
    bits.

    Unlike the parisc case, here we have an extra twist - asm/mmu.h has
    an implicit dependency on struct pt_regs, and in some users that used
    to be satisfied by include of asm/ptrace.h from asm/unaligned.h (note
    that asm/mmu.h itself did _not_ pull asm/unaligned.h - it relied upon
    the users having pulled asm/unaligned.h before asm/mmu.h got there).
    Seeing that asm/mmu.h only wants struct pt_regs * arguments in an
    extern, just pre-declare it there - less brittle that way.

    With that done _all_ asm/unaligned.h instances are reduced to include
    of asm-generic/unaligned.h and can be removed - unaligned.h is in
    mandatory-y in include/asm-generic/Kbuild.

    What's more, we can move asm-generic/unaligned.h to linux/unaligned.h
    and switch includes of <asm/unaligned.h> to <linux/unaligned.h>;
    that's better off as an auto-generated commit, though, to be done by
    Linus at -rc1 time next cycle.

    Acked-by: Vineet Gupta <vgupta@kernel.org>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | parisc: get rid of private asm/unaligned.h  (Al Viro, 2024-10-02, 4 files, -11/+6)

    Declarations local to arch/*/kernel/*.c are better off *not* in a
    public header - arch/parisc/kernel/unaligned.h is just fine for those
    bits.

    With that done parisc asm/unaligned.h is reduced to include of
    asm-generic/unaligned.h and can be removed - unaligned.h is in
    mandatory-y in include/asm-generic/Kbuild.

    Acked-by: Helge Deller <deller@gmx.de>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | x86: kvm: fix build error  (Linus Torvalds, 2024-09-29, 1 file, -0/+2)

    The cpu_emergency_register_virt_callback() function is used
    unconditionally by the x86 kvm code, but it is declared (and defined)
    conditionally:

      #if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
      void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
      ...

    leading to a build error when neither KVM_INTEL nor KVM_AMD support
    is enabled:

      arch/x86/kvm/x86.c: In function ‘kvm_arch_enable_virtualization’:
      arch/x86/kvm/x86.c:12517:9: error: implicit declaration of function ‘cpu_emergency_register_virt_callback’ [-Wimplicit-function-declaration]
      12517 |         cpu_emergency_register_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

      arch/x86/kvm/x86.c: In function ‘kvm_arch_disable_virtualization’:
      arch/x86/kvm/x86.c:12522:9: error: implicit declaration of function ‘cpu_emergency_unregister_virt_callback’ [-Wimplicit-function-declaration]
      12522 |         cpu_emergency_unregister_virt_callback(kvm_x86_ops.emergency_disable_virtualization_cpu);
            |         ^~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

    Fix the build by defining empty helper functions the same way the old
    cpu_emergency_disable_virtualization() function was dealt with for
    the same situation.

    Maybe we could instead have made the call sites conditional, since
    the callers (kvm_arch_{en,dis}able_virtualization()) have an empty
    weak fallback. I'll leave that to the kvm people to argue about, this
    at least gets the build going for that particular config.

    Fixes: 590b09b1d88e ("KVM: x86: Register "emergency disable" callbacks when virt is enabled")
    Cc: Paolo Bonzini <pbonzini@redhat.com>
    Cc: Sean Christopherson <seanjc@google.com>
    Cc: Kai Huang <kai.huang@intel.com>
    Cc: Chao Gao <chao.gao@intel.com>
    Cc: Farrah Chen <farrah.chen@intel.com>
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
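    The "empty helper" pattern the commit describes, extended from the
    declaration it quotes (illustrative shape, not necessarily the exact
    hunk): provide no-op stubs when neither vendor module is configured,
    so callers always link.

      #if IS_ENABLED(CONFIG_KVM_INTEL) || IS_ENABLED(CONFIG_KVM_AMD)
      void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback);
      void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback);
      #else
      /* No vendor module: nothing to register, make the calls vanish. */
      static inline void cpu_emergency_register_virt_callback(cpu_emergency_virt_cb *callback) {}
      static inline void cpu_emergency_unregister_virt_callback(cpu_emergency_virt_cb *callback) {}
      #endif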
* | | Merge tag 'x86-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2024-09-29, 2 files, -0/+11)

    Pull x86 fixes from Ingo Molnar:
    "Fix TDX MMIO #VE fault handling, and add two new Intel model numbers
    for 'Pantherlake' and 'Diamond Rapids'"

    * tag 'x86-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      x86/cpu: Add two Intel CPU model numbers
      x86/tdx: Fix "in-kernel MMIO" check
| * | | x86/cpu: Add two Intel CPU model numbers  (Tony Luck, 2024-09-26, 1 file, -0/+5)

    Pantherlake is a mobile CPU. Diamond Rapids is the next generation
    Xeon.

    Signed-off-by: Tony Luck <tony.luck@intel.com>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Link: https://lore.kernel.org/all/20240923173750.16874-1-tony.luck%40intel.com
| * | | x86/tdx: Fix "in-kernel MMIO" check  (Alexey Gladkov (Intel), 2024-09-26, 1 file, -0/+6)

    TDX only supports kernel-initiated MMIO operations. The handle_mmio()
    function checks if the #VE exception occurred in the kernel and
    rejects the operation if it did not.

    However, userspace can deceive the kernel into performing MMIO on its
    behalf. For example, if userspace can point a syscall to an MMIO
    address, the syscall does get_user() or put_user() on it, triggering
    MMIO #VE. The kernel will treat the #VE as in-kernel MMIO.

    Ensure that the target MMIO address is within the kernel before
    decoding the instruction.

    Fixes: 31d58c4e557d ("x86/tdx: Handle in-kernel MMIO")
    Signed-off-by: Alexey Gladkov (Intel) <legion@kernel.org>
    Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com>
    Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Acked-by: Dave Hansen <dave.hansen@linux.intel.com>
    Cc: stable@vger.kernel.org
    Link: https://lore.kernel.org/all/565a804b80387970460a4ebc67c88d1380f61ad1.1726237595.git.legion%40kernel.org
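    A hedged sketch of the added guard (fault_in_kernel_space() is an
    existing x86 helper and ve->gla is the guest linear address reported
    with the #VE, but the surrounding code is simplified, not the
    verbatim patch):

      /* Reject the #VE unless the target address is a kernel address, so
       * userspace cannot trick the kernel into doing MMIO on its behalf. */
      if (!fault_in_kernel_space(ve->gla))
              return -EFAULT;
      /* ...only then decode the faulting instruction and emulate MMIO. */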
* | | | Merge tag 'locking-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip  (Linus Torvalds, 2024-09-29, 2 files, -6/+9)

    Pull locking updates from Ingo Molnar:

    "lockdep:

     - Fix potential deadlock between lockdep and RCU (Zhiguo Niu)

     - Use str_plural() to address Coccinelle warning (Thorsten Blum)

     - Add debuggability enhancement (Luis Claudio R. Goncalves)

    static keys & calls:

     - Fix static_key_slow_dec() yet again (Peter Zijlstra)

     - Handle module init failure correctly in static_call_del_module()
       (Thomas Gleixner)

     - Replace pointless WARN_ON() in static_call_module_notify() (Thomas
       Gleixner)

    <linux/cleanup.h>:

     - Add usage and style documentation (Dan Williams)

    rwsems:

     - Move is_rwsem_reader_owned() and rwsem_owner() under
       CONFIG_DEBUG_RWSEMS (Waiman Long)

    atomic ops, x86:

     - Redeclare x86_32 arch_atomic64_{add,sub}() as void (Uros Bizjak)

     - Introduce the read64_nonatomic macro to x86_32 with cx8 (Uros
       Bizjak)"

    Signed-off-by: Ingo Molnar <mingo@kernel.org>

    * tag 'locking-urgent-2024-09-29' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
      locking/rwsem: Move is_rwsem_reader_owned() and rwsem_owner() under CONFIG_DEBUG_RWSEMS
      jump_label: Fix static_key_slow_dec() yet again
      static_call: Replace pointless WARN_ON() in static_call_module_notify()
      static_call: Handle module init failure correctly in static_call_del_module()
      locking/lockdep: Simplify character output in seq_line()
      lockdep: fix deadlock issue between lockdep and rcu
      lockdep: Use str_plural() to fix Coccinelle warning
      cleanup: Add usage and style documentation
      lockdep: suggest the fix for "lockdep bfs error:-1" on print_bfs_bug
      locking/atomic/x86: Redeclare x86_32 arch_atomic64_{add,sub}() as void
      locking/atomic/x86: Introduce the read64_nonatomic macro to x86_32 with cx8
| * \ \ \ Merge branch 'locking/core' into locking/urgent, to pick up pending commits  (Ingo Molnar, 2024-09-29, 2 files, -6/+9)

    Merge all pending locking commits into a single branch.

    Signed-off-by: Ingo Molnar <mingo@kernel.org>
| | * | | | locking/atomic/x86: Redeclare x86_32 arch_atomic64_{add,sub}() as void  (Uros Bizjak, 2024-07-17, 1 file, -4/+2)

    Correct the return type of the x86_32 arch_atomic64_add() and
    arch_atomic64_sub() functions to 'void' and remove the redundant
    return.

    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/r/20240605181424.3228-2-ubizjak@gmail.com
| | * | | | locking/atomic/x86: Introduce the read64_nonatomic macro to x86_32 with cx8  (Uros Bizjak, 2024-07-17, 1 file, -2/+7)

    As described in commit:

      e73c4e34a0e9 ("locking/atomic/x86: Introduce arch_atomic64_read_nonatomic() to x86_32")

    the value preload before the CMPXCHG loop does not need to be atomic.

    Introduce the read64_nonatomic assembly macro to load the value from
    an atomic_t location in a faster non-atomic way and use it in
    atomic64_cx8_32.S.

    Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
    Signed-off-by: Ingo Molnar <mingo@kernel.org>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Link: https://lore.kernel.org/r/20240605181424.3228-1-ubizjak@gmail.com
* | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm  (Linus Torvalds, 2024-09-28, 46 files, -1071/+1368)

    Pull x86 kvm updates from Paolo Bonzini:
    "x86:

     - KVM currently invalidates the entirety of the page tables, not
       just those for the memslot being touched, when a memslot is moved
       or deleted.

       This does not traditionally have particularly noticeable overhead,
       but Intel's TDX will require the guest to re-accept private pages
       if they are dropped from the secure EPT, which is a non-starter.

       Actually, the only reason why this is not already being done is a
       bug which was never fully investigated and caused VM instability
       with assigned GeForce GPUs, so allow userspace to opt into the new
       behavior.

     - Advertise AVX10.1 to userspace (effectively prep work for the
       "real" AVX10 functionality that is on the horizon)

     - Rework common MSR handling code to suppress errors on userspace
       accesses to unsupported-but-advertised MSRs

       This will allow removing (almost?) all of KVM's exemptions for
       userspace access to MSRs that shouldn't exist based on the vCPU
       model (the actual cleanup is non-trivial future work)

     - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC)
       splits the 64-bit value into the legacy ICR and ICR2 storage,
       whereas Intel (APICv) stores the entire 64-bit value at the ICR
       offset

     - Fix a bug where KVM would fail to exit to userspace if one was
       triggered by a fastpath exit handler

     - Add fastpath handling of HLT VM-Exit to expedite re-entering the
       guest when there's already a pending wake event at the time of the
       exit

     - Fix a WARN caused by RSM entering a nested guest from SMM with
       invalid guest state, by forcing the vCPU out of guest mode prior
       to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not
       the nested guest)

     - Overhaul the "unprotect and retry" logic to more precisely
       identify cases where retrying is actually helpful, and to harden
       all retry paths against putting the guest into an infinite retry
       loop

     - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping
       rmaps in the shadow MMU

     - Refactor pieces of the shadow MMU related to aging SPTEs in
       preparation for adding multi-generation LRU support in KVM

     - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is
       enabled, i.e. when the CPU has already flushed the RSB

     - Trace the per-CPU host save area as a VMCB pointer to improve
       readability and clean up the retrieval of the SEV-ES host save
       area

     - Remove unnecessary accounting of temporary nested VMCB related
       allocations

     - Set FINAL/PAGE in the page fault error code for EPT violations if
       and only if the GVA is valid. If the GVA is NOT valid, there is no
       guest-side page table walk and so stuffing paging related metadata
       is nonsensical

     - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
       instead of emulating posted interrupt delivery to L2

     - Add a lockdep assertion to detect unsafe accesses of vmcs12
       structures

     - Harden eVMCS loading against an impossible NULL pointer deref
       (really truly should be impossible)

     - Minor SGX fix and a cleanup

     - Misc cleanups

    Generic:

     - Register KVM's cpuhp and syscore callbacks when enabling
       virtualization in hardware, as the sole purpose of said callbacks
       is to disable and re-enable virtualization as needed

     - Enable virtualization when KVM is loaded, not right before the
       first VM is created

       Together with the previous change, this simplifies a lot the logic
       of the callbacks, because their very existence implies
       virtualization is enabled

     - Fix a bug that results in KVM prematurely exiting to userspace for
       coalesced MMIO/PIO in many cases, clean up the related code, and
       add a testcase

     - Fix a bug in kvm_clear_guest() where it would trigger a buffer
       overflow _if_ the gpa+len crosses a page boundary, which
       thankfully is guaranteed to not happen in the current code base.
       Add WARNs in more helpers that read/write guest memory to detect
       similar bugs

    Selftests:

     - Fix a goof that caused some Hyper-V tests to be skipped when run
       on bare metal, i.e. NOT in a VM

     - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES
       guest

     - Explicitly include one-off assets in .gitignore. Past Sean was
       completely wrong about not being able to detect missing .gitignore
       entries

     - Verify userspace single-stepping works when KVM happens to handle
       a VM-Exit in its fastpath

     - Misc cleanups"

    * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits)
      Documentation: KVM: fix warning in "make htmldocs"
      s390: Enable KVM_S390_UCONTROL config in debug_defconfig
      selftests: kvm: s390: Add VM run test case
      KVM: SVM: let alternatives handle the cases when RSB filling is required
      KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid
      KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent
      KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals
      KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range()
      KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper
      KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed
      KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot
      KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps()
      KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range()
      KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn
      KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list
      KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version
      KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure()
      KVM: x86: Update retry protection fields when forcing retry on emulation failure
      KVM: x86: Apply retry protection to "unprotect on failure" path
      KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn
      ...
| * | | | | Merge tag 'kvm-x86-vmx-6.12' of https://github.com/kvm-x86/linux into HEAD  (Paolo Bonzini, 2024-09-17, 12 files, -37/+105)

    KVM VMX changes for 6.12:

    - Set FINAL/PAGE in the page fault error code for EPT Violations if
      and only if the GVA is valid. If the GVA is NOT valid, there is no
      guest-side page table walk and so stuffing paging related metadata
      is nonsensical.

    - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit
      instead of emulating posted interrupt delivery to L2.

    - Add a lockdep assertion to detect unsafe accesses of vmcs12
      structures.

    - Harden eVMCS loading against an impossible NULL pointer deref
      (really truly should be impossible).

    - Minor SGX fix and a cleanup.
| | * | | | | KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid  (Sean Christopherson, 2024-09-10, 1 file, -2/+3)

    Set PFERR_GUEST_{FINAL,PAGE}_MASK based on
    EPT_VIOLATION_GVA_TRANSLATED if and only if
    EPT_VIOLATION_GVA_IS_VALID is also set in the exit qualification.
    Per the SDM, bit 8 (EPT_VIOLATION_GVA_TRANSLATED) is valid if and
    only if bit 7 (EPT_VIOLATION_GVA_IS_VALID) is set, and is '0' if bit
    7 is '0'.

      Bit 7 (a.k.a. EPT_VIOLATION_GVA_IS_VALID)

      Set if the guest linear-address field is valid. The guest
      linear-address field is valid for all EPT violations except those
      resulting from an attempt to load the guest PDPTEs as part of the
      execution of the MOV CR instruction and those due to trace-address
      pre-translation.

      Bit 8 (a.k.a. EPT_VIOLATION_GVA_TRANSLATED)

      If bit 7 is 1:
      • Set if the access causing the EPT violation is to a
        guest-physical address that is the translation of a linear
        address.
      • Clear if the access causing the EPT violation is to a
        paging-structure entry as part of a page walk or the update of an
        accessed or dirty bit.
      Reserved if bit 7 is 0 (cleared to 0).

    Failure to guard the logic on GVA_IS_VALID results in KVM marking the
    page fault as PFERR_GUEST_PAGE_MASK when there is no known GVA, which
    can put the vCPU into an infinite loop due to kvm_mmu_page_fault()
    getting a false positive on its PFERR_NESTED_GUEST_PAGE logic (though
    only because that logic is also buggy/flawed).

    In practice, this is largely a non-issue because GVA_IS_VALID is
    almost always set. However, when TDX comes along, GVA_IS_VALID will
    *never* be set, as the TDX Module deliberately clears bits 12:7 in
    the exit qualification, e.g. so that the faulting virtual address and
    other metadata that aren't practically useful for the hypervisor
    aren't leaked to the untrusted host.

      When the exit is due to an EPT violation, bits 12:7 of the exit
      qualification are cleared to 0.

    Fixes: eebed2438923 ("kvm: nVMX: Add support for fast unprotection of nested guest page tables")
    Reviewed-by: Yuan Yao <yuan.yao@intel.com>
    Link: https://lore.kernel.org/r/20240831001538.336683-2-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
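    A hedged sketch of the resulting guard in the EPT-violation handler
    (the flag and error-code macros exist in the x86 KVM tree; the hunk
    below is a simplification, not the verbatim patch):

      /* Only derive FINAL/PAGE metadata from bit 8 when bit 7 says the
       * guest linear address is actually valid. */
      if (exit_qualification & EPT_VIOLATION_GVA_IS_VALID)
              error_code |= (exit_qualification & EPT_VIOLATION_GVA_TRANSLATED)
                            ? PFERR_GUEST_FINAL_MASK
                            : PFERR_GUEST_PAGE_MASK;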
| | * | | | | KVM: nVMX: Assert that vcpu->mutex is held when accessing secondary VMCSes  (Sean Christopherson, 2024-09-10, 1 file, -0/+6)

    Add lockdep assertions in get_vmcs12() and get_shadow_vmcs12() to
    verify the vCPU's mutex is held, as the returned VMCS objects are
    dynamically allocated/freed when nested VMX is turned on/off, i.e.
    accessing vmcs12 structures without holding vcpu->mutex is
    susceptible to use-after-free.

    Waive the assertion if the VM is being destroyed, as KVM currently
    forces a nested VM-Exit when freeing the vCPU. If/when that wart is
    fixed, the assertion can/should be converted to an unqualified
    lockdep assertion. See also
    https://lore.kernel.org/all/Zsd0TqCeY3B5Sb5b@google.com.

    Link: https://lore.kernel.org/r/20240906043413.1049633-8-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
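    A hedged sketch of the assertion shape (simplified accessor; the real
    code lives in KVM's nested VMX headers, and the exact lockdep
    expression may differ): require the vCPU mutex unless the VM is
    already being torn down.

      static struct vmcs12 *get_vmcs12_sketch(struct kvm_vcpu *vcpu)
      {
              /* Waived during VM destruction, per the commit message. */
              lockdep_assert_once(lockdep_is_held(&vcpu->mutex) ||
                                  !refcount_read(&vcpu->kvm->users_count));
              return to_vmx(vcpu)->nested.cached_vmcs12;
      }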
| | * | | | | KVM: nVMX: Explicitly invalidate posted_intr_nv if PI is disabled at VM-Enter  (Sean Christopherson, 2024-09-10, 2 files, -2/+11)

    Explicitly invalidate posted_intr_nv when emulating nested VM-Enter
    and posted interrupts are disabled to make it clear that
    posted_intr_nv is valid if and only if nested posted interrupts are
    enabled, and as a cheap way to harden against KVM bugs.

    KVM initializes posted_intr_nv to -1 at vCPU creation and resets it
    to -1 when unloading vmcs12 and/or leaving nested mode, i.e. this is
    not a bug fix (or at least, it's not intended to be a bug fix).

    Note, tracking nested.posted_intr_nv as a u16 subtly adds a measure
    of safety, as it prevents unintentionally matching KVM's informal
    "no IRQ" vector of -1, stored as a signed int. Because a u16 can
    always be represented as a signed int, the effective "invalid" value
    of posted_intr_nv, 65535, will be preserved as-is when comparing
    against an int, i.e. will be zero-extended, not sign-extended, and
    thus won't get a false positive if KVM is buggy and compares
    posted_intr_nv against -1.

    Opportunistically add a comment in
    vmx_deliver_nested_posted_interrupt() to call out that it must check
    vmx->nested.posted_intr_nv, not the vector in vmcs12, which is
    presumably the _entire_ reason nested.posted_intr_nv exists. E.g.
    vmcs12 is a KVM-controlled snapshot, so there are no TOCTOU races to
    worry about; the only potential badness is if the vCPU leaves nested
    and frees vmcs12 between the sender checking is_guest_mode() and
    dereferencing the vmcs12 pointer.

    Link: https://lore.kernel.org/r/20240906043413.1049633-7-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
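    A standalone demonstration of the u16 safety property described above
    (host-side demo, not kernel code): integer promotion zero-extends a
    u16, so the stored "invalid" value 65535 never compares equal to -1.

      #include <stdio.h>

      int main(void)
      {
              unsigned short posted_intr_nv = -1;  /* stored as 65535 */

              /* Promotion to int zero-extends: 65535 != -1, so a buggy
               * comparison against the "no IRQ" sentinel stays false. */
              if (posted_intr_nv == -1)
                      printf("matched -1\n");  /* never taken */
              else
                      printf("no false positive: %u\n", posted_intr_nv);
              return 0;
      }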
| | * | | | | KVM: x86: Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt()  (Sean Christopherson, 2024-09-10, 3 files, -13/+5)

    Fold kvm_get_apic_interrupt() into kvm_cpu_get_interrupt() now that
    nVMX essentially open codes kvm_get_apic_interrupt() in order to
    correctly emulate nested posted interrupts.

    Opportunistically stop exporting kvm_cpu_get_interrupt(), as the
    aforementioned nVMX flow was the only user in vendor code.

    Link: https://lore.kernel.org/r/20240906043413.1049633-6-seanjc@google.com
    Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: nVMX: Detect nested posted interrupt NV at nested VM-Exit injectionSean Christopherson2024-09-101-0/+14
| | | | | | | When synthesizing a nested VM-Exit due to an external interrupt, pend a nested posted interrupt if the external interrupt vector matches L2's PI notification vector, i.e. if the interrupt is a PI notification for L2. This fixes a bug where KVM will incorrectly inject VM-Exit instead of processing a nested posted interrupt when IPI virtualization is enabled.

Per the SDM, detection of the notification vector doesn't occur until the interrupt is acknowledged and delivered to the CPU core:

  If the external-interrupt exiting VM-execution control is 1, any unmasked external interrupt causes a VM exit (see Section 26.2). If the "process posted interrupts" VM-execution control is also 1, this behavior is changed and the processor handles an external interrupt as follows:

    1. The local APIC is acknowledged; this provides the processor core with an interrupt vector, called here the physical vector.
    2. If the physical vector equals the posted-interrupt notification vector, the logical processor continues to the next step. Otherwise, a VM exit occurs as it would normally due to an external interrupt; the vector is saved in the VM-exit interruption-information field.

For the most part, KVM has avoided problems because a PI NV for L2 that arrives while L2 is active will be processed by hardware, and KVM checks for a pending notification vector during nested VM-Enter. Thus, to hit the bug, the PI NV interrupt needs to sneak its way into L1's vIRR while L2 is active.

Without IPI virtualization, the scenario is practically impossible to hit, modulo L1 doing weird things (see below), as the ordering between vmx_deliver_posted_interrupt() and nested VM-Enter effectively guarantees that either the sender will see the vCPU as being in_guest_mode(), or the receiver will see the interrupt in its vIRR.

With IPI virtualization, introduced by commit d588bb9be1da ("KVM: VMX: enable IPI virtualization"), the sending CPU effectively implements a rough equivalent of vmx_deliver_posted_interrupt(), sans the nested PI NV check. If the target vCPU has a valid PID, the CPU will send a PI NV interrupt based on _L1's_ PID, because the sender's IPIv table points at L1 PIDs:
  PIR := 32 bytes at PID_ADDR;            // under lock
  PIR[V] := 1;
  store PIR at PID_ADDR;                  // release lock

  NotifyInfo := 8 bytes at PID_ADDR + 32; // under lock
  IF NotifyInfo.ON = 0 AND NotifyInfo.SN = 0; THEN
      NotifyInfo.ON := 1;
      SendNotify := 1;
  ELSE
      SendNotify := 0;
  FI;
  store NotifyInfo at PID_ADDR + 32;      // release lock

  IF SendNotify = 1; THEN
      send an IPI specified by NotifyInfo.NDST and NotifyInfo.NV;
  FI;

As a result, the target vCPU ends up receiving an interrupt on KVM's POSTED_INTR_VECTOR while L2 is running, with an interrupt in L1's PIR for L2's nested PI NV. The POSTED_INTR_VECTOR interrupt triggers a VM-Exit from L2 to L0, KVM moves the interrupt from L1's PIR to vIRR, triggers a KVM_REQ_EVENT prior to re-entry to L2, and calls vmx_check_nested_events(), effectively bypassing all of KVM's "early" checks on nested PI NV.

Without IPI virtualization, the bug can likely be hit only if L1 programs an assigned device to _post_ an interrupt to L2's notification vector, by way of L1's PID.PIR. Doing so would allow the interrupt to get into L1's vIRR without KVM checking vmcs12's NV. Which is architecturally allowed, but unlikely behavior for a hypervisor.

Cc: Zeng Guang <guang.zeng@intel.com>
Reviewed-by: Chao Gao <chao.gao@intel.com>
Link: https://lore.kernel.org/r/20240906043413.1049633-5-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
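A rough sketch of the injection-site detection, assuming it lands in vmx_check_nested_events() just before the VM-Exit is synthesized (field and helper names follow existing KVM code; the exact control flow is paraphrased):

  	int irq = kvm_apic_has_interrupt(vcpu);

  	if (irq != -1 && irq == vmx->nested.posted_intr_nv) {
  		/*
  		 * The pending IRQ is L2's PI notification vector: pend a
  		 * nested posted interrupt instead of injecting an
  		 * external-interrupt VM-Exit, mirroring the SDM's
  		 * acknowledge-then-detect sequence.
  		 */
  		vmx->nested.pi_pending = true;
  		kvm_apic_clear_irr(vcpu, irq);
  		goto no_vmexit;
  	}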
| | * | | | | KVM: nVMX: Suppress external interrupt VM-Exit injection if there's no IRQSean Christopherson2024-09-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the should-be-impossible scenario that kvm_cpu_get_interrupt() doesn't return a valid vector after checking kvm_cpu_has_interrupt(), skip VM-Exit injection to reduce the probability of crashing/confusing L1. Now that KVM gets the IRQ _before_ calling nested_vmx_vmexit(), squashing the VM-Exit injection is trivial since there are no actions that need to be undone. Reviewed-by: Chao Gao <chao.gao@intel.com> Link: https://lore.kernel.org/r/20240906043413.1049633-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * | | | | KVM: nVMX: Get to-be-acknowledge IRQ for nested VM-Exit at injection siteSean Christopherson2024-09-103-10/+30
| | | | | | | Move the logic to get the to-be-acknowledged IRQ for a nested VM-Exit from nested_vmx_vmexit() to vmx_check_nested_events(), which is subtly the one and only path where KVM invokes nested_vmx_vmexit() with EXIT_REASON_EXTERNAL_INTERRUPT. A future fix will perform a last-minute check on L2's nested posted interrupt notification vector, just before injecting a nested VM-Exit. To handle that scenario correctly, KVM needs to get the interrupt _before_ injecting VM-Exit, as simply querying the highest priority interrupt, via kvm_cpu_has_interrupt(), would result in a TOCTOU bug: a new, higher priority interrupt could arrive between kvm_cpu_has_interrupt() and kvm_cpu_get_interrupt().

Unfortunately, simply moving the call to kvm_cpu_get_interrupt() doesn't suffice, as a VMWRITE to GUEST_INTERRUPT_STATUS.SVI is hiding in kvm_get_apic_interrupt(), and acknowledging the interrupt before nested VM-Exit would cause the VMWRITE to hit vmcs02 instead of vmcs01. Open code a rough equivalent of kvm_cpu_get_interrupt() so that the IRQ is acknowledged after emulating VM-Exit, taking care to avoid the TOCTOU issue described above.

Opportunistically convert the WARN_ON() to a WARN_ON_ONCE(). If KVM has a bug that results in a false positive from kvm_cpu_has_interrupt(), spamming dmesg won't help the situation.

Note, nested_vmx_reflect_vmexit() can never reflect external interrupts as they are always "wanted" by L0.

Link: https://lore.kernel.org/r/20240906043413.1049633-3-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
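Condensed shape of the resulting flow (ordering per the changelog; error handling and the ExtINT case are elided, so this is a sketch rather than the patch itself):

  	if (nested_exit_on_intr(vcpu)) {
  		int irq = kvm_apic_has_interrupt(vcpu);

  		if (WARN_ON_ONCE(irq < 0))
  			goto no_vmexit;

  		/* Emulate the VM-Exit first, so vmcs01 becomes the active VMCS... */
  		nested_vmx_vmexit(vcpu, EXIT_REASON_EXTERNAL_INTERRUPT,
  				  INTR_INFO_VALID_MASK | irq, 0);

  		/* ...then ACK, so the SVI VMWRITE hits vmcs01, not vmcs02. */
  		kvm_apic_ack_interrupt(vcpu, irq);
  		return 0;
  	}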
| | * | | | | KVM: x86: Move "ack" phase of local APIC IRQ delivery to separate APISean Christopherson2024-09-102-4/+14
| | | | | | | Split the "ack" phase, i.e. the movement of an interrupt from IRR=>ISR, out of kvm_get_apic_interrupt() and into a separate API so that nested VMX can acknowledge a specific interrupt _after_ emulating a VM-Exit from L2 to L1.

To correctly emulate nested posted interrupts while APICv is active, KVM must:

  1. find the highest pending interrupt
  2. check if that IRQ is L2's notification vector
  3. emulate VM-Exit if the IRQ is NOT the notification vector
  4. ACK the IRQ in L1 _after_ VM-Exit

When APICv is active, the process of moving the IRQ from the IRR to the ISR also requires a VMWRITE to update vmcs01.GUEST_INTERRUPT_STATUS.SVI, and so acknowledging the interrupt before switching to vmcs01 would result in marking the IRQ as in-service in the wrong VMCS.

KVM currently fudges around this issue by doing kvm_get_apic_interrupt() smack dab in the middle of emulating VM-Exit, but that hack doesn't play nice with nested posted interrupts, as notification vector IRQs don't trigger a VM-Exit in the first place.

Cc: Nathan Chancellor <nathan@kernel.org>
Link: https://lore.kernel.org/r/20240906043413.1049633-2-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
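Sketched as two phases plus the legacy wrapper (signatures inferred from the changelog; details are assumptions):

  	/* Query: find the highest-priority pending vector, -1 if none. */
  	int kvm_apic_has_interrupt(struct kvm_vcpu *vcpu);

  	/* Ack: move the vector from IRR to ISR, updating SVI when APICv is on. */
  	void kvm_apic_ack_interrupt(struct kvm_vcpu *vcpu, int vector);

  	int kvm_get_apic_interrupt(struct kvm_vcpu *vcpu)
  	{
  		int vector = kvm_apic_has_interrupt(vcpu);

  		if (vector != -1)
  			kvm_apic_ack_interrupt(vcpu, vector);
  		return vector;
  	}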
| | * | | | | KVM: VMX: Also clear SGX EDECCSSA in KVM CPU caps when SGX is disabledKai Huang2024-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When SGX EDECCSSA support was added to KVM in commit 16a7fe3728a8 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest"), it forgot to clear the X86_FEATURE_SGX_EDECCSSA bit in KVM CPU caps when KVM SGX is disabled. Fix it. Fixes: 16a7fe3728a8 ("KVM/VMX: Allow exposing EDECCSSA user leaf function to KVM guest") Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240905120837.579102-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
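The fix amounts to one more kvm_cpu_cap_clear() alongside the existing SGX clears when SGX is disabled (the set of neighboring clears is recalled from memory, not quoted from the patch):

  	if (!enable_sgx) {
  		kvm_cpu_cap_clear(X86_FEATURE_SGX);
  		kvm_cpu_cap_clear(X86_FEATURE_SGX_LC);
  		kvm_cpu_cap_clear(X86_FEATURE_SGX1);
  		kvm_cpu_cap_clear(X86_FEATURE_SGX2);
  		kvm_cpu_cap_clear(X86_FEATURE_SGX_EDECCSSA);	/* previously missed */
  	}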
| | * | | | | KVM: VMX: hyper-v: Prevent impossible NULL pointer dereference in evmcs_load()Vitaly Kuznetsov2024-08-221-0/+8
| | | | | | | GCC 12.3.0 complains about a potential NULL pointer dereference in evmcs_load(), as hv_get_vp_assist_page() can return NULL. In fact, this cannot happen because KVM verifies (hv_init_evmcs()) that every CPU has a valid VP assist page and aborts enabling the feature otherwise. The CPU onlining path is likewise checked in vmx_hardware_enable().

To make the compiler happy and to future-proof the code, add a KVM_BUG_ON() sentinel. It doesn't seem possible (or logical) to observe evmcs_load() without an active vCPU, so it is presumed that kvm_get_running_vcpu() can't return NULL.

No functional change intended.

Reported-by: Mirsad Todorovac <mtodorovac69@gmail.com>
Suggested-by: Sean Christopherson <seanjc@google.com>
Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com>
Link: https://lore.kernel.org/r/20240816130124.286226-1-vkuznets@redhat.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
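A sketch of the sentinel described above (the exact bail-out behavior is an assumption):

  	struct hv_vp_assist_page *vp_ap =
  		hv_get_vp_assist_page(smp_processor_id());

  	/*
  	 * Cannot be NULL in practice, as every CPU's VP assist page is
  	 * validated before eVMCS is enabled; the check appeases GCC and
  	 * future-proofs the code.
  	 */
  	if (KVM_BUG_ON(!vp_ap, kvm_get_running_vcpu()->kvm))
  		return;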
| | * | | | | KVM: nVMX: Use vmx_segment_cache_clear() instead of open coded equivalentMaxim Levitsky2024-08-223-5/+7
| | | | | | | In prepare_vmcs02_rare(), call vmx_segment_cache_clear() instead of setting segment_cache.bitmask directly. Using the helper minimizes the chances of prepare_vmcs02_rare() doing the wrong thing in the future, e.g. if KVM ends up doing more than just zeroing the bitmask when purging the cache.

No functional change intended.

Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com>
Link: https://lore.kernel.org/r/20240725175232.337266-2-mlevitsk@redhat.com
[sean: massage changelog]
Signed-off-by: Sean Christopherson <seanjc@google.com>
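For reference, the helper being open coded is, as of this series, a one-liner, which is precisely why a future change to it could silently be missed:

  	static void vmx_segment_cache_clear(struct vcpu_vmx *vmx)
  	{
  		vmx->segment_cache.bitmask = 0;
  	}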
| | * | | | | KVM: nVMX: Honor userspace MSR filter lists for nested VM-Enter/VM-ExitSean Christopherson2024-08-223-8/+12
| | | | | | | Synthesize a consistency check VM-Exit (VM-Enter) or VM-Abort (VM-Exit) if L1 attempts to load/store an MSR via the VMCS MSR lists that userspace has disallowed access to via an MSR filter. Intel already disallows including a handful of "special" MSRs in the VMCS lists, so denying access isn't completely without precedent.

More importantly, the behavior is well-defined _and_ can be communicated to the end user, e.g. to the customer that owns a VM running as L1 on top of KVM. On the other hand, ignoring userspace MSR filters is all but guaranteed to result in unexpected behavior, as the access will hit KVM's internal state, which is likely not up-to-date.

Unlike KVM-internal accesses, instruction emulation, and dedicated VMCS fields, the MSRs in the VMCS load/store lists are 100% guest controlled, thus making it all but impossible to reason about the correctness of ignoring the MSR filter. And if userspace *really* wants to deny access to MSRs via the aforementioned scenarios, userspace can hide the associated feature from the guest, e.g. by disabling the PMU to prevent accessing PERF_GLOBAL_CTRL via its VMCS field. But for the MSR lists, KVM is blindly processing MSRs; the MSR filters are the _only_ way for userspace to deny access.

This partially reverts commit ac8d6cad3c7b ("KVM: x86: Only do MSR filtering when access MSR by rdmsr/wrmsr").

Cc: Hou Wenlong <houwenlong.hwl@antgroup.com>
Cc: Jim Mattson <jmattson@google.com>
Link: https://lore.kernel.org/r/20240722235922.3351122-1-seanjc@google.com
Signed-off-by: Sean Christopherson <seanjc@google.com>
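A hypothetical sketch of the VM-Enter MSR-load path with the filter honored; kvm_msr_allowed() is KVM's existing filter-lookup helper, while the surrounding names (the list entry `e`, the `fail` label) and the failure plumbing are assumptions:

  	/* L1's MSR-load list entry `e`: KVM writes the MSR on the guest's behalf. */
  	if (!kvm_msr_allowed(vcpu, e.index, KVM_MSR_FILTER_WRITE))
  		goto fail;	/* synthesize the consistency-check VM-Exit */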
| | * | | | | KVM: VMX: Do not account for temporary memory allocation in ECREATE emulationKai Huang2024-08-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In handle_encls_ecreate(), a page is allocated to store a copy of SECS structure used by the ENCLS[ECREATE] leaf from the guest. This page is only used temporarily and is freed after use in handle_encls_ecreate(). Don't account for the memory allocation of this page per [1]. Link: https://lore.kernel.org/kvm/b999afeb588eb75d990891855bc6d58861968f23.camel@intel.com/T/#mb81987afc3ab308bbb5861681aa9a20f2aece7fd [1] Signed-off-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240715101224.90958-1-kai.huang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
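A sketch of the allocation in question, assuming the fix simply drops the accounted GFP flag for the short-lived page (variable names and the error value are illustrative):

  	/* Temporary copy of the guest's SECS; freed before returning. */
  	secs = (struct sgx_secs *)__get_free_page(GFP_KERNEL);	/* was GFP_KERNEL_ACCOUNT */
  	if (!secs)
  		return -ENOMEM;

  	/* ... emulate ECREATE using the copy ... */

  	free_page((unsigned long)secs);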
| | * | | | | KVM: VMX: Modify the BUILD_BUG_ON_MSG of the 32-bit field in the ↵Qiang Liu2024-08-221-1/+1
| | | | | | | vmcs_check16 function

According to the SDM, the meaning of field bit 0 is: Access type (0 = full; 1 = high); must be full for 16-bit, 32-bit, and natural-width fields. So there is no such thing as a 32-bit high field; the build-time error message should say "32-bit field" instead.

Signed-off-by: Qiang Liu <liuq131@chinatelecom.cn>
Link: https://lore.kernel.org/r/20240702064609.52487-1-liuq131@chinatelecom.cn
Signed-off-by: Sean Christopherson <seanjc@google.com>
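The corrected assertion, with the field-encoding mask recalled from memory (bits 14:13 give the field width, bit 0 the access type), hence an approximation rather than a quote of the patch:

  	BUILD_BUG_ON_MSG(__builtin_constant_p(field) &&
  			 ((field) & 0x6001) == 0x2000,
  			 "16-bit accessor invalid for 32-bit field");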
| * | | | | | Merge tag 'kvm-x86-svm-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-09-175-27/+47
| |\ \ \ \ \ \ KVM SVM changes for 6.12:

 - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB.

 - Trace the per-CPU host save area as a VMCB pointer to improve readability and clean up the retrieval of the SEV-ES host save area.

 - Remove unnecessary accounting of temporary nested VMCB related allocations.