linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	Merge tag 'kvm-x86-xen-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini	2024-03-11	16	-268/+601
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	KVM Xen and pfncache changes for 6.9: - Rip out the half-baked support for using gfn_to_pfn caches to manage pages that are "mapped" into guests via physical addresses. - Add support for using gfn_to_pfn caches with only a host virtual address, i.e. to bypass the "gfn" stage of the cache. The primary use case is overlay pages, where the guest may change the gfn used to reference the overlay page, but the backing hva+pfn remains the same. - Add an ioctl() to allow mapping Xen's shared_info page using an hva instead of a gpa, so that userspace doesn't need to reconfigure and invalidate the cache/mapping if the guest changes the gpa (but userspace keeps the resolved hva the same). - When possible, use a single host TSC value when computing the deadline for Xen timers in order to improve the accuracy of the timer emulation. - Inject pending upcall events when the vCPU software-enables its APIC to fix a bug where an upcall can be lost (and to follow Xen's behavior). - Fall back to the slow path instead of warning if "fast" IRQ delivery of Xen events fails, e.g. if the guest has aliased xAPIC IDs. - Extend gfn_to_pfn_cache's mutex to cover (de)activation (in addition to refresh), and drop a now-redundant acquisition of xen_lock (that was protecting the shared_info cache) to fix a deadlock due to recursively acquiring xen_lock.
\| *	KVM: x86/xen: fix recursive deadlock in timer injection	David Woodhouse	2024-03-05	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The fast-path timer delivery introduced a recursive locking deadlock when userspace configures a timer which has already expired and is delivered immediately. The call to kvm_xen_inject_timer_irqs() can call to kvm_xen_set_evtchn() which may take kvm->arch.xen.xen_lock, which is already held in kvm_xen_vcpu_get_attr(). ============================================ WARNING: possible recursive locking detected 6.8.0-smp--5e10b4d51d77-drs #232 Tainted: G O -------------------------------------------- xen_shinfo_test/250013 is trying to acquire lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_set_evtchn+0x74/0x170 [kvm] but task is already holding lock: ffff938c9930cc30 (&kvm->arch.xen.xen_lock){+.+.}-{3:3}, at: kvm_xen_vcpu_get_attr+0x38/0x250 [kvm] Now that the gfn_to_pfn_cache has its own self-sufficient locking, its callers no longer need to ensure serialization, so just stop taking kvm->arch.xen.xen_lock from kvm_xen_set_evtchn(). Fixes: 77c9b9dea4fb ("KVM: x86/xen: Use fast path for Xen timer delivery") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-6-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: simplify locking and make more self-contained	David Woodhouse	2024-03-05	1	-10/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The locking on the gfn_to_pfn_cache is... interesting. And awful. There is a rwlock in ->lock which readers take to ensure protection against concurrent changes. But __kvm_gpc_refresh() makes assumptions that certain fields will not change even while it drops the write lock and performs MM operations to revalidate the target PFN and kernel mapping. Commit 93984f19e7bc ("KVM: Fully serialize gfn=>pfn cache refresh via mutex") partly addressed that — not by fixing it, but by adding a new mutex, ->refresh_lock. This prevented concurrent __kvm_gpc_refresh() calls on a given gfn_to_pfn_cache, but is still only a partial solution. There is still a theoretical race where __kvm_gpc_refresh() runs in parallel with kvm_gpc_deactivate(). While __kvm_gpc_refresh() has dropped the write lock, kvm_gpc_deactivate() clears the ->active flag and unmaps ->khva. Then __kvm_gpc_refresh() determines that the previous ->pfn and ->khva are still valid, and reinstalls those values into the structure. This leaves the gfn_to_pfn_cache with the ->valid bit set, but ->active clear. And a ->khva which looks like a reasonable kernel address but is actually unmapped. All it takes is a subsequent reactivation to cause that ->khva to be dereferenced. This would theoretically cause an oops which would look something like this: [1724749.564994] BUG: unable to handle page fault for address: ffffaa3540ace0e0 [1724749.565039] RIP: 0010:__kvm_xen_has_interrupt+0x8b/0xb0 I say "theoretically" because theoretically, that oops that was seen in production cannot happen. The code which uses the gfn_to_pfn_cache is supposed to have its own locking, to further paper over the fact that the gfn_to_pfn_cache's own papering-over (->refresh_lock) of its own rwlock abuse is not sufficient. For the Xen vcpu_info that external lock is the vcpu->mutex, and for the shared info it's kvm->arch.xen.xen_lock. Those locks ought to protect the gfn_to_pfn_cache against concurrent deactivation vs. refresh in all but the cases where the vcpu or kvm object is being destroyed, in which case the subsequent reactivation should never happen. Theoretically. Nevertheless, this locking abuse is awful and should be fixed, even if no clear explanation can be found for how the oops happened. So expand the use of the ->refresh_lock mutex to ensure serialization of activate/deactivate vs. refresh and make the pfncache locking entirely self-sufficient. This means that a future commit can simplify the locking in the callers, such as the Xen emulation code which has an outstanding problem with recursive locking of kvm->arch.xen.xen_lock, which will no longer be necessary. The rwlock abuse described above is still not best practice, although it's harmless now that the ->refresh_lock is held for the entire duration while the offending code drops the write lock, does some other stuff, then takes the write lock again and assumes nothing changed. That can also be fixed^W cleaned up in a subsequent commit, but this commit is a simpler basis for the Xen deadlock fix mentioned above. Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-5-dwmw2@infradead.org [sean: use guard(mutex) to fix a missed unlock] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: remove WARN_ON_ONCE() with false positives in evtchn delivery	David Woodhouse	2024-03-05	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The kvm_xen_inject_vcpu_vector() function has a comment saying "the fast version will always work for physical unicast", justifying its use of kvm_irq_delivery_to_apic_fast() and the WARN_ON_ONCE() when that fails. In fact that assumption isn't true if X2APIC isn't in use by the guest and there is (8-bit x)APIC ID aliasing. A single "unicast" destination APIC ID may then be delivered to multiple vCPUs. Remove the warning, and in fact it might as well just call kvm_irq_delivery_to_apic(). Reported-by: Michal Luczaj <mhal@rbox.co> Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-4-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: inject vCPU upcall vector when local APIC is enabled	David Woodhouse	2024-03-05	3	-2/+23
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Linux guests since commit b1c3497e604d ("x86/xen: Add support for HVMOP_set_evtchn_upcall_vector") in v6.0 onwards will use the per-vCPU upcall vector when it's advertised in the Xen CPUID leaves. This upcall is injected through the guest's local APIC as an MSI, unlike the older system vector which was merely injected by the hypervisor any time the CPU was able to receive an interrupt and the upcall_pending flags is set in its vcpu_info. Effectively, that makes the per-CPU upcall edge triggered instead of level triggered, which results in the upcall being lost if the MSI is delivered when the local APIC is disabled. Xen checks the vcpu_info->evtchn_upcall_pending flag when the local APIC for a vCPU is software enabled (in fact, on any write to the SPIV register which doesn't disable the APIC). Do the same in KVM since KVM doesn't provide a way for userspace to intervene and trap accesses to the SPIV register of a local APIC emulated by KVM. Fixes: fde0451be8fb3 ("KVM: x86/xen: Support per-vCPU event channel upcall via local APIC") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/20240227115648.3104-3-dwmw2@infradead.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: improve accuracy of Xen timers	David Woodhouse	2024-03-05	3	-40/+152
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A test program such as http://david.woodhou.se/timerlat.c confirms user reports that timers are increasingly inaccurate as the lifetime of a guest increases. Reporting the actual delay observed when asking for 100µs of sleep, it starts off OK on a newly-launched guest but gets worse over time, giving incorrect sleep times: root@ip-10-0-193-21:~# ./timerlat -c -n 5 00000000 latency 103243/100000 (3.2430%) 00000001 latency 103243/100000 (3.2430%) 00000002 latency 103242/100000 (3.2420%) 00000003 latency 103245/100000 (3.2450%) 00000004 latency 103245/100000 (3.2450%) The biggest problem is that get_kvmclock_ns() returns inaccurate values when the guest TSC is scaled. The guest sees a TSC value scaled from the host TSC by a mul/shift conversion (hopefully done in hardware). The guest then converts that guest TSC value into nanoseconds using the mul/shift conversion given to it by the KVM pvclock information. But get_kvmclock_ns() performs only a single conversion directly from host TSC to nanoseconds, giving a different result. A test program at http://david.woodhou.se/tsdrift.c demonstrates the cumulative error over a day. It's non-trivial to fix get_kvmclock_ns(), although I'll come back to that. The actual guest hv_clock is per-CPU, and theoretically each vCPU could be running at a different frequency. But this patch is needed anyway because... The other issue with Xen timers was that the code would snapshot the host CLOCK_MONOTONIC at some point in time, and then... after a few interrupts may have occurred, some preemption perhaps... would also read the guest's kvmclock. Then it would proceed under the false assumption that those two happened at the same time. Any time which actually elapsed between reading the two clocks was introduced as inaccuracies in the time at which the timer fired. Fix it to use a variant of kvm_get_time_and_clockread(), which reads the host TSC just once, then use the returned TSC value to calculate the kvmclock (making sure to do that the way the guest would instead of making the same mistake get_kvmclock_ns() does). Sadly, hrtimers based on CLOCK_MONOTONIC_RAW are not supported, so Xen timers still have to use CLOCK_MONOTONIC. In practice the difference between the two won't matter over the timescales involved, as the absolute values don't matter; just the delta. This does mean a new variant of kvm_get_time_and_clockread() is needed; called kvm_get_monotonic_and_clockread() because that's what it does. Fixes: 536395260582 ("KVM: x86/xen: handle PV timers oneshot mode") Signed-off-by: David Woodhouse <dwmw@amazon.co.uk> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240227115648.3104-2-dwmw2@infradead.org [sean: massage moved comment, tweak if statement formatting] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: allow vcpu_info content to be 'safely' copied	Paul Durrant	2024-02-22	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the guest sets an explicit vcpu_info GPA then, for any of the first 32 vCPUs, the content of the default vcpu_info in the shared_info page must be copied into the new location. Because this copy may race with event delivery (which updates the 'evtchn_pending_sel' field in vcpu_info), event delivery needs to be deferred until the copy is complete. Happily there is already a shadow of 'evtchn_pending_sel' in kvm_vcpu_xen that is used in atomic context if the vcpu_info PFN cache has been invalidated so that the update of vcpu_info can be deferred until the cache can be refreshed (on vCPU thread's the way back into guest context). Use this shadow if the vcpu_info cache has been deactivated, so that the VMM can safely copy the vcpu_info content and then re-activate the cache with the new GPA. To do this, stop considering an inactive vcpu_info cache as a hard error in kvm_xen_set_evtchn_fast(), and let the existing kvm_gpc_check() fail and kick the vCPU (if necessary). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-21-paul@xen.org [sean: add a bit of verbosity to the changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: check the need for invalidation under read lock first	Paul Durrant	2024-02-22	1	-3/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When processing mmu_notifier invalidations for gpc caches, pre-check for overlap with the invalidation event while holding gpc->lock for read, and only take gpc->lock for write if the cache needs to be invalidated. Doing a pre-check without taking gpc->lock for write avoids unnecessarily contending the lock for unrelated invalidations, which is very beneficial for caches that are heavily used (but rarely subjected to mmu_notifier invalidations). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-20-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: advertize the KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA capability	Paul Durrant	2024-02-22	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that all relevant kernel changes and selftests are in place, enable the new capability. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-17-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: selftests: re-map Xen's vcpu_info using HVA rather than GPA	Paul Durrant	2024-02-22	1	-0/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the relevant capability (KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA) is present then re-map vcpu_info using the HVA part way through the tests to make sure then there is no functional change. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-16-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: selftests: map Xen's shared_info page using HVA rather than GFN	Paul Durrant	2024-02-22	1	-9/+35
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using the HVA of the shared_info page is more efficient, so if the capability (KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA) is present use that method to do the mapping. NOTE: Have the juggle_shinfo_state() thread map and unmap using both GFN and HVA, to make sure the older mechanism is not broken. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-15-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: allow vcpu_info to be mapped by fixed HVA	Paul Durrant	2024-02-22	3	-12/+52
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the guest does not explicitly set the GPA of vcpu_info structure in memory then, for guests with 32 vCPUs or fewer, the vcpu_info embedded in the shared_info page may be used. As described in a previous commit, the shared_info page is an overlay at a fixed HVA within the VMM, so in this case it also more optimal to activate the vcpu_info cache with a fixed HVA to avoid unnecessary invalidation if the guest memory layout is modified. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-14-paul@xen.org [sean: use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: allow shared_info to be mapped by fixed HVA	Paul Durrant	2024-02-22	3	-15/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The shared_info page is not guest memory as such. It is a dedicated page allocated by the VMM and overlaid onto guest memory in a GFN chosen by the guest and specified in the XENMEM_add_to_physmap hypercall. The guest may even request that shared_info be moved from one GFN to another by re-issuing that hypercall, but the HVA is never going to change. Because the shared_info page is an overlay the memory slots need to be updated in response to the hypercall. However, memory slot adjustment is not atomic and, whilst all vCPUs are paused, there is still the possibility that events may be delivered (which requires the shared_info page to be updated) whilst the shared_info GPA is absent. The HVA is never absent though, so it makes much more sense to use that as the basis for the kernel's mapping. Hence add a new KVM_XEN_ATTR_TYPE_SHARED_INFO_HVA attribute type for this purpose and a KVM_XEN_HVM_CONFIG_SHARED_INFO_HVA flag to advertize its availability. Don't actually advertize it yet though. That will be done in a subsequent patch, which will also add tests for the new attribute type. Also update the KVM API documentation with the new attribute and also fix it up to consistently refer to 'shared_info' (with the underscore). Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-13-paul@xen.org [sean: store "hva" as a user pointer, use kvm_gpc_is_{gpa,hva}_active()] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: re-initialize shared_info if guest (32/64-bit) mode is set	Paul Durrant	2024-02-20	1	-3/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If the shared_info PFN cache has already been initialized then the content of the shared_info page needs to be re-initialized whenever the guest mode is (re)set. Setting the guest mode is either done explicitly by the VMM via the KVM_XEN_ATTR_TYPE_LONG_MODE attribute, or implicitly when the guest writes the MSR to set up the hypercall page. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-12-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: separate initialization of shared_info cache and content	Paul Durrant	2024-02-20	1	-23/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A subsequent patch will allow shared_info to be initialized using either a GPA or a user-space (i.e. VMM) HVA. To make that patch cleaner, separate the initialization of the shared_info content from the activation of the pfncache. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-11-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: allow a cache to be activated with a fixed (userspace) HVA	Paul Durrant	2024-02-20	2	-28/+101
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some pfncache pages may actually be overlays on guest memory that have a fixed HVA within the VMM. It's pointless to invalidate such cached mappings if the overlay is moved so allow a cache to be activated directly with the HVA to cater for such cases. A subsequent patch will make use of this facility. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-10-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: s390: Refactor kvm_is_error_gpa() into kvm_is_gpa_in_memslot()	Sean Christopherson	2024-02-20	6	-15/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rename kvm_is_error_gpa() to kvm_is_gpa_in_memslot() and invert the polarity accordingly in order to (a) free up kvm_is_error_gpa() to match with kvm_is_error_{hva,page}(), and (b) to make it more obvious that the helper is doing a memslot lookup, i.e. not simply checking for INVALID_GPA. No functional change intended. Link: https://lore.kernel.org/r/20240215152916.1158-9-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: include page offset in uhva and use it consistently	Paul Durrant	2024-02-20	1	-8/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the pfncache page offset is sometimes determined using the gpa and sometimes the khva, whilst the uhva is always page-aligned. After a subsequent patch is applied the gpa will not always be valid so adjust the code to include the page offset in the uhva and use it consistently as the source of truth. Also, where a page-aligned address is required, use PAGE_ALIGN_DOWN() for clarity. No functional change intended. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-8-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: stop open-coding offset_in_page()	Paul Durrant	2024-02-20	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Some code in pfncache uses offset_in_page() but in other places it is open- coded. Use offset_in_page() consistently everywhere. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-7-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: remove KVM_GUEST_USES_PFN usage	Paul Durrant	2024-02-20	5	-80/+16
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	As noted in [1] the KVM_GUEST_USES_PFN usage flag is never set by any callers of kvm_gpc_init(), and for good reason: the implementation is incomplete/broken. And it's not clear that there will ever be a user of KVM_GUEST_USES_PFN, as coordinating vCPUs with mmu_notifier events is non-trivial. Remove KVM_GUEST_USES_PFN and all related code, e.g. dropping KVM_GUEST_USES_PFN also makes the 'vcpu' argument redundant, to avoid having to reason about broken code as __kvm_gpc_refresh() evolves. Moreover, all existing callers specify KVM_HOST_USES_PFN so the usage check in hva_to_pfn_retry() and hence the 'usage' argument to kvm_gpc_init() are also redundant. [1] https://lore.kernel.org/all/ZQiR8IpqOZrOpzHC@google.com Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-6-paul@xen.org [sean: explicitly call out that guest usage is incomplete] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: add a mark-dirty helper	Paul Durrant	2024-02-20	3	-4/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At the moment pages are marked dirty by open-coded calls to mark_page_dirty_in_slot(), directly deferefencing the gpa and memslot from the cache. After a subsequent patch these may not always be set so add a helper now so that caller will protected from the need to know about this detail. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-5-paul@xen.org [sean: decrease indentation, use gpa_to_gfn()] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: x86/xen: mark guest pages dirty with the pfncache lock held	Paul Durrant	2024-02-20	1	-7/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Sampling gpa and memslot from an unlocked pfncache may yield inconsistent values so, since there is no problem with calling mark_page_dirty_in_slot() with the pfncache lock held, relocate the calls in kvm_xen_update_runstate_guest() and kvm_xen_inject_pending_events() accordingly. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-4-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: remove unnecessary exports	Paul Durrant	2024-02-20	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is no need for the existing kvm_gpc_XXX() functions to be exported. Clean up now before additional functions are added in subsequent patches. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-3-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
\| *	KVM: pfncache: Add a map helper function	Paul Durrant	2024-02-20	1	-18/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a pfncache unmap helper but mapping is open-coded. Arguably this is fine because mapping is done in only one place, hva_to_pfn_retry(), but adding the helper does make that function more readable. No functional change intended. Signed-off-by: Paul Durrant <pdurrant@amazon.com> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240215152916.1158-2-paul@xen.org Signed-off-by: Sean Christopherson <seanjc@google.com>
* \|	Merge tag 'kvm-x86-pmu-6.9' of https://github.com/kvm-x86/linux into HEAD	Paolo Bonzini	2024-03-11	23	-398/+1262
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	KVM x86 PMU changes for 6.9: - Fix several bugs where KVM speciously prevents the guest from utilizing fixed counters and architectural event encodings based on whether or not guest CPUID reports support for the _architectural_ encoding. - Fix a variety of bugs in KVM's emulation of RDPMC, e.g. for "fast" reads, priority of VMX interception vs #GP, PMC types in architectural PMUs, etc. - Add a selftest to verify KVM correctly emulates RDMPC, counter availability, and a variety of other PMC-related behaviors that depend on guest CPUID, i.e. are difficult to validate via KVM-Unit-Tests. - Zero out PMU metadata on AMD if the virtual PMU is disabled to avoid wasting cycles, e.g. when checking if a PMC event needs to be synthesized when skipping an instruction. - Optimize triggering of emulated events, e.g. for "count instructions" events when skipping an instruction, which yields a ~10% performance improvement in VM-Exit microbenchmarks when a vPMU is exposed to the guest. - Tighten the check for "PMI in guest" to reduce false positives if an NMI arrives in the host while KVM is handling an IRQ VM-Exit.
\| * \|	KVM: x86/pmu: Explicitly check NMI from guest to reducee false positives	Like Xu	2024-02-27	2	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Explicitly check that the source of external interrupt is indeed an NMI in kvm_arch_pmi_in_guest(), which reduces perf-kvm false positive samples (host samples labelled as guest samples) generated by perf/core NMI mode if an NMI arrives after VM-Exit, but before kvm_after_interrupt(): # test: perf-record + cpu-cycles:HP (which collects host-only precise samples) # Symbol Overhead sys usr guest sys guest usr # ....................................... ........ ........ ........ ......... ......... # # Before: [g] entry_SYSCALL_64 24.63% 0.00% 0.00% 24.63% 0.00% [g] syscall_return_via_sysret 23.23% 0.00% 0.00% 23.23% 0.00% [g] files_lookup_fd_raw 6.35% 0.00% 0.00% 6.35% 0.00% # After: [k] perf_adjust_freq_unthr_context 57.23% 57.23% 0.00% 0.00% 0.00% [k] __vmx_vcpu_run 4.09% 4.09% 0.00% 0.00% 0.00% [k] vmx_update_host_rsp 3.17% 3.17% 0.00% 0.00% 0.00% In the above case, perf records the samples labelled '[g]', the RIPs behind the weird samples are actually being queried by perf_instruction_pointer() after determining whether it's in GUEST state or not, and here's the issue: If VM-Exit is caused by a non-NMI interrupt (such as hrtimer_interrupt) and at least one PMU counter is enabled on host, the kvm_arch_pmi_in_guest() will remain true (KVM_HANDLING_IRQ is set) until kvm_before_interrupt(). During this window, if a PMI occurs on host (since the KVM instructions on host are being executed), the control flow, with the help of the host NMI context, will be transferred to perf/core to generate performance samples, thus perf_instruction_pointer() and perf_guest_get_ip() is called. Since kvm_arch_pmi_in_guest() only checks if there is an interrupt, it may cause perf/core to mistakenly assume that the source RIP of the host NMI belongs to the guest world and use perf_guest_get_ip() to get the RIP of a vCPU that has already exited by a non-NMI interrupt. Error samples are recorded and presented to the end-user via perf-report. Such false positive samples could be eliminated by explicitly determining if the exit reason is KVM_HANDLING_NMI. Note that when VM-exit is indeed triggered by PMI and before HANDLING_NMI is cleared, it's also still possible that another PMI is generated on host. Also for perf/core timer mode, the false positives are still possible since those non-NMI sources of interrupts are not always being used by perf/core. For events that are host-only, perf/core can and should eliminate false positives by checking event->attr.exclude_guest, i.e. events that are configured to exclude KVM guests should never fire in the guest. Events that are configured to count host and guest are trickier, perhaps impossible to handle with 100% accuracy? And regardless of what accuracy is provided by perf/core, improving KVM's accuracy is cheap and easy, with no real downsides. Fixes: dd60d217062f ("KVM: x86: Fix perf timer mode IP reporting") Signed-off-by: Like Xu <likexu@tencent.com> Reviewed-by: Maxim Levitsky <mlevitsk@redhat.com> Link: https://lore.kernel.org/r/20231206032054.55070-1-likexu@tencent.com [sean: massage changelog, squash !!in_nmi() fixup from Like] Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Test top-down slots event in x86's pmu_counters_test	Dapeng Mi	2024-02-21	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although the fixed counter 3 and its exclusive pseudo slots event are not supported by KVM yet, the architectural slots event is supported by KVM and can be programmed on any GP counter. Thus add validation for this architectural slots event. Top-down slots event "counts the total number of available slots for an unhalted logical processor, and increments by machine-width of the narrowest pipeline as employed by the Top-down Microarchitecture Analysis method." As for the slot, it's an abstract concept which indicates how many uops (decoded from instructions) can be processed simultaneously (per cycle) on HW. In Top-down Microarchitecture Analysis (TMA) method, the processor is divided into two parts, frond-end and back-end. Assume there is a processor with classic 5-stage pipeline, fetch, decode, execute, memory access and register writeback. The former 2 stages (fetch/decode) are classified to frond-end and the latter 3 stages are classified to back-end. In modern Intel processors, a complicated instruction would be decoded into several uops (micro-operations) and so these uops can be processed simultaneously and then improve the performance. Thus, assume a processor can decode and dispatch 4 uops in front-end and execute 4 uops in back-end simultaneously (per-cycle), so the machine-width of this processor is 4 and this processor has 4 topdown slots per-cycle. If a slot is spare and can be used to process a new upcoming uop, then the slot is available, but if a uop occupies a slot for several cycles and can't be retired (maybe blocked by memory access), then this slot is stall and unavailable. Considering the testing instruction sequence can't be macro-fused on x86 platforms, the measured slots count should not be less than NUM_INSNS_RETIRED. Thus assert the slots count against NUM_INSNS_RETIRED. pmu_counters_test passed with this patch on Intel Sapphire Rapids. About the more information about TMA method, please refer the below link. https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-0/top-down-microarchitecture-analysis-method.html Signed-off-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240218043003.2424683-1-dapeng1.mi@linux.intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Avoid CPL lookup if PMC enabline for USER and KERNEL is the same	Sean Christopherson	2024-02-01	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Don't bother querying the CPL if a PMC is (not) counting for both USER and KERNEL, i.e. if the end result is guaranteed to be the same regardless of the CPL. Querying the CPL on Intel requires a VMREAD, i.e. isn't free, and a single CMP+Jcc is cheap. Link: https://lore.kernel.org/r/20231110022857.1273836-11-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Check eventsel first when emulating (branch) insns retired	Sean Christopherson	2024-02-01	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When triggering events, i.e. emulating PMC events in software, check for a matching event selector before checking the event is allowed. The "is allowed" check might be cheap, but it could also be very costly, e.g. if userspace has defined a large PMU event filter. The event selector check on the other hand is all but guaranteed to be <10 uops, e.g. looks something like: 0xffffffff8105e615 <+5>: movabs $0xf0000ffff,%rax 0xffffffff8105e61f <+15>: xor %rdi,%rsi 0xffffffff8105e622 <+18>: test %rax,%rsi 0xffffffff8105e625 <+21>: sete %al Link: https://lore.kernel.org/r/20231110022857.1273836-10-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Expand the comment about what bits are check emulating events	Sean Christopherson	2024-02-01	1	-1/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Expand the comment about what bits are and aren't checked when emulating PMC events in software. As pointed out by Jim, AMD's mask includes bits 35:32, which on Intel overlap with the IN_TX and IN_TXCP bits (32 and 33) as well as reserved bits (34 and 45). Checking The IN_TX* bits is actually correct, as it's safe to assert that the vCPU can't be in an HLE/RTM transaction if KVM is emulating an instruction, i.e. KVM shouldn't count if either of those bits is set. For the reserved bits, KVM is has equal odds of being right if Intel adds new behavior, i.e. ignoring them is just as likely to be correct as checking them. Opportunistically explain why* the other flags aren't checked. Suggested-by: Jim Mattson <jmattson@google.com> Link: https://lore.kernel.org/r/20231110022857.1273836-9-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Snapshot event selectors that KVM emulates in software	Sean Christopherson	2024-02-01	4	-14/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Snapshot the event selectors for the events that KVM emulates in software, which is currently instructions retired and branch instructions retired. The event selectors a tied to the underlying CPU, i.e. are constant for a given platform even though perf doesn't manage the mappings as such. Getting the event selectors from perf isn't exactly cheap, especially if mitigations are enabled, as at least one indirect call is involved. Snapshot the values in KVM instead of optimizing perf as working with the raw event selectors will be required if KVM ever wants to emulate events that aren't part of perf's uABI, i.e. that don't have an "enum perf_hw_id" entry. Link: https://lore.kernel.org/r/20231110022857.1273836-8-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Process only enabled PMCs when emulating events in software	Sean Christopherson	2024-02-01	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Mask off disabled counters based on PERF_GLOBAL_CTRL before iterating over PMCs to emulate (branch) instruction required events in software. In the common case where the guest isn't utilizing the PMU, pre-checking for enabled counters turns a relatively expensive search into a few AND uops and a Jcc. Sadly, PMUs without PERF_GLOBAL_CTRL, e.g. most existing AMD CPUs, are out of luck as there is no way to check that a PMC isn't being used without checking the PMC's event selector. Cc: Konstantin Khorenko <khorenko@virtuozzo.com> Link: https://lore.kernel.org/r/20231110022857.1273836-7-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Add macros to iterate over all PMCs given a bitmap	Sean Christopherson	2024-02-01	3	-24/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add and use kvm_for_each_pmc() to dedup a variety of open coded for-loops that iterate over valid PMCs given a bitmap (and because seeing checkpatch whine about bad macro style is always amusing). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-6-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Snapshot and clear reprogramming bitmap before reprogramming	Sean Christopherson	2024-02-01	2	-23/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Refactor the handling of the reprogramming bitmap to snapshot and clear to-be-processed bits before doing the reprogramming, and then explicitly set bits for PMCs that need to be reprogrammed (again). This will allow adding a macro to iterate over all valid PMCs without having to add special handling for the reprogramming bit, which (a) can have bits set for non-existent PMCs and (b) needs to clear such bits to avoid wasting cycles in perpetuity. Note, the existing behavior of clearing bits after reprogramming does NOT have a race with kvm_vm_ioctl_set_pmu_event_filter(). Setting a new PMU filter synchronizes SRCU _before_ setting the bitmap, i.e. guarantees that the vCPU isn't in the middle of reprogramming with a stale filter prior to setting the bitmap. Link: https://lore.kernel.org/r/20231110022857.1273836-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Move pmc_idx => pmc translation helper to common code	Sean Christopherson	2024-02-01	5	-24/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a common helper for internal PMC lookups, and delete the ops hook and Intel's implementation. Keep AMD's implementation, but rename it to amd_pmu_get_pmc() to make it somewhat more obvious that it's suited for both KVM-internal and guest-initiated lookups. Because KVM tracks all counters in a single bitmap, getting a counter when iterating over a bitmap, e.g. of all valid PMCs, requires a small amount of math, that while simple, isn't super obvious and doesn't use the same semantics as PMC lookups from RDPMC! Although AMD doesn't support fixed counters, the common PMU code still behaves as if there a split, the high half of which just happens to always be empty. Opportunstically add a comment to explain both what is going on, and why KVM uses a single bitmap, e.g. the boilerplate for iterating over separate bitmaps could be done via macros, so it's not (just) about deduplicating code. Link: https://lore.kernel.org/r/20231110022857.1273836-4-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Add common define to capture fixed counters offset	Sean Christopherson	2024-02-01	3	-11/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a common define to "officially" solidify KVM's split of counters, i.e. to commit to using bits 31:0 to track general purpose counters and bits 63:32 to track fixed counters (which only Intel supports). KVM already bleeds this behavior all over common PMU code, and adding a KVM- defined macro allows clarifying that the value is a _base_, as oppposed to the _flag_ that is used to access fixed PMCs via RDPMC (which perf confusingly calls INTEL_PMC_FIXED_RDPMC_BASE). No functional change intended. Link: https://lore.kernel.org/r/20231110022857.1273836-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: x86/pmu: Zero out PMU metadata on AMD if PMU is disabled	Sean Christopherson	2024-02-01	2	-16/+20
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the purging of common PMU metadata from intel_pmu_refresh() to kvm_pmu_refresh(), and invoke the vendor refresh() hook if and only if the VM is supposed to have a vPMU. KVM already denies access to the PMU based on kvm->arch.enable_pmu, as get_gp_pmc_amd() returns NULL for all PMCs in that case, i.e. KVM already violates AMD's architecture by not virtualizing a PMU (kernels have long since learned to not panic when the PMU is unavailable). But configuring the PMU as if it were enabled causes unwanted side effects, e.g. calls to kvm_pmu_trigger_event() waste an absurd number of cycles due to the all_valid_pmc_idx bitmap being non-zero. Fixes: b1d66dad65dc ("KVM: x86/svm: Add module param to control PMU virtualization") Reported-by: Konstantin Khorenko <khorenko@virtuozzo.com> Closes: https://lore.kernel.org/all/20231109180646.2963718-2-khorenko@virtuozzo.com Link: https://lore.kernel.org/r/20231110022857.1273836-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Extend PMU counters test to validate RDPMC after WRMSR	Sean Christopherson	2024-01-31	1	-0/+41
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the read/write PMU counters subtest to verify that RDPMC also reads back the written value. Opportunsitically verify that attempting to use the "fast" mode of RDPMC fails, as the "fast" flag is only supported by non-architectural PMUs, which KVM doesn't virtualize. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-30-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Add helpers for safe and safe+forced RDMSR, RDPMC, and XGETBV	Sean Christopherson	2024-01-31	1	-12/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add helpers for safe and safe-with-forced-emulations versions of RDMSR, RDPMC, and XGETBV. Use macro shenanigans to eliminate the rather large amount of boilerplate needed to get values in and out of registers. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-29-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Add a forced emulation variation of KVM_ASM_SAFE()	Sean Christopherson	2024-01-31	1	-2/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add KVM_ASM_SAFE_FEP() to allow forcing emulation on an instruction that might fault. Note, KVM skips RIP past the FEP prefix before injecting an exception, i.e. the fixup needs to be on the instruction itself. Do not check for FEP support, that is firmly the responsibility of whatever code wants to use KVM_ASM_SAFE_FEP(). Sadly, chaining variadic arguments that contain commas doesn't work, thus the unfortunate amount of copy+paste. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-28-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Test PMC virtualization with forced emulation	Sean Christopherson	2024-01-31	1	-14/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the PMC counters test to use forced emulation to verify that KVM emulates counter events for instructions retired and branches retired. Force emulation for only a subset of the measured code to test that KVM does the right thing when mixing perf events with emulated events. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-27-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Move KVM_FEP macro into common library header	Sean Christopherson	2024-01-31	2	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the KVM_FEP definition, a.k.a. the KVM force emulation prefix, into processor.h so that it can be used for other tests besides the MSR filter test. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-26-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Query module param to detect FEP in MSR filtering test	Sean Christopherson	2024-01-31	2	-18/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a helper to detect KVM support for forced emulation by querying the module param, and use the helper to detect support for the MSR filtering test instead of throwing a noodle/NOP at KVM to see if it sticks. Cc: Aaron Lewis <aaronlewis@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-25-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Add helpers to read integer module params	Sean Christopherson	2024-01-31	2	-6/+60
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add helpers to read integer module params, which is painfully non-trivial because the pain of dealing with strings in C is exacerbated by the kernel inserting a newline. Don't bother differentiating between int, uint, short, etc. They all fit in an int, and KVM (thankfully) doesn't have any integer params larger than an int. Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-24-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Add a helper to query if the PMU module param is enabled	Sean Christopherson	2024-01-31	4	-3/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a helper to probe KVM's "enable_pmu" param, open coding strings in multiple places is just asking for false negatives and/or runtime errors due to typos. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-23-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Expand PMU counters test to verify LLC events	Sean Christopherson	2024-01-31	1	-19/+40
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Expand the PMU counters test to verify that LLC references and misses have non-zero counts when the code being executed while the LLC event(s) is active is evicted via CFLUSH{,OPT}. Note, CLFLUSH{,OPT} requires a fence of some kind to ensure the cache lines are flushed before execution continues. Use MFENCE for simplicity (performance is not a concern). Suggested-by: Jim Mattson <jmattson@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-22-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Add functional test for Intel's fixed PMU counters	Jinrong Liang	2024-01-31	1	-1/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the fixed counters test to verify that supported counters can actually be enabled in the control MSRs, that unsupported counters cannot, and that enabled counters actually count. Co-developed-by: Like Xu <likexu@tencent.com> Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Jinrong Liang <cloudliang@tencent.com> [sean: fold into the rd/wr access test, massage changelog] Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-21-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Test consistency of CPUID with num of fixed counters	Jinrong Liang	2024-01-31	1	-3/+57
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the PMU counters test to verify KVM emulation of fixed counters in addition to general purpose counters. Fixed counters add an extra wrinkle in the form of an extra supported bitmask. Thus quoth the SDM: fixed-function performance counter 'i' is supported if ECX[i] \|\| (EDX[4:0] > i) Test that KVM handles a counter being available through either method. Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Co-developed-by: Like Xu <likexu@tencent.com> Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-20-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Test consistency of CPUID with num of gp counters	Jinrong Liang	2024-01-31	1	-0/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a test to verify that KVM correctly emulates MSR-based accesses to general purpose counters based on guest CPUID, e.g. that accesses to non-existent counters #GP and accesses to existent counters succeed. Note, for compatibility reasons, KVM does not emulate #GP when MSR_P6_PERFCTR[0\|1] is not present (writes should be dropped). Co-developed-by: Like Xu <likexu@tencent.com> Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-19-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
\| * \|	KVM: selftests: Test Intel PMU architectural events on fixed counters	Jinrong Liang	2024-01-31	1	-9/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Extend the PMU counters test to validate architectural events using fixed counters. The core logic is largely the same, the biggest difference being that if a fixed counter exists, its associated event is available (the SDM doesn't explicitly state this to be true, but it's KVM's ABI and letting software program a fixed counter that doesn't actually count would be quite bizarre). Note, fixed counters rely on PERF_GLOBAL_CTRL. Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Co-developed-by: Like Xu <likexu@tencent.com> Signed-off-by: Like Xu <likexu@tencent.com> Signed-off-by: Jinrong Liang <cloudliang@tencent.com> Co-developed-by: Sean Christopherson <seanjc@google.com> Tested-by: Dapeng Mi <dapeng1.mi@linux.intel.com> Link: https://lore.kernel.org/r/20240109230250.424295-18-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>