summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kvm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 266Thomas Gleixner2019-06-0527-327/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program if not write to the free software foundation 51 franklin street fifth floor boston ma 02110 1301 usa extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 67 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Alexios Zavras <alexios.zavras@intel.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190529141333.953658117@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-06-028-115/+143
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "Fixes for PPC and s390" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry() KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9 KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pages KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslots KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interrupts KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE device KVM: PPC: Book3S HV: XIVE: Fix the enforced limit on the vCPU identifier KVM: PPC: Book3S HV: XIVE: Do not test the EQ flag validity when resetting KVM: PPC: Book3S HV: XIVE: Clear file mapping when device is released KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpu KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token list KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setup KVM: PPC: Book3S HV: Avoid touching arch.mmu_ready in XIVE release functions KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_ID kvm: fix compile on s390 part 2
| * Merge tag 'kvm-s390-master-5.2-2' of ↵Paolo Bonzini2019-06-011-0/+3
| |\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into kvm-master KVM: s390: Fixes - fix compilation for !CONFIG_PCI - fix the output of KVM_CAP_MAX_VCPU_ID
| | * KVM: s390: Do not report unusabled IDs via KVM_CAP_MAX_VCPU_IDThomas Huth2019-05-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM_CAP_MAX_VCPU_ID is currently always reporting KVM_MAX_VCPU_ID on all architectures. However, on s390x, the amount of usable CPUs is determined during runtime - it is depending on the features of the machine the code is running on. Since we are using the vcpu_id as an index into the SCA structures that are defined by the hardware (see e.g. the sca_add_vcpu() function), it is not only the amount of CPUs that is limited by the hard- ware, but also the range of IDs that we can use. Thus KVM_CAP_MAX_VCPU_ID must be determined during runtime on s390x, too. So the handling of KVM_CAP_MAX_VCPU_ID has to be moved from the common code into the architecture specific code, and on s390x we have to return the same value here as for KVM_CAP_MAX_VCPUS. This problem has been discovered with the kvm_create_max_vcpus selftest. With this change applied, the selftest now passes on s390x, too. Reviewed-by: Andrew Jones <drjones@redhat.com> Reviewed-by: Cornelia Huck <cohuck@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Thomas Huth <thuth@redhat.com> Message-Id: <20190523164309.13345-9-thuth@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com>
| * | KVM: PPC: Book3S HV: Restore SPRG3 in kvmhv_p9_guest_entry()Suraj Jitindar Singh2019-05-301-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sprgs are a set of 4 general purpose sprs provided for software use. SPRG3 is special in that it can also be read from userspace. Thus it is used on linux to store the cpu and numa id of the process to speed up syscall access to this information. This register is overwritten with the guest value on kvm guest entry, and so needs to be restored on exit again. Thus restore the value on the guest exit path in kvmhv_p9_guest_entry(). Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Fix lockdep warning when entering guest on POWER9Paul Mackerras2019-05-301-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 3309bec85e60 ("KVM: PPC: Book3S HV: Fix lockdep warning when entering the guest") moved calls to trace_hardirqs_{on,off} in the entry path used for HPT guests. Similar code exists in the new streamlined entry path used for radix guests on POWER9. This makes the same change there, so as to avoid lockdep warnings such as this: [ 228.686461] DEBUG_LOCKS_WARN_ON(current->hardirqs_enabled) [ 228.686480] WARNING: CPU: 116 PID: 3803 at ../kernel/locking/lockdep.c:4219 check_flags.part.23+0x21c/0x270 [ 228.686544] Modules linked in: vhost_net vhost xt_CHECKSUM iptable_mangle xt_MASQUERADE iptable_nat nf_nat +xt_conntrack nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc ebtable_filter +ebtables ip6table_filter ip6_tables iptable_filter fuse kvm_hv kvm at24 ipmi_powernv regmap_i2c ipmi_devintf +uio_pdrv_genirq ofpart ipmi_msghandler uio powernv_flash mtd ibmpowernv opal_prd ip_tables ext4 mbcache jbd2 btrfs +zstd_decompress zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor +raid6_pq raid1 raid0 ses sd_mod enclosure scsi_transport_sas ast i2c_opal i2c_algo_bit drm_kms_helper syscopyarea +sysfillrect sysimgblt fb_sys_fops ttm drm i40e e1000e cxl aacraid tg3 drm_panel_orientation_quirks i2c_core [ 228.686859] CPU: 116 PID: 3803 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc1-xive+ #42 [ 228.686911] NIP: c0000000001b394c LR: c0000000001b3948 CTR: c000000000bfad20 [ 228.686963] REGS: c000200cdb50f570 TRAP: 0700 Not tainted (5.2.0-rc1-xive+) [ 228.687001] MSR: 9000000002823033 <SF,HV,VEC,VSX,FP,ME,IR,DR,RI,LE> CR: 48222222 XER: 20040000 [ 228.687060] CFAR: c000000000116db0 IRQMASK: 1 [ 228.687060] GPR00: c0000000001b3948 c000200cdb50f800 c0000000015e7600 000000000000002e [ 228.687060] GPR04: 0000000000000001 c0000000001c71a0 000000006e655f73 72727563284e4f5f [ 228.687060] GPR08: 0000200e60680000 0000000000000000 c000200cdb486180 0000000000000000 [ 228.687060] GPR12: 0000000000002000 c000200fff61a680 0000000000000000 00007fffb75c0000 [ 228.687060] GPR16: 0000000000000000 0000000000000000 c0000000017d6900 c000000001124900 [ 228.687060] GPR20: 0000000000000074 c008000006916f68 0000000000000074 0000000000000074 [ 228.687060] GPR24: ffffffffffffffff ffffffffffffffff 0000000000000003 c000200d4b600000 [ 228.687060] GPR28: c000000001627e58 c000000001489908 c000000001627e58 c000000002304de0 [ 228.687377] NIP [c0000000001b394c] check_flags.part.23+0x21c/0x270 [ 228.687415] LR [c0000000001b3948] check_flags.part.23+0x218/0x270 [ 228.687466] Call Trace: [ 228.687488] [c000200cdb50f800] [c0000000001b3948] check_flags.part.23+0x218/0x270 (unreliable) [ 228.687542] [c000200cdb50f870] [c0000000001b6548] lock_is_held_type+0x188/0x1c0 [ 228.687595] [c000200cdb50f8d0] [c0000000001d939c] rcu_read_lock_sched_held+0xdc/0x100 [ 228.687646] [c000200cdb50f900] [c0000000001dd704] rcu_note_context_switch+0x304/0x340 [ 228.687701] [c000200cdb50f940] [c0080000068fcc58] kvmhv_run_single_vcpu+0xdb0/0x1120 [kvm_hv] [ 228.687756] [c000200cdb50fa20] [c0080000068fd5b0] kvmppc_vcpu_run_hv+0x5e8/0xe40 [kvm_hv] [ 228.687816] [c000200cdb50faf0] [c0080000071797dc] kvmppc_vcpu_run+0x34/0x48 [kvm] [ 228.687863] [c000200cdb50fb10] [c0080000071755dc] kvm_arch_vcpu_ioctl_run+0x244/0x420 [kvm] [ 228.687916] [c000200cdb50fba0] [c008000007165ccc] kvm_vcpu_ioctl+0x424/0x838 [kvm] [ 228.687957] [c000200cdb50fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 228.687995] [c000200cdb50fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 228.688033] [c000200cdb50fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 228.688072] [c000200cdb50fe20] [c00000000000b888] system_call+0x5c/0x70 [ 228.688109] Instruction dump: [ 228.688142] 4bf6342d 60000000 0fe00000 e8010080 7c0803a6 4bfffe60 3c82ff87 3c62ff87 [ 228.688196] 388472d0 3863d738 4bf63405 60000000 <0fe00000> 4bffff4c 3c82ff87 3c62ff87 [ 228.688251] irq event stamp: 205 [ 228.688287] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688344] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688412] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688464] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 [ 228.688513] ---[ end trace eb16f6260022a812 ]--- [ 228.688548] possible reason: unannotated irqs-off. [ 228.688571] irq event stamp: 205 [ 228.688607] hardirqs last enabled at (205): [<c0080000068fc1b4>] kvmhv_run_single_vcpu+0x30c/0x1120 [kvm_hv] [ 228.688664] hardirqs last disabled at (204): [<c0080000068fbff0>] kvmhv_run_single_vcpu+0x148/0x1120 [kvm_hv] [ 228.688719] softirqs last enabled at (180): [<c000000000c0b2ac>] __do_softirq+0x4ac/0x5d4 [ 228.688758] softirqs last disabled at (169): [<c000000000122aa8>] irq_exit+0x1f8/0x210 Cc: stable@vger.kernel.org # v4.20+ Fixes: 95a6432ce903 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Tested-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Fix page offset when clearing ESB pagesCédric Le Goater2019-05-301-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Under XIVE, the ESB pages of an interrupt are used for interrupt management (EOI) and triggering. They are made available to guests through a mapping of the XIVE KVM device. When a device is passed-through, the passthru_irq helpers, kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), clear the ESB pages of the guest IRQ number being mapped and let the VM fault handler repopulate with the correct page. The ESB pages are mapped at offset 4 (KVM_XIVE_ESB_PAGE_OFFSET) in the KVM device mapping. Unfortunately, this offset was not taken into account when clearing the pages. This lead to issues with the passthrough devices for which the interrupts were not functional under some guest configuration (tg3 and single CPU) or in any configuration (e1000e adapter). Reviewed-by: Greg Kurz <groug@kaod.org> Tested-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Take the srcu read lock when accessing memslotsCédric Le Goater2019-05-301-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to Documentation/virtual/kvm/locking.txt, the srcu read lock should be taken when accessing the memslots of the VM. The XIVE KVM device needs to do so when configuring the page of the OS event queue of vCPU for a given priority and when marking the same page dirty before migration. This avoids warnings such as : [ 208.224882] ============================= [ 208.224884] WARNING: suspicious RCU usage [ 208.224889] 5.2.0-rc2-xive+ #47 Not tainted [ 208.224890] ----------------------------- [ 208.224894] ../include/linux/kvm_host.h:633 suspicious rcu_dereference_check() usage! [ 208.224896] other info that might help us debug this: [ 208.224898] rcu_scheduler_active = 2, debug_locks = 1 [ 208.224901] no locks held by qemu-system-ppc/3923. [ 208.224902] stack backtrace: [ 208.224907] CPU: 64 PID: 3923 Comm: qemu-system-ppc Kdump: loaded Not tainted 5.2.0-rc2-xive+ #47 [ 208.224909] Call Trace: [ 208.224918] [c000200cdd98fa30] [c000000000be1934] dump_stack+0xe8/0x164 (unreliable) [ 208.224924] [c000200cdd98fa80] [c0000000001aec80] lockdep_rcu_suspicious+0x110/0x180 [ 208.224935] [c000200cdd98fb00] [c0080000075933a0] gfn_to_memslot+0x1c8/0x200 [kvm] [ 208.224943] [c000200cdd98fb40] [c008000007599600] gfn_to_pfn+0x28/0x60 [kvm] [ 208.224951] [c000200cdd98fb70] [c008000007599658] gfn_to_page+0x20/0x40 [kvm] [ 208.224959] [c000200cdd98fb90] [c0080000075b495c] kvmppc_xive_native_set_attr+0x8b4/0x1480 [kvm] [ 208.224967] [c000200cdd98fca0] [c00800000759261c] kvm_device_ioctl_attr+0x64/0xb0 [kvm] [ 208.224974] [c000200cdd98fcf0] [c008000007592730] kvm_device_ioctl+0xc8/0x110 [kvm] [ 208.224979] [c000200cdd98fd10] [c000000000433a24] do_vfs_ioctl+0xd4/0xcd0 [ 208.224981] [c000200cdd98fdb0] [c000000000434724] ksys_ioctl+0x104/0x120 [ 208.224984] [c000200cdd98fe00] [c000000000434768] sys_ioctl+0x28/0x80 [ 208.224988] [c000200cdd98fe20] [c00000000000b888] system_call+0x5c/0x70 legoater@boss01:~$ Fixes: 13ce3297c576 ("KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configuration") Fixes: e6714bd1671d ("KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pages") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Do not clear IRQ data of passthrough interruptsCédric Le Goater2019-05-301-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The passthrough interrupts are defined at the host level and their IRQ data should not be cleared unless specifically deconfigured (shutdown) by the host. They differ from the IPI interrupts which are allocated by the XIVE KVM device and reserved to the guest usage only. This fixes a host crash when destroying a VM in which a PCI adapter was passed-through. In this case, the interrupt is cleared and freed by the KVM device and then shutdown by vfio at the host level. [ 1007.360265] BUG: Kernel NULL pointer dereference at 0x00000d00 [ 1007.360285] Faulting instruction address: 0xc00000000009da34 [ 1007.360296] Oops: Kernel access of bad area, sig: 7 [#1] [ 1007.360303] LE PAGE_SIZE=64K MMU=Radix MMU=Hash SMP NR_CPUS=2048 NUMA PowerNV [ 1007.360314] Modules linked in: vhost_net vhost iptable_mangle ipt_MASQUERADE iptable_nat nf_nat xt_conntrack nf_conntrack nf_defrag_ipv4 ipt_REJECT nf_reject_ipv4 tun bridge stp llc kvm_hv kvm xt_tcpudp iptable_filter squashfs fuse binfmt_misc vmx_crypto ib_iser rdma_cm iw_cm ib_cm libiscsi scsi_transport_iscsi nfsd ip_tables x_tables autofs4 btrfs zstd_decompress zstd_compress lzo_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx xor raid6_pq multipath mlx5_ib ib_uverbs ib_core crc32c_vpmsum mlx5_core [ 1007.360425] CPU: 9 PID: 15576 Comm: CPU 18/KVM Kdump: loaded Not tainted 5.1.0-gad7e7d0ef #4 [ 1007.360454] NIP: c00000000009da34 LR: c00000000009e50c CTR: c00000000009e5d0 [ 1007.360482] REGS: c000007f24ccf330 TRAP: 0300 Not tainted (5.1.0-gad7e7d0ef) [ 1007.360500] MSR: 900000000280b033 <SF,HV,VEC,VSX,EE,FP,ME,IR,DR,RI,LE> CR: 24002484 XER: 00000000 [ 1007.360532] CFAR: c00000000009da10 DAR: 0000000000000d00 DSISR: 00080000 IRQMASK: 1 [ 1007.360532] GPR00: c00000000009e62c c000007f24ccf5c0 c000000001510600 c000007fe7f947c0 [ 1007.360532] GPR04: 0000000000000d00 0000000000000000 0000000000000000 c000005eff02d200 [ 1007.360532] GPR08: 0000000000400000 0000000000000000 0000000000000000 fffffffffffffffd [ 1007.360532] GPR12: c00000000009e5d0 c000007fffff7b00 0000000000000031 000000012c345718 [ 1007.360532] GPR16: 0000000000000000 0000000000000008 0000000000418004 0000000000040100 [ 1007.360532] GPR20: 0000000000000000 0000000008430000 00000000003c0000 0000000000000027 [ 1007.360532] GPR24: 00000000000000ff 0000000000000000 00000000000000ff c000007faa90d98c [ 1007.360532] GPR28: c000007faa90da40 00000000000fe040 ffffffffffffffff c000007fe7f947c0 [ 1007.360689] NIP [c00000000009da34] xive_esb_read+0x34/0x120 [ 1007.360706] LR [c00000000009e50c] xive_do_source_set_mask.part.0+0x2c/0x50 [ 1007.360732] Call Trace: [ 1007.360738] [c000007f24ccf5c0] [c000000000a6383c] snooze_loop+0x15c/0x270 (unreliable) [ 1007.360775] [c000007f24ccf5f0] [c00000000009e62c] xive_irq_shutdown+0x5c/0xe0 [ 1007.360795] [c000007f24ccf630] [c00000000019e4a0] irq_shutdown+0x60/0xe0 [ 1007.360813] [c000007f24ccf660] [c000000000198c44] __free_irq+0x3a4/0x420 [ 1007.360831] [c000007f24ccf700] [c000000000198dc8] free_irq+0x78/0xe0 [ 1007.360849] [c000007f24ccf730] [c00000000096c5a8] vfio_msi_set_vector_signal+0xa8/0x350 [ 1007.360878] [c000007f24ccf7f0] [c00000000096c938] vfio_msi_set_block+0xe8/0x1e0 [ 1007.360899] [c000007f24ccf850] [c00000000096cae0] vfio_msi_disable+0xb0/0x110 [ 1007.360912] [c000007f24ccf8a0] [c00000000096cd04] vfio_pci_set_msi_trigger+0x1c4/0x3d0 [ 1007.360922] [c000007f24ccf910] [c00000000096d910] vfio_pci_set_irqs_ioctl+0xa0/0x170 [ 1007.360941] [c000007f24ccf930] [c00000000096b400] vfio_pci_disable+0x80/0x5e0 [ 1007.360963] [c000007f24ccfa10] [c00000000096b9bc] vfio_pci_release+0x5c/0x90 [ 1007.360991] [c000007f24ccfa40] [c000000000963a9c] vfio_device_fops_release+0x3c/0x70 [ 1007.361012] [c000007f24ccfa70] [c0000000003b5668] __fput+0xc8/0x2b0 [ 1007.361040] [c000007f24ccfac0] [c0000000001409b0] task_work_run+0x140/0x1b0 [ 1007.361059] [c000007f24ccfb20] [c000000000118f8c] do_exit+0x3ac/0xd00 [ 1007.361076] [c000007f24ccfc00] [c0000000001199b0] do_group_exit+0x60/0x100 [ 1007.361094] [c000007f24ccfc40] [c00000000012b514] get_signal+0x1a4/0x8f0 [ 1007.361112] [c000007f24ccfd30] [c000000000021cc8] do_notify_resume+0x1a8/0x430 [ 1007.361141] [c000007f24ccfe20] [c00000000000e444] ret_from_except_lite+0x70/0x74 [ 1007.361159] Instruction dump: [ 1007.361175] 38422c00 e9230000 712a0004 41820010 548a2036 7d442378 78840020 71290020 [ 1007.361194] 4082004c e9230010 7c892214 7c0004ac <e9240000> 0c090000 4c00012c 792a0022 Cc: stable@vger.kernel.org # v4.12+ Fixes: 5af50993850a ("KVM: PPC: Book3S HV: Native usage of the XIVE interrupt controller") Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Introduce a new mutex for the XIVE deviceCédric Le Goater2019-05-303-17/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The XICS-on-XIVE KVM device needs to allocate XIVE event queues when a priority is used by the OS. This is referred as EQ provisioning and it is done under the hood when : 1. a CPU is hot-plugged in the VM 2. the "set-xive" is called at VM startup 3. sources are restored at VM restore The kvm->lock mutex is used to protect the different XIVE structures being modified but in some contexts, kvm->lock is taken under the vcpu->mutex which is not permitted by the KVM locking rules. Introduce a new mutex 'lock' for the KVM devices for them to synchronize accesses to the XIVE device structures. Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Fix the enforced limit on the vCPU identifierCédric Le Goater2019-05-291-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a vCPU is connected to the KVM device, it is done using its vCPU identifier in the guest. Fix the enforced limit on the vCPU identifier by taking into account the SMT mode. Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Do not test the EQ flag validity when resettingCédric Le Goater2019-05-291-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a CPU is hot-unplugged, the EQ is deconfigured using a zero size and a zero address. In this case, there is no need to check the flag and queue size validity. Move the checks after the queue reset code section to fix CPU hot-unplug. Reported-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Tested-by: Satheesh Rajendran <sathnaga@linux.vnet.ibm.com> Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: Greg Kurz <groug@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: XIVE: Clear file mapping when device is releasedCédric Le Goater2019-05-291-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Improve the release of the XIVE KVM device by clearing the file address_space, which is used to unmap the interrupt ESB pages when a device is passed-through. Suggested-by: Paul Mackerras <paulus@ozlabs.org> Signed-off-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Don't take kvm->lock around kvm_for_each_vcpuPaul Mackerras2019-05-291-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the HV KVM code takes the kvm->lock around calls to kvm_for_each_vcpu() and kvm_get_vcpu_by_id() (which can call kvm_for_each_vcpu() internally). However, that leads to a lock order inversion problem, because these are called in contexts where the vcpu mutex is held, but the vcpu mutexes nest within kvm->lock according to Documentation/virtual/kvm/locking.txt. Hence there is a possibility of deadlock. To fix this, we simply don't take the kvm->lock mutex around these calls. This is safe because the implementations of kvm_for_each_vcpu() and kvm_get_vcpu_by_id() have been designed to be able to be called locklessly. Signed-off-by: Paul Mackerras <paulus@ozlabs.org> Reviewed-by: Cédric Le Goater <clg@kaod.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S: Use new mutex to synchronize access to rtas token listPaul Mackerras2019-05-292-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the Book 3S KVM code uses kvm->lock to synchronize access to the kvm->arch.rtas_tokens list. Because this list is scanned inside kvmppc_rtas_hcall(), which is called with the vcpu mutex held, taking kvm->lock cause a lock inversion problem, which could lead to a deadlock. To fix this, we add a new mutex, kvm->arch.rtas_token_lock, which nests inside the vcpu mutexes, and use that instead of kvm->lock when accessing the rtas token list. This removes the lockdep_assert_held() in kvmppc_rtas_tokens_free(). At this point we don't hold the new mutex, but that is OK because kvmppc_rtas_tokens_free() is only called when the whole VM is being destroyed, and at that point nothing can be looking up a token in the list. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Use new mutex to synchronize MMU setupPaul Mackerras2019-05-292-26/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the HV KVM code uses kvm->lock in conjunction with a flag, kvm->arch.mmu_ready, to synchronize MMU setup and hold off vcpu execution until the MMU-related data structures are ready. However, this means that kvm->lock is being taken inside vcpu->mutex, which is contrary to Documentation/virtual/kvm/locking.txt and results in lockdep warnings. To fix this, we add a new mutex, kvm->arch.mmu_setup_lock, which nests inside the vcpu mutexes, and is taken in the places where kvm->lock was taken that are related to MMU setup. Additionally we take the new mutex in the vcpu creation code at the point where we are creating a new vcore, in order to provide mutual exclusion with kvmppc_update_lpcr() and ensure that an update to kvm->arch.lpcr doesn't get missed, which could otherwise lead to a stale vcore->lpcr value. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| * | KVM: PPC: Book3S HV: Avoid touching arch.mmu_ready in XIVE release functionsPaul Mackerras2019-05-292-32/+24
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, kvmppc_xive_release() and kvmppc_xive_native_release() clear kvm->arch.mmu_ready and call kick_all_cpus_sync() as a way of ensuring that no vcpus are executing in the guest. However, future patches will change the mutex associated with kvm->arch.mmu_ready to a new mutex that nests inside the vcpu mutexes, making it difficult to continue to use this method. In fact, taking the vcpu mutex for a vcpu excludes execution of that vcpu, and we already take the vcpu mutex around the call to kvmppc_xive_[native_]cleanup_vcpu(). Once the cleanup function is done and we release the vcpu mutex, the vcpu can execute once again, but because we have cleared vcpu->arch.xive_vcpu, vcpu->arch.irq_type, vcpu->arch.xive_esc_vaddr and vcpu->arch.xive_esc_raddr, that vcpu will not be going into XIVE code any more. Thus, once we have cleaned up all of the vcpus, we are safe to clean up the rest of the XIVE state, and we don't need to use kvm->arch.mmu_ready to hold off vcpu execution. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
* | treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 174Thomas Gleixner2019-05-302-17/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license version 2 as published by the free software foundation this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details extracted by the scancode license scanner the SPDX license identifier GPL-2.0-only has been chosen to replace the boilerplate/reference in 655 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Kate Stewart <kstewart@linuxfoundation.org> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070034.575739538@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 152Thomas Gleixner2019-05-301-6/+1
|/ | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3029 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190527070032.746973796@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 114Thomas Gleixner2019-05-241-13/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your option any later version this program is distributed in the hope that it will be useful but without any warranty without even the implied warranty of merchantability or fitness for a particular purpose see the gnu general public license for more details you should have received a copy of the gnu general public license along with this program extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 8 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Allison Randal <allison@lohutok.net> Reviewed-by: Richard Fontana <rfontana@redhat.com> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190523091650.663497195@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* treewide: Replace GPLv2 boilerplate/reference with SPDX - rule 98Thomas Gleixner2019-05-241-6/+1
| | | | | | | | | | | | | | | | | | | | | | Based on 1 normalized pattern(s): this program is free software you can redistribute it and or modify it under the terms of the gnu general public license as published by the free software foundation either version 2 of the license or at your optional any later version of the license extracted by the scancode license scanner the SPDX license identifier GPL-2.0-or-later has been chosen to replace the boilerplate/reference in 3 file(s). Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Reviewed-by: Richard Fontana <rfontana@redhat.com> Reviewed-by: Allison Randal <allison@lohutok.net> Cc: linux-spdx@vger.kernel.org Link: https://lkml.kernel.org/r/20190520075212.713472955@linutronix.de Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-05-1713-269/+2069
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "ARM: - support for SVE and Pointer Authentication in guests - PMU improvements POWER: - support for direct access to the POWER9 XIVE interrupt controller - memory and performance optimizations x86: - support for accessing memory not backed by struct page - fixes and refactoring Generic: - dirty page tracking improvements" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (155 commits) kvm: fix compilation on aarch64 Revert "KVM: nVMX: Expose RDPMC-exiting only when guest supports PMU" kvm: x86: Fix L1TF mitigation for shadow MMU KVM: nVMX: Disable intercept for FS/GS base MSRs in vmcs02 when possible KVM: PPC: Book3S: Remove useless checks in 'release' method of KVM device KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing" KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUs kvm: nVMX: Set nested_run_pending in vmx_set_nested_state after checks complete tests: kvm: Add tests for KVM_SET_NESTED_STATE KVM: nVMX: KVM_SET_NESTED_STATE - Tear down old EVMCS state before setting new state tests: kvm: Add tests for KVM_CAP_MAX_VCPUS and KVM_CAP_MAX_CPU_ID tests: kvm: Add tests to .gitignore KVM: Introduce KVM_CAP_MANUAL_DIRTY_LOG_PROTECT2 KVM: Fix kvm_clear_dirty_log_protect off-by-(minus-)one KVM: Fix the bitmap range to copy during clear dirty KVM: arm64: Fix ptrauth ID register masking logic KVM: x86: use direct accessors for RIP and RSP KVM: VMX: Use accessors for GPRs outside of dedicated caching logic KVM: x86: Omit caching logic for always-available GPRs kvm, x86: Properly check whether a pfn is an MMIO or not ...
| * Merge tag 'kvm-ppc-next-5.2-2' of ↵Paolo Bonzini2019-05-1513-281/+2090
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/paulus/powerpc into HEAD PPC KVM update for 5.2 * Support for guests to access the new POWER9 XIVE interrupt controller hardware directly, reducing interrupt latency and overhead for guests. * In-kernel implementation of the H_PAGE_INIT hypercall. * Reduce memory usage of sparsely-populated IOMMU tables. * Several bug fixes. Second PPC KVM update for 5.2 * Fix a bug, fix a spelling mistake, remove some useless code.
| | * KVM: PPC: Book3S HV: XIVE: Fix spelling mistake "acessing" -> "accessing"Colin Ian King2019-05-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | There is a spelling mistake in a pr_err message, fix it. Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Mukesh Ojha <mojha@codeaurora.org> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: Make sure to load LPID for radix VCPUsPaul Mackerras2019-05-141-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 70ea13f6e609 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads", 2019-04-29) aimed to make radix guests that are using the real-mode entry path load the LPID register and flush the TLB in the same place where those things are done for HPT guests. However, it omitted to remove a branch which branches around that code for radix guests. The result is that with indep_thread_mode = N, radix guests don't run correctly. (With indep_threads_mode = Y, which is the default, radix guests use a different entry path.) This removes the offending branch, and also the load and compare that the branch depends on, since the cr7 setting is now unused. Reported-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Tested-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Fixes: 70ea13f6e609 ("KVM: PPC: Book3S HV: Flush TLB on secondary radix threads") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Clear escalation interrupt pointers on device closePaul Mackerras2019-04-301-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds code to ensure that after a XIVE or XICS-on-XIVE KVM device is closed, KVM will not try to enable or disable any of the escalation interrupts for the VCPUs. We don't have to worry about races between clearing the pointers and use of the pointers by the XIVE context push/pull code, because the callers hold the vcpu->mutex, which is also taken by the KVM_RUN code. Therefore the vcpu cannot be entering or exiting the guest concurrently. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Prevent races when releasing devicePaul Mackerras2019-04-302-16/+78
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have the possibility of a XIVE or XICS-on-XIVE device being released while the VM is still running, we need to be careful about races and potential use-after-free bugs. Although the kvmppc_xive struct is not freed, but kept around for re-use, the kvmppc_xive_vcpu structs are freed, and they are used extensively in both the XIVE native and XICS-on-XIVE code. There are various ways in which XIVE code gets invoked: - VCPU entry and exit, which do push and pull operations on the XIVE hardware - one_reg get and set functions (vcpu->mutex is held) - XICS hypercalls (but only inside guest execution, not from kvmppc_pseries_do_hcall) - device creation calls (kvm->lock is held) - device callbacks - get/set attribute, mmap, pagefault, release/destroy - set_mapped/clr_mapped calls (kvm->lock is held) - connect_vcpu calls - debugfs file read callbacks Inside a device release function, we know that userspace cannot have an open file descriptor referring to the device, nor can it have any mmapped regions from the device. Therefore the device callbacks are excluded, as are the connect_vcpu calls (since they need a fd for the device). Further, since the caller holds the kvm->lock mutex, no other device creation calls or set/clr_mapped calls can be executing concurrently. To exclude VCPU execution and XICS hypercalls, we temporarily set kvm->arch.mmu_ready to 0. This forces any VCPU task that is trying to enter the guest to take the kvm->lock mutex, which is held by the caller of the release function. Then, sending an IPI to all other CPUs forces any VCPU currently executing in the guest to exit. Finally, we take the vcpu->mutex for each VCPU around the process of cleaning up and freeing its XIVE data structures, in order to exclude any one_reg get/set calls. To exclude the debugfs read callbacks, we just need to ensure that debugfs_remove is called before freeing any data structures. Once it returns we know that no CPU can be executing the callbacks (for our kvmppc_xive instance). Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Replace the 'destroy' method by a 'release' methodCédric Le Goater2019-04-304-11/+96
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a P9 sPAPR VM boots, the CAS negotiation process determines which interrupt mode to use (XICS legacy or XIVE native) and invokes a machine reset to activate the chosen mode. We introduce 'release' methods for the XICS-on-XIVE and the XIVE native KVM devices which are called when the file descriptor of the device is closed after the TIMA and ESB pages have been unmapped. They perform the necessary cleanups : clear the vCPU interrupt presenters that could be attached and then destroy the device. The 'release' methods replace the 'destroy' methods as 'destroy' is not called anymore once 'release' is. Compatibility with older QEMU is nevertheless maintained. This is not considered as a safe operation as the vCPUs are still running and could be referencing the KVM device through their presenters. To protect the system from any breakage, the kvmppc_xive objects representing both KVM devices are now stored in an array under the VM. Allocation is performed on first usage and memory is freed only when the VM exits. [paulus@ozlabs.org - Moved freeing of xive structures to book3s.c, put it under #ifdef CONFIG_KVM_XICS.] Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Activate XIVE exploitation modeCédric Le Goater2019-04-301-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Full support for the XIVE native exploitation mode is now available, advertise the capability KVM_CAP_PPC_IRQ_XIVE for guests running on PowerNV KVM Hypervisors only. Support for nested guests (pseries KVM Hypervisor) is not yet available. XIVE should also have been activated which is default setting on POWER9 systems running a recent Linux kernel. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add passthrough supportCédric Le Goater2019-04-303-0/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The KVM XICS-over-XIVE device and the proposed KVM XIVE native device implement an IRQ space for the guest using the generic IPI interrupts of the XIVE IC controller. These interrupts are allocated at the OPAL level and "mapped" into the guest IRQ number space in the range 0-0x1FFF. Interrupt management is performed in the XIVE way: using loads and stores on the addresses of the XIVE IPI interrupt ESB pages. Both KVM devices share the same internal structure caching information on the interrupts, among which the xive_irq_data struct containing the addresses of the IPI ESB pages and an extra one in case of pass-through. The later contains the addresses of the ESB pages of the underlying HW controller interrupts, PHB4 in all cases for now. A guest, when running in the XICS legacy interrupt mode, lets the KVM XICS-over-XIVE device "handle" interrupt management, that is to perform the loads and stores on the addresses of the ESB pages of the guest interrupts. However, when running in XIVE native exploitation mode, the KVM XIVE native device exposes the interrupt ESB pages to the guest and lets the guest perform directly the loads and stores. The VMA exposing the ESB pages make use of a custom VM fault handler which role is to populate the VMA with appropriate pages. When a fault occurs, the guest IRQ number is deduced from the offset, and the ESB pages of associated XIVE IPI interrupt are inserted in the VMA (using the internal structure caching information on the interrupts). Supporting device passthrough in the guest running in XIVE native exploitation mode adds some extra refinements because the ESB pages of a different HW controller (PHB4) need to be exposed to the guest along with the initial IPI ESB pages of the XIVE IC controller. But the overall mechanic is the same. When the device HW irqs are mapped into or unmapped from the guest IRQ number space, the passthru_irq helpers, kvmppc_xive_set_mapped() and kvmppc_xive_clr_mapped(), are called to record or clear the passthrough interrupt information and to perform the switch. The approach taken by this patch is to clear the ESB pages of the guest IRQ number being mapped and let the VM fault handler repopulate. The handler will insert the ESB page corresponding to the HW interrupt of the device being passed-through or the initial IPI ESB page if the device is being removed. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a mapping for the source ESB pagesCédric Le Goater2019-04-301-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each source is associated with an Event State Buffer (ESB) with a even/odd pair of pages which provides commands to manage the source: to trigger, to EOI, to turn off the source for instance. The custom VM fault handler will deduce the guest IRQ number from the offset of the fault, and the ESB page of the associated XIVE interrupt will be inserted into the VMA using the internal structure caching information on the interrupts. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a TIMA mappingCédric Le Goater2019-04-301-0/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each thread has an associated Thread Interrupt Management context composed of a set of registers. These registers let the thread handle priority management and interrupt acknowledgment. The most important are : - Interrupt Pending Buffer (IPB) - Current Processor Priority (CPPR) - Notification Source Register (NSR) They are exposed to software in four different pages each proposing a view with a different privilege. The first page is for the physical thread context and the second for the hypervisor. Only the third (operating system) and the fourth (user level) are exposed the guest. A custom VM fault handler will populate the VMA with the appropriate pages, which should only be the OS page for now. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add get/set accessors for the VP XIVE stateCédric Le Goater2019-04-302-0/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The state of the thread interrupt management registers needs to be collected for migration. These registers are cached under the 'xive_saved_state.w01' field of the VCPU when the VPCU context is pulled from the HW thread. An OPAL call retrieves the backup of the IPB register in the underlying XIVE NVT structure and merges it in the KVM state. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a control to dirty the XIVE EQ pagesCédric Le Goater2019-04-301-0/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When migration of a VM is initiated, a first copy of the RAM is transferred to the destination before the VM is stopped, but there is no guarantee that the EQ pages in which the event notifications are queued have not been modified. To make sure migration will capture a consistent memory state, the XIVE device should perform a XIVE quiesce sequence to stop the flow of event notifications and stabilize the EQs. This is the purpose of the KVM_DEV_XIVE_EQ_SYNC control which will also marks the EQ pages dirty to force their transfer. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a control to sync the sourcesCédric Le Goater2019-04-301-0/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | This control will be used by the H_INT_SYNC hcall from QEMU to flush event notifications on the XIVE IC owning the source. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a global reset controlCédric Le Goater2019-04-301-0/+85
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This control is to be used by the H_INT_RESET hcall from QEMU. Its purpose is to clear all configuration of the sources and EQs. This is necessary in case of a kexec (for a kdump kernel for instance) to make sure that no remaining configuration is left from the previous boot setup so that the new kernel can start safely from a clean state. The queue 7 is ignored when the XIVE device is configured to run in single escalation mode. Prio 7 is used by escalations. The XIVE VP is kept enabled as the vCPU is still active and connected to the XIVE device. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add controls for the EQ configurationCédric Le Goater2019-04-303-6/+260
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These controls will be used by the H_INT_SET_QUEUE_CONFIG and H_INT_GET_QUEUE_CONFIG hcalls from QEMU to configure the underlying Event Queue in the XIVE IC. They will also be used to restore the configuration of the XIVE EQs and to capture the internal run-time state of the EQs. Both 'get' and 'set' rely on an OPAL call to access the EQ toggle bit and EQ index which are updated by the XIVE IC when event notifications are enqueued in the EQ. The value of the guest physical address of the event queue is saved in the XIVE internal xive_q structure for later use. That is when migration needs to mark the EQ pages dirty to capture a consistent memory state of the VM. To be noted that H_INT_SET_QUEUE_CONFIG does not require the extra OPAL call setting the EQ toggle bit and EQ index to configure the EQ, but restoring the EQ state will. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Add a control to configure a sourceCédric Le Goater2019-04-303-2/+104
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This control will be used by the H_INT_SET_SOURCE_CONFIG hcall from QEMU to configure the target of a source and also to restore the configuration of a source when migrating the VM. The XIVE source interrupt structure is extended with the value of the Effective Interrupt Source Number. The EISN is the interrupt number pushed in the event queue that the guest OS will use to dispatch events internally. Caching the EISN value in KVM eases the test when checking if a reconfiguration is indeed needed. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: add a control to initialize a sourceCédric Le Goater2019-04-303-4/+120
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The XIVE KVM device maintains a list of interrupt sources for the VM which are allocated in the pool of generic interrupts (IPIs) of the main XIVE IC controller. These are used for the CPU IPIs as well as for virtual device interrupts. The IRQ number space is defined by QEMU. The XIVE device reuses the source structures of the XICS-on-XIVE device for the source blocks (2-level tree) and for the source interrupts. Under XIVE native, the source interrupt caches mostly configuration information and is less used than under the XICS-on-XIVE device in which hcalls are still necessary at run-time. When a source is initialized in KVM, an IPI interrupt source is simply allocated at the OPAL level and then MASKED. KVM only needs to know about its type: LSI or MSI. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: XIVE: Introduce a new capability KVM_CAP_PPC_IRQ_XIVECédric Le Goater2019-04-304-41/+244
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The user interface exposes a new capability KVM_CAP_PPC_IRQ_XIVE to let QEMU connect the vCPU presenters to the XIVE KVM device if required. The capability is not advertised for now as the full support for the XIVE native exploitation mode is not yet available. When this is case, the capability will be advertised on PowerNV Hypervisors only. Nested guests (pseries KVM Hypervisor) are not supported. Internally, the interface to the new KVM device is protected with a new interrupt mode: KVMPPC_IRQ_XIVE. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * KVM: PPC: Book3S HV: Add a new KVM device for the XIVE native exploitation modeCédric Le Goater2019-04-303-2/+186
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the basic framework for the new KVM device supporting the XIVE native exploitation mode. The user interface exposes a new KVM device to be created by QEMU, only available when running on a L0 hypervisor. Support for nested guests is not available yet. The XIVE device reuses the device structure of the XICS-on-XIVE device as they have a lot in common. That could possibly change in the future if the need arise. Signed-off-by: Cédric Le Goater <clg@kaod.org> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * Merge remote-tracking branch 'remotes/powerpc/topic/ppc-kvm' into kvm-ppc-nextPaul Mackerras2019-04-302-11/+15
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | This merges in the ppc-kvm topic branch from the powerpc tree to get patches which touch both general powerpc code and KVM code, one of which is a prerequisite for following patches. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Save/restore vrsave register in kvmhv_p9_guest_entry()Suraj Jitindar Singh2019-04-301-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On POWER9 and later processors where the host can schedule vcpus on a per thread basis, there is a streamlined entry path used when the guest is radix. This entry path saves/restores the fp and vr state in kvmhv_p9_guest_entry() by calling store_[fp/vr]_state() and load_[fp/vr]_state(). This is the same as the old entry path however the old entry path also saved/restored the VRSAVE register, which isn't done in the new entry path. This means that the vrsave register is now volatile across guest exit, which is an incorrect change in behaviour. Fix this by saving/restoring the vrsave register in kvmhv_p9_guest_entry(). This restores the old, correct, behaviour. Fixes: 95a6432ce9038 ("KVM: PPC: Book3S HV: Streamlined guest entry/exit path on P9 for radix guests") Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Flush TLB on secondary radix threadsPaul Mackerras2019-04-303-62/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running on POWER9 with kvm_hv.indep_threads_mode = N and the host in SMT1 mode, KVM will run guest VCPUs on offline secondary threads. If those guests are in radix mode, we fail to load the LPID and flush the TLB if necessary, leading to the guest crashing with an unsupported MMU fault. This arises from commit 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on", 2018-05-17), which didn't consider the case where indep_threads_mode = N. For simplicity, this makes the real-mode guest entry path flush the TLB in the same place for both radix and hash guests, as we did before 9a4506e11b97, though the code is now C code rather than assembly code. We also have the radix TLB flush open-coded rather than calling radix__local_flush_tlb_lpid_guest(), because the TLB flush can be called in real mode, and in real mode we don't want to invoke the tracepoint code. Fixes: 9a4506e11b97 ("KVM: PPC: Book3S HV: Make radix handle process scoped LPID flush in C, with relocation on") Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Move HPT guest TLB flushing to C codePaul Mackerras2019-04-302-34/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This replaces assembler code in book3s_hv_rmhandlers.S that checks the kvm->arch.need_tlb_flush cpumask and optionally does a TLB flush with C code in book3s_hv_builtin.c. Note that unlike the radix version, the hash version doesn't do an explicit ERAT invalidation because we will invalidate and load up the SLB before entering the guest, and that will invalidate the ERAT. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Handle virtual mode in XIVE VCPU push codeSuraj Jitindar Singh2019-04-301-11/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code in book3s_hv_rmhandlers.S that pushes the XIVE virtual CPU context to the hardware currently assumes it is being called in real mode, which is usually true. There is however a path by which it can be executed in virtual mode, in the case where indep_threads_mode = N. A virtual CPU executing on an offline secondary thread can take a hypervisor interrupt in virtual mode and return from the kvmppc_hv_entry() call after the kvm_secondary_got_guest label. It is possible for it to be given another vcpu to execute before it gets to execute the stop instruction. In that case it will call kvmppc_hv_entry() for the second VCPU in virtual mode, and the XIVE vCPU push code will be executed in virtual mode. The result in that case will be a host crash due to an unexpected data storage interrupt caused by executing the stdcix instruction in virtual mode. This fixes it by adding a code path for virtual mode, which uses the virtual TIMA pointer and normal load/store instructions. [paulus@ozlabs.org - wrote patch description] Signed-off-by: Suraj Jitindar Singh <sjitindarsingh@gmail.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Fix XICS-on-XIVE H_IPI when priority = 0Paul Mackerras2019-04-301-38/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a bug in the XICS emulation on POWER9 machines which is triggered by the guest doing a H_IPI with priority = 0 (the highest priority). What happens is that the notification interrupt arrives at the destination at priority zero. The loop in scan_interrupts() sees that a priority 0 interrupt is pending, but because xc->mfrr is zero, we break out of the loop before taking the notification interrupt out of the queue and EOI-ing it. (This doesn't happen when xc->mfrr != 0; in that case we process the priority-0 notification interrupt on the first iteration of the loop, and then break out of a subsequent iteration of the loop with hirq == XICS_IPI.) To fix this, we move the prio >= xc->mfrr check down to near the end of the loop. However, there are then some other things that need to be adjusted. Since we are potentially handling the notification interrupt and also delivering an IPI to the guest in the same loop iteration, we need to update pending and handle any q->pending_count value before the xc->mfrr check, rather than at the end of the loop. Also, we need to update the queue pointers when we have processed and EOI-ed the notification interrupt, since we may not do it later. Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: smb->smp comment fixupPalmer Dabbelt2019-04-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I made the same typo when trying to grep for uses of smp_wmb and figured I might as well fix it. Signed-off-by: Palmer Dabbelt <palmer@sifive.com> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S: Allocate guest TCEs on demand tooAlexey Kardashevskiy2019-04-302-31/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already allocate hardware TCE tables in multiple levels and skip intermediate levels when we can, now it is a turn of the KVM TCE tables. Thankfully these are allocated already in 2 levels. This moves the table's last level allocation from the creating helper to kvmppc_tce_put() and kvm_spapr_tce_fault(). Since such allocation cannot be done in real mode, this creates a virtual mode version of kvmppc_tce_put() which handles allocations. This adds kvmppc_rm_ioba_validate() to do an additional test if the consequent kvmppc_tce_put() needs a page which has not been allocated; if this is the case, we bail out to virtual mode handlers. The allocations are protected by a new mutex as kvm->lock is not suitable for the task because the fault handler is called with the mmap_sem held but kvmhv_setup_mmu() locks kvm->lock and mmap_sem in the reverse order. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>
| | * | KVM: PPC: Book3S HV: Avoid lockdep debugging in TCE realmode handlersAlexey Kardashevskiy2019-04-302-31/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kvmppc_tce_to_ua() helper is called from real and virtual modes and it works fine as long as CONFIG_DEBUG_LOCKDEP is not enabled. However if the lockdep debugging is on, the lockdep will most likely break in kvm_memslots() because of srcu_dereference_check() so we need to use PPC-own kvm_memslots_raw() which uses realmode safe rcu_dereference_raw_notrace(). This creates a realmode copy of kvmppc_tce_to_ua() which replaces kvm_memslots() with kvm_memslots_raw(). Since kvmppc_rm_tce_to_ua() becomes static and can only be used inside HV KVM, this moves it earlier under CONFIG_KVM_BOOK3S_HV_POSSIBLE. This moves truly virtual-mode kvmppc_tce_to_ua() to where it belongs and drops the prmap parameter which was never used in the virtual mode. Fixes: d3695aa4f452 ("KVM: PPC: Add support for multiple-TCE hcalls", 2016-02-15) Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Signed-off-by: Paul Mackerras <paulus@ozlabs.org>