linux - linux

	Commit message (Collapse)	Author	Files	Lines
2015-08-20	xen/PMU: Describe vendor-specific PMU registers	Boris Ostrovsky	1	-1/+152
	AMD and Intel PMU register initialization and helpers that determine whether a register belongs to PMU. This and some of subsequent PMU emulation code is somewhat similar to Xen's PMU implementation. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/PMU: Initialization code for Xen PMU	Boris Ostrovsky	10	-9/+398
	Map shared data structure that will hold CPU registers, VPMU context, V/PCPU IDs of the CPU interrupted by PMU interrupt. Hypervisor fills this information in its handler and passes it to the guest for further processing. Set up PMU VIRQ. Now that perf infrastructure will assume that PMU is available on a PV guest we need to be careful and make sure that accesses via RDPMC instruction don't cause fatal traps by the hypervisor. Provide a nop RDPMC handler. For the same reason avoid issuing a warning on a write to APIC's LVTPC. Both of these will be made functional in later patches. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/PMU: Sysfs interface for setting Xen PMU mode	Boris Ostrovsky	7	-1/+228
	Set Xen's PMU mode via /sys/hypervisor/pmu/pmu_mode. Add XENPMU hypercall. Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: xensyms support	Boris Ostrovsky	6	-0/+183
	Export Xen symbols to dom0 via /proc/xen/xensyms (similar to /proc/kallsyms). Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: remove no longer needed p2m.h	Juergen Gross	4	-19/+9
	Cleanup by removing arch/x86/xen/p2m.h as it isn't needed any more. Most definitions in this file are used in p2m.c only. Move those into p2m.c. set_phys_range_identity() is already declared in arch/x86/include/asm/xen/page.h, add __init annotation there. MAX_REMAP_RANGES isn't used at all, just delete it. The only define left is P2M_PER_PAGE which is moved to page.h as well. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: allow more than 512 GB of RAM for 64 bit pv-domains	Juergen Gross	5	-27/+73
	64 bit pv-domains under Xen are limited to 512 GB of RAM today. The main reason has been the 3 level p2m tree, which was replaced by the virtual mapped linear p2m list. Parallel to the p2m list which is being used by the kernel itself there is a 3 level mfn tree for usage by the Xen tools and eventually for crash dump analysis. For this tree the linear p2m list can serve as a replacement, too. As the kernel can't know whether the tools are capable of dealing with the p2m list instead of the mfn tree, the limit of 512 GB can't be dropped in all cases. This patch replaces the hard limit by a kernel parameter which tells the kernel to obey the 512 GB limit or not. The default is selected by a configuration parameter which specifies whether the 512 GB limit should be active per default for domUs (domain save/restore/migration and crash dump analysis are affected). Memory above the domain limit is returned to the hypervisor instead of being identity mapped, which was wrong anyway. The kernel configuration parameter to specify the maximum size of a domain can be deleted, as it is not relevant any more. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: move p2m list if conflicting with e820 map	Juergen Gross	3	-47/+264
	Check whether the hypervisor supplied p2m list is placed at a location which is conflicting with the target E820 map. If this is the case relocate it to a new area unused up to now and compliant to the E820 map. As the p2m list might by huge (up to several GB) and is required to be mapped virtually, set up a temporary mapping for the copied list. For pvh domains just delete the p2m related information from start info instead of reserving the p2m memory, as we don't need it at all. For 32 bit kernels adjust the memblock_reserve() parameters in order to cover the page tables only. This requires to memblock_reserve() the start_info page on it's own. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: add explicit memblock_reserve() calls for special pages	Juergen Gross	3	-1/+19
	Some special pages containing interfaces to xen are being reserved implicitly only today. The memblock_reserve() call to reserve them is meant to reserve the p2m list supplied by xen. It is just reserving not only the p2m list itself, but some more pages up to the start of the xen built page tables. To be able to move the p2m list to another pfn range, which is needed for support of huge RAM, this memblock_reserve() must be split up to cover all affected reserved pages explicitly. The affected pages are: - start_info page - xenstore ring (might be missing, mfn is 0 in this case) - console ring (not for initial domain) Signed-off-by: Juergen Gross <jgross@suse.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	mm: provide early_memremap_ro to establish read-only mapping	Juergen Gross	3	-0/+17
	During early boot as Xen pv domain the kernel needs to map some page tables supplied by the hypervisor read only. This is needed to be able to relocate some data structures conflicting with the physical memory map especially on systems with huge RAM (above 512GB). Provide the function early_memremap_ro() to provide this read only mapping. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: check for initrd conflicting with e820 map	Juergen Gross	1	-0/+51
	Check whether the initrd is placed at a location which is conflicting with the target E820 map. If this is the case relocate it to a new area unused up to now and compliant to the E820 map. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: check pre-allocated page tables for conflict with memory map	Juergen Gross	3	-3/+23
	Check whether the page tables built by the domain builder are at memory addresses which are in conflict with the target memory map. If this is the case just panic instead of running into problems later. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: check for kernel memory conflicting with memory layout	Juergen Gross	1	-0/+12
	Checks whether the pre-allocated memory of the loaded kernel is in conflict with the target memory map. If this is the case, just panic instead of run into problems later, as there is nothing we can do to repair this situation. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: find unused contiguous memory area	Juergen Gross	2	-0/+35
	For being able to relocate pre-allocated data areas like initrd or p2m list it is mandatory to find a contiguous memory area which is not yet in use and doesn't conflict with the memory map we want to be in effect. In case such an area is found reserve it at once as this will be required to be done in any case. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: check memory area against e820 map	Juergen Gross	2	-0/+24
	Provide a service routine to check a physical memory area against the E820 map. The routine will return false if the complete area is RAM according to the E820 map and true otherwise. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: split counting of extra memory pages from remapping	Juergen Gross	1	-40/+58
	Memory pages in the initial memory setup done by the Xen hypervisor conflicting with the target E820 map are remapped. In order to do this those pages are counted and remapped in xen_set_identity_and_remap(). Split the counting from the remapping operation to be able to setup the needed memory sizes in time but doing the remap operation at a later time. This enables us to simplify the interface to xen_set_identity_and_remap() as the number of remapped and released pages is no longer needed here. Finally move the remapping further down to prepare relocating conflicting memory contents before the memory might be clobbered by xen_set_identity_and_remap(). This requires to not destroy the Xen E820 map when the one for the system is being constructed. Signed-off-by: Juergen Gross <jgross@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: move static e820 map to global scope	Juergen Gross	1	-47/+49
	Instead of using a function local static e820 map in xen_memory_setup() and calling various functions in the same source with the map as a parameter use a map directly accessible by all functions in the source. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: eliminate scalability issues from initial mapping setup	Juergen Gross	3	-39/+156
	Direct Xen to place the initial P->M table outside of the initial mapping, as otherwise the 1G (implementation) / 2G (theoretical) restriction on the size of the initial mapping limits the amount of memory a domain can be handed initially. As the initial P->M table is copied rather early during boot to domain private memory and it's initial virtual mapping is dropped, the easiest way to avoid virtual address conflicts with other addresses in the kernel is to use a user address area for the virtual address of the initial P->M table. This allows us to just throw away the page tables of the initial mapping after the copy without having to care about address invalidation. It should be noted that this patch won't enable a pv-domain to USE more than 512 GB of RAM. It just enables it to be started with a P->M table covering more memory. This is especially important for being able to boot a Dom0 on a system with more than 512 GB memory. Signed-off-by: Juergen Gross <jgross@suse.com> Based-on-patch-by: Jan Beulich <jbeulich@suse.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: don't build mfn tree if tools don't need it	Juergen Gross	1	-3/+7
	In case the Xen tools indicate they don't need the p2m 3 level tree as they support the virtual mapped linear p2m list, just omit building the tree. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: save linear p2m list address in shared info structure	Juergen Gross	1	-0/+17
	The virtual address of the linear p2m list should be stored in the shared info structure read by the Xen tools to be able to support 64 bit pv-domains larger than 512 GB. Additionally the linear p2m list interface includes a generation count which is changed prior to and after each mapping change of the p2m list. Reading the generation count the Xen tools can detect changes of the mappings and re-read the p2m list eventually. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen: sync with xen headers	Juergen Gross	2	-24/+107
	Use the newest headers from the xen tree to get some new structure layouts. Signed-off-by: Juergen Gross <jgross@suse.com> Reviewed-by: David Vrabel <david.vrabel@citrix.com> Acked-by: Konrad Rzeszutek Wilk <Konrad.wilk@oracle.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	arm/xen: Drop the definition of xen_pci_platform_unplug	Julien Grall	1	-3/+0
	The commit 6f6c15ef912465b3aaafe709f39bd6026a8b3e72 "xen/pvhvm: Remove the xen_platform_pci int." makes the x86 version of xen_pci_platform_unplug static. Therefore we don't need anymore to define a dummy xen_pci_platform_unplug for ARM. Signed-off-by: Julien Grall <julien.grall@citrix.com> Signed-off-by: Stefano Stabellini <stefano.stabellini@eu.citrix.com>
2015-08-20	xen/events: Support event channel rebind on ARM	Julien Grall	6	-10/+24
	Currently, the event channel rebind code is gated with the presence of the vector callback. The virtual interrupt controller on ARM has the concept of per-CPU interrupt (PPI) which allow us to support per-VCPU event channel. Therefore there is no need of vector callback for ARM. Xen is already using a free PPI to notify the guest VCPU of an event. Furthermore, the xen code initialization in Linux (see arch/arm/xen/enlighten.c) is requesting correctly a per-CPU IRQ. Introduce new helper xen_support_evtchn_rebind to allow architecture decide whether rebind an event is support or not. It will always return true on ARM and keep the same behavior on x86. This is also allow us to drop the usage of xen_have_vector_callback entirely in the ARM code. Signed-off-by: Julien Grall <julien.grall@citrix.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen-blkfront: convert to blk-mq APIs	Bob Liu	1	-86/+60
	Note: This patch is based on original work of Arianna's internship for GNOME's Outreach Program for Women. Only one hardware queue is used now, so there is no significant performance change The legacy non-mq code is deleted completely which is the same as other drivers like virtio, mtip, and nvme. Also dropped one unnecessary holding of info->io_lock when calling blk_mq_stop_hw_queues(). Signed-off-by: Arianna Avanzini <avanzini.arianna@gmail.com> Signed-off-by: Bob Liu <bob.liu@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Jens Axboe <axboe@fb.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	xen/preempt: use need_resched() instead of should_resched()	Konstantin Khlebnikov	1	-1/+1
	This code is used only when CONFIG_PREEMPT=n and only in non-atomic context: xen_in_preemptible_hcall is set only in privcmd_ioctl_hypercall(). Thus preempt_count is zero and should_resched() is equal to need_resched(). Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-20	x86/xen: fix non-ANSI declaration of xen_has_pv_devices()	Colin Ian King	1	-1/+1
	xen_has_pv_devices() has no parameters, so use the normal void parameter convention to make it match the prototype in the header file include/xen/platform_pci.h. Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David Vrabel <david.vrabel@citrix.com>
2015-08-17	Linux 4.2-rc7v4.2-rc7	Linus Torvalds	1	-1/+1

2015-08-17	x86/ldt: Further fix FPU emulation	Andy Lutomirski	1	-1/+1
	The previous fix confused a selector with a segment prefix. Fix it. Compile-tested only. Cc: stable@vger.kernel.org Cc: Juergen Gross <jgross@suse.com> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Fixes: 4809146b86c3 ("x86/ldt: Correct FPU emulation access to LDT") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-16	fs/fuse: fix ioctl type confusion	Jann Horn	1	-1/+9
	fuse_dev_ioctl() performed fuse_get_dev() on a user-supplied fd, leading to a type confusion issue. Fix it by checking file->f_op. Signed-off-by: Jann Horn <jann@thejh.net> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-16	MIPS: Fix seccomp syscall argument for MIPS64	Markos Chandras	2	-2/+2
	Commit 4c21b8fd8f14 ("MIPS: seccomp: Handle indirect system calls (o32)") fixed indirect system calls on O32 but it also introduced a bug for MIPS64 where it erroneously modified the v0 (syscall) register with the assumption that the sycall offset hasn't been taken into consideration. This breaks seccomp on MIPS64 n64 and n32 ABIs. We fix this by replacing the addition with a move instruction. Fixes: 4c21b8fd8f14 ("MIPS: seccomp: Handle indirect system calls (o32)") Cc: <stable@vger.kernel.org> # 3.15+ Reviewed-by: James Hogan <james.hogan@imgtec.com> Signed-off-by: Markos Chandras <markos.chandras@imgtec.com> Cc: linux-mips@linux-mips.org Patchwork: https://patchwork.linux-mips.org/patch/10951/ Signed-off-by: Ralf Baechle <ralf@linux-mips.org>
2015-08-15	Update maintainers for DRM STI driver	Benjamin Gaignard	1	-0/+9
	Add Vincent Abriou and myself as maintainers. Signed-off-by: Benjamin Gaignard <benjamin.gaignard@linaro.org> Cc: Vincent Abriou <vincent.abriou@st.com> Cc: Dave Airlie <airlied@linux.ie> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	mm: cma: mark cma_bitmap_maxno() inline in header	Gregory Fong	1	-1/+1
	cma_bitmap_maxno() was marked as static and not static inline, which can cause warnings about this function not being used if this file is included in a file that does not call that function, and violates the conventions used elsewhere. The two options are to move the function implementation back to mm/cma.c or make it inline here, and it's simple enough for the latter to make sense. Signed-off-by: Gregory Fong <gregory.0xf0@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Sasha Levin <sasha.levin@oracle.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	zram: fix pool name truncation	Sergey Senozhatsky	1	-4/+2
	zram_meta_alloc() constructs a pool name for zs_create_pool() call as snprintf(pool_name, sizeof(pool_name), "zram%d", device_id); However, it defines pool name buffer to be only 8 bytes long (minus trailing zero), which means that we can have only 1000 pool names: zram0 -- zram999. With CONFIG_ZSMALLOC_STAT enabled an attempt to create a device zram1000 can fail if device zram100 already exists, because snprintf() will truncate new pool name to zram100 and pass it debugfs_create_dir(), causing: debugfs dir <zram100> creation failed zram: Error creating memory pool ... and so on. Fix it by passing zram->disk->disk_name to zram_meta_alloc() instead of divice_id. We construct zram%d name earlier and keep it as a ->disk_name, no need to snprintf() it again. Signed-off-by: Sergey Senozhatsky <sergey.senozhatsky@gmail.com> Cc: Minchan Kim <minchan@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	memory-hotplug: fix wrong edge when hot add a new node	Xishi Qiu	2	-0/+11
	When we add a new node, the edge of memory may be wrong. e.g. system has 4 nodes, and node3 is movable, node3 mem:[24G-32G], 1. hotremove the node3, 2. then hotadd node3 with a part of memory, mem:[26G-30G], 3. call hotadd_new_pgdat() free_area_init_node() get_pfn_range_for_nid() 4. it will return wrong start_pfn and end_pfn, because we have not update the memblock. This patch also fixes a BUG_ON during hot-addition, please see http://marc.info/?l=linux-kernel&m=142961156129456&w=2 Signed-off-by: Xishi Qiu <qiuxishi@huawei.com> Cc: Yasuaki Ishimatsu <isimatu.yasuaki@jp.fujitsu.com> Cc: Kamezawa Hiroyuki <kamezawa.hiroyu@jp.fujitsu.com> Cc: Taku Izumi <izumi.taku@jp.fujitsu.com> Cc: Tang Chen <tangchen@cn.fujitsu.com> Cc: Gu Zheng <guz.fnst@cn.fujitsu.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	.mailmap: Andrey Ryabinin has moved	Andrey Ryabinin	3	-2/+3
	Update my email address. Signed-off-by: Andrey Ryabinin <ryabinin.a.a@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	ipc/sem.c: update/correct memory barriers	Manfred Spraul	1	-4/+14
	sem_lock() did not properly pair memory barriers: !spin_is_locked() and spin_unlock_wait() are both only control barriers. The code needs an acquire barrier, otherwise the cpu might perform read operations before the lock test. As no primitive exists inside <include/spinlock.h> and since it seems noone wants another primitive, the code creates a local primitive within ipc/sem.c. With regards to -stable: The change of sem_wait_array() is a bugfix, the change to sem_lock() is a nop (just a preprocessor redefinition to improve the readability). The bugfix is necessary for all kernels that use sem_wait_array() (i.e.: starting from 3.10). Signed-off-by: Manfred Spraul <manfred@colorfullife.com> Reported-by: Oleg Nesterov <oleg@redhat.com> Acked-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: "Paul E. McKenney" <paulmck@linux.vnet.ibm.com> Cc: Kirill Tkhai <ktkhai@parallels.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Josh Poimboeuf <jpoimboe@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: <stable@vger.kernel.org> [3.10+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	mm/hwpoison: fix panic due to split huge zero page	Wanpeng Li	1	-2/+5
	Bug: ------------[ cut here ]------------ kernel BUG at mm/huge_memory.c:1957! invalid opcode: 0000 [#1] SMP Modules linked in: snd_hda_codec_hdmi i915 rpcsec_gss_krb5 snd_hda_codec_realtek snd_hda_codec_generic nfsv4 dns_re CPU: 2 PID: 2576 Comm: test_huge Not tainted 4.2.0-rc5-mm1+ #27 Hardware name: Dell Inc. OptiPlex 7020/0F5C5X, BIOS A03 01/08/2015 task: ffff880204e3d600 ti: ffff8800db16c000 task.ti: ffff8800db16c000 RIP: split_huge_page_to_list+0xdb/0x120 Call Trace: memory_failure+0x32e/0x7c0 madvise_hwpoison+0x8b/0x160 SyS_madvise+0x40/0x240 ? do_page_fault+0x37/0x90 entry_SYSCALL_64_fastpath+0x12/0x71 Code: ff f0 41 ff 4c 24 30 74 0d 31 c0 48 83 c4 08 5b 41 5c 41 5d c9 c3 4c 89 e7 e8 e2 58 fd ff 48 83 c4 08 31 c0 RIP split_huge_page_to_list+0xdb/0x120 RSP <ffff8800db16fde8> ---[ end trace aee7ce0df8e44076 ]--- Testcase: #define _GNU_SOURCE #include <stdlib.h> #include <stdio.h> #include <sys/mman.h> #include <unistd.h> #include <fcntl.h> #include <sys/types.h> #include <errno.h> #include <string.h> #define MB 10241024 int main(void) { char mem; posix_memalign((void *)&mem, 2 MB, 200 * MB); madvise(mem, 200 * MB, MADV_HWPOISON); free(mem); return 0; } Huge zero page is allocated if page fault w/o FAULT_FLAG_WRITE flag. The get_user_pages_fast() which called in madvise_hwpoison() will get huge zero page if the page is not allocated before. Huge zero page is a tranparent huge page, however, it is not an anonymous page. memory_failure will split the huge zero page and trigger BUG_ON(is_huge_zero_page(page)); After commit 98ed2b0052e6 ("mm/memory-failure: give up error handling for non-tail-refcounted thp"), memory_failure will not catch non anon thp from madvise_hwpoison path and this bug occur. Fix it by catching non anon thp in memory_failure in order to not split huge zero page in madvise_hwpoison path. After this patch: Injecting memory failure for page 0x202800 at 0x7fd8ae800000 MCE: 0x202800: non anonymous thp [...] [akpm@linux-foundation.org: remove second split, per Wanpeng] Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	ipc,sem: remove uneeded sem_undo_list lock usage in exit_sem()	Herton R. Krzesinski	1	-2/+4
	After we acquire the sma->sem_perm lock in exit_sem(), we are protected against a racing IPC_RMID operation. Also at that point, we are the last user of sem_undo_list. Therefore it isn't required that we acquire or use ulp->lock. Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	ipc,sem: fix use after free on IPC_RMID after a task using same semaphore ↵	Herton R. Krzesinski	1	-6/+17
	set exits The current semaphore code allows a potential use after free: in exit_sem we may free the task's sem_undo_list while there is still another task looping through the same semaphore set and cleaning the sem_undo list at freeary function (the task called IPC_RMID for the same semaphore set). For example, with a test program [1] running which keeps forking a lot of processes (which then do a semop call with SEM_UNDO flag), and with the parent right after removing the semaphore set with IPC_RMID, and a kernel built with CONFIG_SLAB, CONFIG_SLAB_DEBUG and CONFIG_DEBUG_SPINLOCK, you can easily see something like the following in the kernel log: Slab corruption (Not tainted): kmalloc-64 start=ffff88003b45c1c0, len=64 000: 6b 6b 6b 6b 6b 6b 6b 6b 00 6b 6b 6b 6b 6b 6b 6b kkkkkkkk.kkkkkkk 010: ff ff ff ff 6b 6b 6b 6b ff ff ff ff ff ff ff ff ....kkkk........ Prev obj: start=ffff88003b45c180, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff c0 fb 01 37 00 88 ff ff ...........7.... Next obj: start=ffff88003b45c200, len=64 000: 00 00 00 00 ad 4e ad de ff ff ff ff 5a 5a 5a 5a .....N......ZZZZ 010: ff ff ff ff ff ff ff ff 68 29 a7 3c 00 88 ff ff ........h).<.... BUG: spinlock wrong CPU on CPU#2, test/18028 general protection fault: 0000 [#1] SMP Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 2 PID: 18028 Comm: test Not tainted 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: spin_dump+0x53/0xc0 Call Trace: spin_bug+0x30/0x40 do_raw_spin_unlock+0x71/0xa0 _raw_spin_unlock+0xe/0x10 freeary+0x82/0x2a0 ? _raw_spin_lock+0xe/0x10 semctl_down.clone.0+0xce/0x160 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 SyS_semctl+0x236/0x2c0 ? syscall_trace_leave+0xde/0x130 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 8b 80 88 03 00 00 48 8d 88 60 05 00 00 48 c7 c7 a0 2c a4 81 31 c0 65 8b 15 eb 40 f3 7e e8 08 31 68 00 4d 85 e4 44 8b 4b 08 74 5e <45> 8b 84 24 88 03 00 00 49 8d 8c 24 60 05 00 00 8b 53 04 48 89 RIP [<ffffffff810d6053>] spin_dump+0x53/0xc0 RSP <ffff88003750fd68> ---[ end trace 783ebb76612867a0 ]--- NMI watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [test:18053] Modules linked in: 8021q mrp garp stp llc nf_conntrack_ipv4 nf_defrag_ipv4 ip6t_REJECT nf_reject_ipv6 nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter ip6_tables binfmt_misc ppdev input_leds joydev parport_pc parport floppy serio_raw virtio_balloon virtio_rng virtio_console virtio_net iosf_mbi crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcspkr qxl ttm drm_kms_helper drm snd_hda_codec_generic i2c_piix4 snd_hda_intel snd_hda_codec snd_hda_core snd_hwdep snd_seq snd_seq_device snd_pcm snd_timer snd soundcore crc32c_intel virtio_pci virtio_ring virtio pata_acpi ata_generic [last unloaded: speedstep_lib] CPU: 3 PID: 18053 Comm: test Tainted: G D 4.2.0-rc5+ #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.8.1-20150318_183358- 04/01/2014 RIP: native_read_tsc+0x0/0x20 Call Trace: ? delay_tsc+0x40/0x70 __delay+0xf/0x20 do_raw_spin_lock+0x96/0x140 _raw_spin_lock+0xe/0x10 sem_lock_and_putref+0x11/0x70 SYSC_semtimedop+0x7bf/0x960 ? handle_mm_fault+0xbf6/0x1880 ? dequeue_task_fair+0x79/0x4a0 ? __do_page_fault+0x19a/0x430 ? kfree_debugcheck+0x16/0x40 ? __do_page_fault+0x19a/0x430 ? __audit_syscall_entry+0xa8/0x100 ? do_audit_syscall_entry+0x66/0x70 ? syscall_trace_enter_phase1+0x139/0x160 SyS_semtimedop+0xe/0x10 SyS_semop+0x10/0x20 entry_SYSCALL_64_fastpath+0x12/0x71 Code: 47 10 83 e8 01 85 c0 89 47 10 75 08 65 48 89 3d 1f 74 ff 7e c9 c3 0f 1f 44 00 00 55 48 89 e5 e8 87 17 04 00 66 90 c9 c3 0f 1f 00 <55> 48 89 e5 0f 31 89 c1 48 89 d0 48 c1 e0 20 89 c9 48 09 c8 c9 Kernel panic - not syncing: softlockup: hung tasks I wasn't able to trigger any badness on a recent kernel without the proper config debugs enabled, however I have softlockup reports on some kernel versions, in the semaphore code, which are similar as above (the scenario is seen on some servers running IBM DB2 which uses semaphore syscalls). The patch here fixes the race against freeary, by acquiring or waiting on the sem_undo_list lock as necessary (exit_sem can race with freeary, while freeary sets un->semid to -1 and removes the same sem_undo from list_proc or when it removes the last sem_undo). After the patch I'm unable to reproduce the problem using the test case [1]. [1] Test case used below: #include <stdio.h> #include <sys/types.h> #include <sys/ipc.h> #include <sys/sem.h> #include <sys/wait.h> #include <stdlib.h> #include <time.h> #include <unistd.h> #include <errno.h> #define NSEM 1 #define NSET 5 int sid[NSET]; void thread() { struct sembuf op; int s; uid_t pid = getuid(); s = rand() % NSET; op.sem_num = pid % NSEM; op.sem_op = 1; op.sem_flg = SEM_UNDO; semop(sid[s], &op, 1); exit(EXIT_SUCCESS); } void create_set() { int i, j; pid_t p; union { int val; struct semid_ds buf; unsigned short int array; struct seminfo __buf; } un; / Create and initialize semaphore set / for (i = 0; i < NSET; i++) { sid[i] = semget(IPC_PRIVATE , NSEM, 0644 \| IPC_CREAT); if (sid[i] < 0) { perror("semget"); exit(EXIT_FAILURE); } } un.val = 0; for (i = 0; i < NSET; i++) { for (j = 0; j < NSEM; j++) { if (semctl(sid[i], j, SETVAL, un) < 0) perror("semctl"); } } / Launch threads that operate on semaphore set / for (i = 0; i < NSEM NSET * NSET; i++) { p = fork(); if (p < 0) perror("fork"); if (p == 0) thread(); } /* Free semaphore set / for (i = 0; i < NSET; i++) { if (semctl(sid[i], NSEM, IPC_RMID)) perror("IPC_RMID"); } / Wait for forked processes to exit / while (wait(NULL)) { if (errno == ECHILD) break; }; } int main(int argc, char argv) { pid_t p; srand(time(NULL)); while (1) { p = fork(); if (p < 0) { perror("fork"); exit(EXIT_FAILURE); } if (p == 0) { create_set(); goto end; } / Wait for forked processes to exit */ while (wait(NULL)) { if (errno == ECHILD) break; }; } end: return 0; } [akpm@linux-foundation.org: use normal comment layout] Signed-off-by: Herton R. Krzesinski <herton@redhat.com> Acked-by: Manfred Spraul <manfred@colorfullife.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Rafael Aquini <aquini@redhat.com> CC: Aristeu Rozanski <aris@redhat.com> Cc: David Jeffery <djeffery@redhat.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	mm/hwpoison: fix fail isolate hugetlbfs page w/ refcount held	Wanpeng Li	1	-7/+6
	Hugetlbfs pages will get a refcount in get_any_page() or madvise_hwpoison() if soft offlining through madvise. The refcount which is held by the soft offline path should be released if we fail to isolate hugetlbfs pages. Fix it by reducing the refcount for both isolation success and failure. Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-15	mm/hwpoison: fix page refcount of unknown non LRU page	Wanpeng Li	1	-0/+2
	After trying to drain pages from pagevec/pageset, we try to get reference count of the page again, however, the reference count of the page is not reduced if the page is still not on LRU list. Fix it by adding the put_page() to drop the page reference which is from __get_any_page(). Signed-off-by: Wanpeng Li <wanpeng.li@hotmail.com> Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: <stable@vger.kernel.org> [3.9+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-14	Revert "drm/nouveau/fifo/gk104: kick channels when deactivating them"	Alexandre Courbot	1	-21/+8
	This reverts commit 1addc1264852 This commit seems to cause crashes in gk104_fifo_intr_runlist() by returning 0xbad0da00 when register 0x2a00 is read. Since this commit was intended for GM20B which is not completely supported yet, let's revert it for the time being. Reported-by: Eric Biggers <ebiggers3@gmail.com> Signed-off-by: Alexandre Courbot <acourbot@nvidia.com> Tested-by: Afzal Mohammed <afzal.mohd.ma@gmail.com> Signed-off-by: Dave Airlie <airlied@redhat.com>
2015-08-14	drm/vmwgfx: Fix execbuf locking issues	Thomas Hellstrom	1	-2/+2
	This addresses two issues that cause problems with viewperf maya-03 in situation with memory pressure. The first issue causes attempts to unreserve buffers if batched reservation fails due to, for example, a signal pending. While previously the ttm_eu api was resistant against this type of error, it is no longer and the lockdep code will complain about attempting to unreserve buffers that are not reserved. The issue is resolved by avoid calling ttm_eu_backoff_reservation in the buffer reserve error path. The second issue is that the binding_mutex may be held when user-space fence objects are created and hence during memory reclaims. This may cause recursive attempts to grab the binding mutex. The issue is resolved by not holding the binding mutex across fence creation and submission. Signed-off-by: Thomas Hellstrom <thellstrom@vmware.com> Reviewed-by: Sinclair Yeh <syeh@vmware.com> Cc: <stable@vger.kernel.org> Signed-off-by: Dave Airlie <airlied@redhat.com>
2015-08-14	x86: fix error handling for 32-bit compat out-of-range system call numbers	Linus Torvalds	1	-1/+2
	Commit 3f5159a9221f ("x86/asm/entry/32: Update -ENOSYS handling to match the 64-bit logic") broke the ENOSYS handling for the 32-bit compat case. The proper error return value was never loaded into %rax, except if things just happened to go through the audit paths, which ended up reloading the return value. This moves the loading or %rax into the normal system call path, just to make sure the error case triggers it. It's kind of sad, since it adds a useless instruction to reload the register to the fast path, but it's not like that single load from the stack is going to be noticeable. Reported-by: David Drysdale <drysdale@google.com> Tested-by: Kees Cook <keescook@chromium.org> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Denys Vlasenko <dvlasenk@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-13	Revert x86 sigcontext cleanups	Linus Torvalds	3	-36/+17
	This reverts commits 9a036b93a344 ("x86/signal/64: Remove 'fs' and 'gs' from sigcontext") and c6f2062935c8 ("x86/signal/64: Fix SS handling for signals delivered to 64-bit programs"). They were cleanups, but they break dosemu by changing the signal return behavior (and removing 'fs' and 'gs' from the sigcontext struct - while not actually changing any behavior - causes build problems). Reported-and-tested-by: Stas Sergeev <stsp@list.ru> Acked-by: Andy Lutomirski <luto@amacapital.net> Cc: Ingo Molnar <mingo@kernel.org> Cc: stable@vger.kernel.org Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2015-08-13	ARM: dts: keystone: Fix the mdio bindings by moving it to soc specific file	Murali Karicheri	4	-20/+33
	Currently mdio bindings are defined in keystone.dtsi and this results in incorrect unit address for the node on K2E and K2L SoCs. Fix this by moving them to SoC specific DTS file. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
2015-08-13	ARM: dts: keystone: fix the clock node for mdio	Murali Karicheri	1	-1/+1
	Currently the MDIO clock is pointing to clkpa instead of clkcpgmac. MDIO is part of the ethss and the clock should be clkcpgmac. Signed-off-by: Murali Karicheri <m-karicheri2@ti.com> Signed-off-by: Santosh Shilimkar <ssantosh@kernel.org>
2015-08-13	EDAC, ppc4xx: Access mci->csrows array elements properly	Michael Walle	1	-1/+1
	The commit de3910eb79ac ("edac: change the mem allocation scheme to make Documentation/kobject.txt happy") changed the memory allocation for the csrows member. But ppc4xx_edac was forgotten in the patch. Fix it. Signed-off-by: Michael Walle <michael@walle.cc> Cc: <stable@vger.kernel.org> Cc: linux-edac <linux-edac@vger.kernel.org> Cc: Mauro Carvalho Chehab <mchehab@osg.samsung.com> Link: http://lkml.kernel.org/r/1437469253-8611-1-git-send-email-michael@walle.cc Signed-off-by: Borislav Petkov <bp@suse.de>
2015-08-13	cosa: missing error code on failure in probe()	Dan Carpenter	1	-1/+2
	If register_hdlc_device() fails, the current code returns 0 but we should return an error code instead. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12	gianfar: remove faulty filer optimizer	Jakub Kicinski	1	-337/+0
	Current filer rule optimization is broken in several ways: (1) Can perform reads/writes beyond end of allocated tables. (gianfar_ethtool.c:1326). (2) It breaks badly for rules with more than 2 specifiers (e.g. matching ip, port, tos). Example: # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.1 dst-port 1 tos 1 action 1 Added rule with ID 254 # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.2 dst-port 2 tos 2 action 9 Added rule with ID 253 # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.3 dst-port 3 tos 3 action 17 Added rule with ID 252 # ./filer_decode /sys/kernel/debug/gfar1/filer_raw 00: MASK == 00000210 AND Q:00 ctrl:00000080 prop:00000210 01: FPR == 00000210 AND CLE Q:00 ctrl:00000281 prop:00000210 02: MASK == ffffffff AND Q:00 ctrl:00000080 prop:ffffffff 03: DPT == 00000003 AND Q:00 ctrl:0000008e prop:00000003 04: TOS == 00000003 AND Q:00 ctrl:0000008a prop:00000003 05: DIA == 0a000003 AND Q:11 ctrl:0000448c prop:0a000003 06: DPT == 00000002 AND Q:00 ctrl:0000008e prop:00000002 07: TOS == 00000002 AND Q:00 ctrl:0000008a prop:00000002 08: DIA == 0a000002 AND Q:09 ctrl:0000248c prop:0a000002 09: DIA == 0a000001 AND Q:00 ctrl:0000008c prop:0a000001 0a: DPT == 00000001 AND Q:00 ctrl:0000008e prop:00000001 0b: TOS == 00000001 CLE Q:01 ctrl:0000060a prop:00000001 ff: MASK >= 00000000 Q:00 ctrl:00000020 prop:00000000 (Entire cluster gets AND-ed together). (3) We observed that the masking rules it generates do not play well with clustering on P2020. Only first rule of the cluster would ever fire. Given that optimizer relies heavily on masking this is very hard to fix. Example: # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.1 dst-port 1 action 1 Added rule with ID 254 # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.2 dst-port 2 action 9 Added rule with ID 253 # ethtool -N eth2 flow-type udp4 dst-ip 10.0.0.3 dst-port 3 action 17 Added rule with ID 252 # ./filer_decode /sys/kernel/debug/gfar1/filer_raw 00: MASK == 00000210 AND Q:00 ctrl:00000080 prop:00000210 01: FPR == 00000210 AND CLE Q:00 ctrl:00000281 prop:00000210 02: MASK == ffffffff AND Q:00 ctrl:00000080 prop:ffffffff 03: DPT == 00000003 AND Q:00 ctrl:0000008e prop:00000003 04: DIA == 0a000003 Q:11 ctrl:0000440c prop:0a000003 05: DPT == 00000002 AND Q:00 ctrl:0000008e prop:00000002 06: DIA == 0a000002 Q:09 ctrl:0000240c prop:0a000002 07: DIA == 0a000001 AND Q:00 ctrl:0000008c prop:0a000001 08: DPT == 00000001 CLE Q:01 ctrl:0000060e prop:00000001 ff: MASK >= 00000000 Q:00 ctrl:00000020 prop:00000000 Which looks correct according to the spec but only the first (eth id 252)/last added rule for 10.0.0.3 will ever trigger. As if filer did not treat the AND CLE as cluster start but also kept AND-ing the rules. We found no errata covering this. The fact that nobody noticed (2) or (3) makes me think that this feature is not very widely used and we should just remove it. Reported-by: Aleksander Dutkowski <adutkowski@gmail.com> Signed-off-by: Jakub Kicinski <kubakici@wp.pl> Acked-by: Claudiu Manoil <claudiu.manoil@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2015-08-12	gianfar: correct list membership accounting	Jakub Kicinski	1	-1/+2
	At a cost of one line let's make sure .count is correct when calling gfar_process_filer_changes(). Signed-off-by: Jakub Kicinski <kubakici@wp.pl> Signed-off-by: David S. Miller <davem@davemloft.net>