summaryrefslogtreecommitdiffstats
path: root/Documentation/admin-guide (follow)
Commit message (Collapse)AuthorAgeFilesLines
* ipe: allow secondary and platform keyrings to install/update policiesLuca Boccassi2024-10-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The current policy management makes it impossible to use IPE in a general purpose distribution. In such cases the users are not building the kernel, the distribution is, and access to the private key included in the trusted keyring is, for obvious reason, not available. This means that users have no way to enable IPE, since there will be no built-in generic policy, and no access to the key to sign updates validated by the trusted keyring. Just as we do for dm-verity, kernel modules and more, allow the secondary and platform keyrings to also validate policies. This allows users enrolling their own keys in UEFI db or MOK to also sign policies, and enroll them. This makes it sensible to enable IPE in general purpose distributions, as it becomes usable by any user wishing to do so. Keys in these keyrings can already load kernels and kernel modules, so there is no security downgrade. Add a kconfig each, like dm-verity does, but default to enabled if the dependencies are available. Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> [FW: fixed some style issues] Signed-off-by: Fan Wu <wufan@kernel.org>
* ipe: also reject policy updates with the same versionLuca Boccassi2024-10-171-1/+1
| | | | | | | | | | | | | | Currently IPE accepts an update that has the same version as the policy being updated, but it doesn't make it a no-op nor it checks that the old and new policyes are the same. So it is possible to change the content of a policy, without changing its version. This is very confusing from userspace when managing policies. Instead change the update logic to reject updates that have the same version with ESTALE, as that is much clearer and intuitive behaviour. Signed-off-by: Luca Boccassi <bluca@debian.org> Reviewed-by: Serge Hallyn <serge@hallyn.com> Signed-off-by: Fan Wu <wufan@kernel.org>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-09-281-0/+17
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
| * Merge branch 'kvm-redo-enable-virt' into HEADPaolo Bonzini2024-09-171-0/+17
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * KVM: Add a module param to allow enabling virtualization when KVM is loadedSean Christopherson2024-09-041-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an on-by-default module param, enable_virt_at_load, to let userspace force virtualization to be enabled in hardware when KVM is initialized, i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization during KVM initialization allows userspace to avoid the additional latency when creating/destroying the first/last VM (or more specifically, on the 0=>1 and 1=>0 edges of creation/destruction). Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency could be non-trivial as the cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could be problematic for use case that need to spin up VMs quickly. Prior to commit 10474ae8945c ("KVM: Activate Virtualization On Demand"), KVM _unconditionally_ enabled virtualization during load, i.e. there's no fundamental reason KVM needs to dynamically toggle virtualization. These days, the only known argument for not enabling virtualization is to allow KVM to be autoloaded without blocking other out-of-tree hypervisors, and such use cases can simply change the module param, e.g. via command line. Note, the aforementioned commit also mentioned that enabling SVM (AMD's virtualization extensions) can result in "using invalid TLB entries". It's not clear whether the changelog was referring to a KVM bug, a CPU bug, or something else entirely. Regardless, leaving virtualization off by default is not a robust "fix", as any protection provided is lost the instant userspace creates the first VM. Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | Merge tag 'for-6.12/dm-changes' of ↵Linus Torvalds2024-09-273-10/+42
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm Pull device mapper updates from Mikulas Patocka: - Misc VDO fixes - Remove unused declarations dm_get_rq_mapinfo() and dm_zone_map_bio() - Dm-delay: Improve kernel documentation - Dm-crypt: Allow to specify the integrity key size as an option - Dm-bufio: Remove pointless NULL check - Small code cleanups: Use ERR_CAST; remove unlikely() around IS_ERR; use __assign_bit - Dm-integrity: Fix gcc 5 warning; convert comma to semicolon; fix smatch warning - Dm-integrity: Support recalculation in the 'I' mode - Revert "dm: requeue IO if mapping table not yet available" - Dm-crypt: Small refactoring to make the code more readable - Dm-cache: Remove pointless error check - Dm: Fix spelling errors - Dm-verity: Restart or panic on an I/O error if restart or panic was requested - Dm-verity: Fallback to platform keyring also if key in trusted keyring is rejected * tag 'for-6.12/dm-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/device-mapper/linux-dm: (26 commits) dm verity: fallback to platform keyring also if key in trusted keyring is rejected dm-verity: restart or panic on an I/O error dm: fix spelling errors dm-cache: remove pointless error check dm vdo: handle unaligned discards correctly dm vdo indexer: Convert comma to semicolon dm-crypt: Use common error handling code in crypt_set_keyring_key() dm-crypt: Use up_read() together with key_put() only once in crypt_set_keyring_key() Revert "dm: requeue IO if mapping table not yet available" dm-integrity: check mac_size against HASH_MAX_DIGESTSIZE in sb_mac() dm-integrity: support recalculation in the 'I' mode dm integrity: Convert comma to semicolon dm integrity: fix gcc 5 warning dm: Make use of __assign_bit() API dm integrity: Remove extra unlikely helper dm: Convert to use ERR_CAST() dm bufio: Remove NULL check of list_entry() dm-crypt: Allow to specify the integrity key size as option dm: Remove unused declaration and empty definition "dm_zone_map_bio" dm delay: enhance kernel documentation ...
| * | | dm-crypt: Allow to specify the integrity key size as optionIngo Franzki2024-08-211-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For the MAC based integrity operation, the integrity key size (i.e. key_mac_size) is currently set to the digest size of the used digest. For wrapped key HMAC algorithms, the key size is independent of the cryptographic key size. So there is no known size of the mac key in such cases. The desired key size can optionally be specified as argument when the dm-crypt device is configured via 'integrity_key_size:%u'. If no integrity_key_size argument is specified, the mac key size is still set to the digest size, as before. Increase version number to 1.28.0 so that support for the new argument can be detected by user space (i.e. cryptsetup). Signed-off-by: Ingo Franzki <ifranzki@linux.ibm.com> Reviewed-by: Milan Broz <gmazyland@gmail.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| * | | dm delay: enhance kernel documentationHeinz Mauelshagen2024-08-211-9/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit improves documentation of the dm-delay target. Signed-off-by: Heinz Mauelshagen <heinzm@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
| * | | dm vdo: add dmsetup message for returning configuration infoBruce Johnston2024-08-211-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new dmsetup message called config, which will return useful configuration information for the vdo volume and the uds index associated with it. The output is a YAML string, and contains a version number to allow future additions to the content. Signed-off-by: Bruce Johnston <bjohnsto@redhat.com> Signed-off-by: Matthew Sakai <msakai@redhat.com> Signed-off-by: Mikulas Patocka <mpatocka@redhat.com>
* | | | Merge tag 'media/v6.12-1' of ↵Linus Torvalds2024-09-244-13/+112
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media Pull media updates from Mauro Carvalho Chehab: - New CEC driver: Extron DA HD 4K Plus - Lots of driver fixes, cleanups and improvements * tag 'media/v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/mchehab/linux-media: (179 commits) media: atomisp: Use clamp() in ia_css_eed1_8_vmem_encode() media: atomisp: Fix eed1_8 code assigning signed values to an unsigned variable media: atomisp: set lock before calling vb2_queue_init() media: atomisp: Improve binary finding debug logging media: atomisp: Drop dev_dbg() calls from hmm_[alloc|free]() media: atomisp: csi2-bridge: Add DMI quirk for t4ka3 on Xiaomi Mipad2 media: atomisp: add missing wait_prepare/finish ops media: atomisp: Remove unused declaration media: atomisp: use clamp() in compute_coring() media: atomisp: use clamp() in ia_css_eed1_8_encode() media: atomisp: Simplify ia_css_pipe_create_cas_scaler_desc_single_output() media: atomisp: Replace rarely used macro from math_support.h media: atomisp: Remove duplicated leftover, i.e. sh_css_dvs_info.h media: atomisp: bnr: fix trailing statement media: atomisp: move trailing */ to separate lines media: atomisp: move trailing statement to next line. media: atomisp: Fix trailing statement in ia_css_de.host.c media: atomisp: Fix spelling mistakes in atomisp.h media: atomisp: Fix spelling mistakes in atomisp_platform.h media: atomisp: Fix spelling mistake in csi_rx_public.h ...
| * | | | media: cec: extron-da-hd-4k-plus: add the Extron DA HD 4K Plus CEC driverHans Verkuil2024-09-051-0/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for the Extron DA HD 4K Plus series of 4K HDMI Distrubution Amplifiers (aka HDMI Splitters). These devices support CEC and this driver adds support for the CEC protocol for both the input and all outputs (2, 4 or 6 outputs, depending on the model). It also exports the EDID from the outputs and allows reading and setting the EDID of the input. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl> Signed-off-by: Mauro Carvalho Chehab <mchehab+huawei@kernel.org>
| * | | | Merge tag 'next-media-rkisp1-20240814' of ↵Hans Verkuil2024-08-141-2/+9
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pinchartl/linux.git Extensible parameters support for the rkisp1 driver. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
| | * | | | media: uapi: videodev2: Add V4L2_META_FMT_RK_ISP1_EXT_PARAMSJacopo Mondi2024-08-121-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The rkisp1 driver stores ISP configuration parameters in the fixed rkisp1_params_cfg structure. As the members of the structure are part of the userspace API, the structure layout is immutable and cannot be extended further. Introducing new parameters or modifying the existing ones would change the buffer layout and cause breakages in existing applications. The allow for future extensions to the ISP parameters, introduce a new extensible parameters format, with a new format 4CC. Document usage of the new format in the rkisp1 admin guide. Signed-off-by: Jacopo Mondi <jacopo.mondi@ideasonboard.com> Reviewed-by: Daniel Scally <dan.scally@ideasonboard.com> Reviewed-by: Paul Elder <paul.elder@ideasonboard.com> Reviewed-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Tested-by: Kieran Bingham <kieran.bingham@ideasonboard.com> Acked-by: Sakari Ailus <sakari.ailus@linux.intel.com> Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com>
| * | | | | media: admin-guide: mgb4: Outputs DV timings documentation updateMartin Tůma2024-08-141-9/+14
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Properly document the function of the mgb4 output "frame_rate" sysfs parameter and update the default DV timings values according to the latest code changes. Signed-off-by: Martin Tůma <martin.tuma@digiteqautomotive.com> Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
| * | | | Documentation: media: vivid.rst: update TODO listHans Verkuil2024-08-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vivid driver supports media controller support for quite a long time now, so drop that from the list. Since commit 4c4dacb052d4 ("media: vivid: loopback based on 'Connected To' controls") making EDID changes causes correct signaling to happen, but what is still missing is the 100 ms delay required before signaling that there is an HPD. Modify this TODO item accordingly. Signed-off-by: Hans Verkuil <hverkuil-cisco@xs4all.nl>
* | | | | Merge tag 'trace-ring-buffer-v6.12' of ↵Linus Torvalds2024-09-221-0/+45
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace Pull ring-buffer updates from Steven Rostedt: - tracing/ring-buffer: persistent buffer across reboots This allows for the tracing instance ring buffer to stay persistent across reboots. The way this is done is by adding to the kernel command line: trace_instance=boot_map@0x285400000:12M This will reserve 12 megabytes at the address 0x285400000, and then map the tracing instance "boot_map" ring buffer to that memory. This will appear as a normal instance in the tracefs system: /sys/kernel/tracing/instances/boot_map A user could enable tracing in that instance, and on reboot or kernel crash, if the memory is not wiped by the firmware, it will recreate the trace in that instance. For example, if one was debugging a shutdown of a kernel reboot: # cd /sys/kernel/tracing # echo function > instances/boot_map/current_tracer # reboot [..] # cd /sys/kernel/tracing # tail instances/boot_map/trace swapper/0-1 [000] d..1. 164.549800: restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549801: native_restore_boot_irq_mode <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549802: disconnect_bsp_APIC <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549811: hpet_disable <-native_machine_shutdown swapper/0-1 [000] d..1. 164.549812: iommu_shutdown_noop <-native_machine_restart swapper/0-1 [000] d..1. 164.549813: native_machine_emergency_restart <-__do_sys_reboot swapper/0-1 [000] d..1. 164.549813: tboot_shutdown <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549820: acpi_reboot <-native_machine_emergency_restart swapper/0-1 [000] d..1. 164.549821: acpi_reset <-acpi_reboot swapper/0-1 [000] d..1. 164.549822: acpi_os_write_port <-acpi_reboot On reboot, the buffer is examined to make sure it is valid. The validation check even steps through every event to make sure the meta data of the event is correct. If any test fails, it will simply reset the buffer, and the buffer will be empty on boot. - Allow the tracing persistent boot buffer to use the "reserve_mem" option Instead of having the admin find a physical address to store the persistent buffer, which can be very tedious if they have to administrate several different machines, allow them to use the "reserve_mem" option that will find a location for them. It is not as reliable because of KASLR, as the loading of the kernel in different locations can cause the memory allocated to be inconsistent. Booting with "nokaslr" can make reserve_mem more reliable. - Have function graph tracer handle offsets from a previous boot. The ring buffer output from a previous boot may have different addresses due to kaslr. Have the function graph tracer handle these by using the delta from the previous boot to the new boot address space. - Only reset the saved meta offset when the buffer is started or reset In the persistent memory meta data, it holds the previous address space information, so that it can calculate the delta to have function tracing work. But this gets updated after being read to hold the new address space. But if the buffer isn't used for that boot, on reboot, the delta is now calculated from the previous boot and not the boot that holds the data in the ring buffer. This causes the functions not to be shown. Do not save the address space information of the current kernel until it is being recorded. - Add a magic variable to test the valid meta data Add a magic variable in the meta data that can also be used for validation. The validator of the previous buffer doesn't need this magic data, but it can be used if the meta data is changed by a new kernel, which may have the same format that passes the validator but is used differently. This magic number can also be used as a "versioning" of the meta data. - Align user space mapped ring buffer sub buffers to improve TLB entries Linus mentioned that the mapped ring buffer sub buffers were misaligned between the meta page and the sub-buffers, so that if the sub-buffers were bigger than PAGE_SIZE, it wouldn't allow the TLB to use bigger entries. - Add new kernel command line "traceoff" to disable tracing on boot for instances If tracing is enabled for a boot instance, there needs a way to be able to disable it on boot so that new events do not get entered into the ring buffer and be mixed with events from a previous boot, as that can be confusing. - Allow trace_printk() to go to other instances Currently, trace_printk() can only go to the top level instance. When debugging with a persistent buffer, it is really useful to be able to add trace_printk() to go to that buffer, so that you have access to them after a crash. - Do not use "bin_printk()" for traces to a boot instance The bin_printk() saves only a pointer to the printk format in the ring buffer, as the reader of the buffer can still have access to it. But this is not the case if the buffer is from a previous boot. If the trace_printk() is going to a "persistent" buffer, it will use the slower version that writes the printk format into the buffer. - Add command line option to allow trace_printk() to go to an instance Allow the kernel command line to define which instance the trace_printk() goes to, instead of forcing the admin to set it for every boot via the tracefs options. - Start a document that explains how to use tracefs to debug the kernel - Add some more kernel selftests to test user mapped ring buffer * tag 'trace-ring-buffer-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/trace/linux-trace: (28 commits) selftests/ring-buffer: Handle meta-page bigger than the system selftests/ring-buffer: Verify the entire meta-page padding tracing/Documentation: Start a document on how to debug with tracing tracing: Add option to set an instance to be the trace_printk destination tracing: Have trace_printk not use binary prints if boot buffer tracing: Allow trace_printk() to go to other instance buffers tracing: Add "traceoff" flag to boot time tracing instances ring-buffer: Align meta-page to sub-buffers for improved TLB usage ring-buffer: Add magic and struct size to boot up meta data ring-buffer: Don't reset persistent ring-buffer meta saved addresses tracing/fgraph: Have fgraph handle previous boot function addresses tracing: Allow boot instances to use reserve_mem boot memory tracing: Fix ifdef of snapshots to not prevent last_boot_info file ring-buffer: Use vma_pages() helper function tracing: Fix NULL vs IS_ERR() check in enable_instances() tracing: Add last boot delta offset for stack traces tracing: Update function tracing output for previous boot buffer tracing: Handle old buffer mappings for event strings and functions tracing/ring-buffer: Add last_boot_info file to boot instance ring-buffer: Save text and data locations in mapped meta data ...
| * | | | | tracing/Documentation: Start a document on how to debug with tracingSteven Rostedt2024-08-261-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a new document Documentation/trace/debugging.rst that will hold various ways to debug tracing. This initial version mentions trace_printk and how to create persistent buffers that can last across bootups. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.702433486@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | tracing: Have trace_printk not use binary prints if boot bufferSteven Rostedt2024-08-261-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If the persistent boot mapped ring buffer is used for trace_printk(), force it to not use the binary versions. trace_printk() by default uses bin_printf() that only saves the pointer to the format and not the format itself inside the ring buffer. But for a persistent buffer that is read after reboot, the pointers to the format strings may not be the same, or worse, not even exist! Instead, just force the more robust, but slower, version that does the formatting before saving into the ring buffer. The boot mapped buffer can now be used for trace_printk and friends! Using the trace_printk() and the persistent buffer was used to debug the issue with the osnoise tracer: Link: https://lore.kernel.org/all/20240822103443.6a6ae051@gandalf.local.home/ Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.386925800@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | tracing: Allow trace_printk() to go to other instance buffersSteven Rostedt2024-08-261-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, trace_printk() just goes to the top level ring buffer. But there may be times that it should go to one of the instances created by the kernel command line. Add a new trace_instance flag: traceprintk (also can use "printk" or "trace_printk" as people tend to forget the actual flag name). trace_instance=foo^traceprintk Will assign the trace_printk to this buffer at boot up. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.226694946@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | tracing: Add "traceoff" flag to boot time tracing instancesSteven Rostedt2024-08-261-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a "flags" delimiter (^) to the "trace_instance" kernel command line parameter, and add the "traceoff" flag. The format is: trace_instance=<name>[^<flag1>[^<flag2>]][@<memory>][,<events>] The code allows for more than one flag to be added, but currently only "traceoff" is done so. The motivation for this change came from debugging with the persistent ring buffer and having trace_printk() writing to it. The trace_printk calls are always enabled, and the boot after the crash was having the unwanted trace_printks from the current boot inject into the ring buffer with the trace_printks of the crash kernel, making the output very confusing. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: Alexander Aring <aahringo@redhat.com> Cc: "Luis Claudio R. Goncalves" <lgoncalv@redhat.com> Cc: Tomas Glozar <tglozar@redhat.com> Cc: John Kacur <jkacur@redhat.com> Cc: Clark Williams <williams@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: "Jonathan Corbet" <corbet@lwn.net> Link: https://lore.kernel.org/20240823014019.053229958@goodmis.org Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | tracing: Allow boot instances to use reserve_mem boot memorySteven Rostedt (Google)2024-08-151-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow boot instances to use memory reserved by the reserve_mem boot option. reserve_mem=12M:4096:trace trace_instance=boot_mapped@trace The above will allocate 12 megs with 4096 alignment and label it "trace". The second parameter will create a "boot_mapped" instance and use the memory reserved and labeled as "trace" as the memory for the ring buffer. That will create an instance called "boot_mapped": /sys/kernel/tracing/instances/boot_mapped Note, because the ring buffer is using a defined memory ranged, it will act just like a memory mapped ring buffer. It will not have a snapshot buffer, as it can't swap out the buffer. The snapshot files as well as any tracers that uses a snapshot will not be present in the boot_mapped instance. Also note that reserve_mem is not reliable in acquiring the same physical memory at each soft reboot. It is possible that KALSR could map the kernel at the previous boot memory location forcing the reserve_mem to return a different memory location. In this case, the previous ring buffer will be lost. Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Ross Zwisler <zwisler@google.com> Cc: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/20240815082811.669f7d8c@gandalf.local.home Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | Merge tag 'v6.11-rc3' into trace/ring-buffer/coreSteven Rostedt2024-08-1432-479/+812
| |\ \ \ \ \ | | | |_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "reserve_mem" kernel command line parameter has been pulled into v6.11. Merge the latest -rc3 to allow the persistent ring buffer memory to be able to be mapped at the address specified by the "reserve_mem" command line parameter. Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
| * | | | | tracing: Add option to use memmapped memory for trace boot instanceSteven Rostedt (Google)2024-06-141-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an option to the trace_instance kernel command line parameter that allows it to use the reserved memory from memmap boot parameter. memmap=12M$0x284500000 trace_instance=boot_mapped@0x284500000:12M The above will reserves 12 megs at the physical address 0x284500000. The second parameter will create a "boot_mapped" instance and use the memory reserved as the memory for the ring buffer. That will create an instance called "boot_mapped": /sys/kernel/tracing/instances/boot_mapped Note, because the ring buffer is using a defined memory ranged, it will act just like a memory mapped ring buffer. It will not have a snapshot buffer, as it can't swap out the buffer. The snapshot files as well as any tracers that uses a snapshot will not be present in the boot_mapped instance. Link: https://lkml.kernel.org/r/20240612232026.329660169@goodmis.org Cc: linux-mm@kvack.org Cc: Masami Hiramatsu <mhiramat@kernel.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Vincent Donnefort <vdonnefort@google.com> Cc: Joel Fernandes <joel@joelfernandes.org> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Ingo Molnar <mingo@kernel.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vineeth Pillai <vineeth@bitbyteword.org> Cc: Youssef Esmat <youssefesmat@google.com> Cc: Beau Belgrave <beaub@linux.microsoft.com> Cc: Alexander Graf <graf@amazon.com> Cc: Baoquan He <bhe@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: "Paul E. McKenney" <paulmck@kernel.org> Cc: David Howells <dhowells@redhat.com> Cc: Mike Rapoport <rppt@kernel.org> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Tony Luck <tony.luck@intel.com> Cc: Guenter Roeck <linux@roeck-us.net> Cc: Ross Zwisler <zwisler@google.com> Cc: Kees Cook <keescook@chromium.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
* | | | | | Merge tag 'mm-nonmm-stable-2024-09-21-07-52' of ↵Linus Torvalds2024-09-211-2/+3
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-MM updates from Andrew Morton: "Many singleton patches - please see the various changelogs for details. Quite a lot of nilfs2 work this time around. Notable patch series in this pull request are: - "mul_u64_u64_div_u64: new implementation" by Nicolas Pitre, with assistance from Uwe Kleine-König. Reimplement mul_u64_u64_div_u64() to provide (much) more accurate results. The current implementation was causing Uwe some issues in the PWM drivers. - "xz: Updates to license, filters, and compression options" from Lasse Collin. Miscellaneous maintenance and kinor feature work to the xz decompressor. - "Fix some GDB command error and add some GDB commands" from Kuan-Ying Lee. Fixes and enhancements to the gdb scripts. - "treewide: add missing MODULE_DESCRIPTION() macros" from Jeff Johnson. Adds lots of MODULE_DESCRIPTIONs, thus fixing lots of warnings about this. - "nilfs2: add support for some common ioctls" from Ryusuke Konishi. Adds various commonly-available ioctls to nilfs2. - "This series fixes a number of formatting issues in kernel doc comments" from Ryusuke Konishi does that. - "nilfs2: prevent unexpected ENOENT propagation" from Ryusuke Konishi. Fix issues where -ENOENT was being unintentionally and inappropriately returned to userspace. - "nilfs2: assorted cleanups" from Huang Xiaojia. - "nilfs2: fix potential issues with empty b-tree nodes" from Ryusuke Konishi fixes some issues which can occur on corrupted nilfs2 filesystems. - "scripts/decode_stacktrace.sh: improve error reporting and usability" from Luca Ceresoli does those things" * tag 'mm-nonmm-stable-2024-09-21-07-52' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (103 commits) list: test: increase coverage of list_test_list_replace*() list: test: fix tests for list_cut_position() proc: use __auto_type more treewide: correct the typo 'retun' ocfs2: cleanup return value and mlog in ocfs2_global_read_info() nilfs2: remove duplicate 'unlikely()' usage nilfs2: fix potential oob read in nilfs_btree_check_delete() nilfs2: determine empty node blocks as corrupted nilfs2: fix potential null-ptr-deref in nilfs_btree_insert() user_namespace: use kmemdup_array() instead of kmemdup() for multiple allocation tools/mm: rm thp_swap_allocator_test when make clean squashfs: fix percpu address space issues in decompressor_multi_percpu.c lib: glob.c: added null check for character class nilfs2: refactor nilfs_segctor_thread() nilfs2: use kthread_create and kthread_stop for the log writer thread nilfs2: remove sc_timer_task nilfs2: do not repair reserved inode bitmap in nilfs_new_inode() nilfs2: eliminate the shared counter and spinlock for i_generation nilfs2: separate inode type information from i_state field nilfs2: use the BITS_PER_LONG macro ...
| * | | | | | Document/kexec: generalize crash hotplug descriptionSourabh Jain2024-09-021-2/+3
| | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 79365026f869 ("crash: add a new kexec flag for hotplug support") generalizes the crash hotplug support to allow architectures to update multiple kexec segments on CPU/Memory hotplug and not just elfcorehdr. Therefore, update the relevant kernel documentation to reflect the same. No functional change. Link: https://lkml.kernel.org/r/20240812041651.703156-1-sourabhjain@linux.ibm.com Signed-off-by: Sourabh Jain <sourabhjain@linux.ibm.com> Reviewed-by: Petr Tesarik <ptesarik@suse.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Petr Tesarik <petr@tesarici.cz> Cc: Sourabh Jain <sourabhjain@linux.ibm.com> Cc: Jonathan Corbet <corbet@lwn.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | | | Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds2024-09-217-45/+196
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
| * | | | | | zram: support priority parameter in recompressionSergey Senozhatsky2024-09-101-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | recompress device attribute supports alg=NAME parameter so that we can specify only one particular algorithm we want to perform recompression with. However, with algo params we now can have several exactly same secondary algorithms but each with its own params tuning (e.g. priority 1 configured to use more aggressive level, and priority 2 configured to use a pre-trained dictionary). Support priority=NUM parameter so that we can correctly determine which secondary algorithm we want to use. Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | Documentation/zram: add documentation for algorithm parametersSergey Senozhatsky2024-09-101-8/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Document brief description of compression algorithms' parameters: compression level and pre-trained dictionary. [senozhatsky@chromium.org: trivial fixup] Link: https://lkml.kernel.org/r/20240903063722.1603592-1-senozhatsky@chromium.org Link: https://lkml.kernel.org/r/20240902105656.1383858-24-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | zram: introduce algorithm_params device attributeSergey Senozhatsky2024-09-101-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This attribute is used to setup compression algorithms' parameters, so we can tweak algorithms' characteristics. At this point only 'level' is supported (to be extended in the future). Each call sets up parameters for one particular algorithm, which should be specified either by the algorithm's priority or algo name. This is expected to be called after corresponding algorithm is selected via comp_algorithm or recomp_algorithm. echo "priority=0 level=1" > /sys/block/zram0/algorithm_params or echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | zram: introduce custom comp backends APISergey Senozhatsky2024-09-101-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Moving to custom backends implementation gives us ability to have our own minimalistic and extendable API, and algorithms tunings becomes possible. The list of compression backends is empty at this point, we will add backends in the followup patches. Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Nick Terrell <terrelln@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm: add sysfs entry to disable splitting underused THPsUsama Arif2024-09-101-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If disabled, THPs faulted in or collapsed will not be added to _deferred_list, and therefore won't be considered for splitting under memory pressure if underused. Link: https://lkml.kernel.org/r/20240830100438.3623486-7-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Rik van Riel <riel@surriel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Yu Zhao <yuzhao@google.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm: split underused THPsUsama Arif2024-09-101-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is an attempt to mitigate the issue of running out of memory when THP is always enabled. During runtime whenever a THP is being faulted in (__do_huge_pmd_anonymous_page) or collapsed by khugepaged (collapse_huge_page), the THP is added to _deferred_list. Whenever memory reclaim happens in linux, the kernel runs the deferred_split shrinker which goes through the _deferred_list. If the folio was partially mapped, the shrinker attempts to split it. If the folio is not partially mapped, the shrinker checks if the THP was underused, i.e. how many of the base 4K pages of the entire THP were zero-filled. If this number goes above a certain threshold (decided by /sys/kernel/mm/transparent_hugepage/khugepaged/max_ptes_none), the shrinker will attempt to split that THP. Then at remap time, the pages that were zero-filled are mapped to the shared zeropage, hence saving memory. Link: https://lkml.kernel.org/r/20240830100438.3623486-6-usamaarif642@gmail.com Signed-off-by: Usama Arif <usamaarif642@gmail.com> Suggested-by: Rik van Riel <riel@surriel.com> Co-authored-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Alexander Zhu <alexlzhu@fb.com> Cc: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Kairui Song <ryncsn@gmail.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Nico Pache <npache@redhat.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuang Zhai <zhais@google.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Shuang Zhai <szhai2@cs.rochester.edu> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | Docs/damon: use damonitor GitHub organization instead of awslabsSeongJae Park2024-09-102-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Docs/damon: update GitHub repo URLs and maintainer-profile". Replace GitHub URLS on DAMON documents for none-kernel parts DAMON repos with new ones[1] via the first patch. With following two patches, wordsmith maitnainer-profile for better readability, and document the Google clendsar for bi-weekly meetups, respectively. [1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org This patch (of 3): GitHub repos for non-kernel parts of DAMON project including 'damo', 'damon-tests' and 'damoos' will be moved[1] from 'awslabs' org to 'damonitor', by 2024-09-05. Update related URLs in kernel tree. [1] https://lore.kernel.org/20240813232158.83903-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826015741.80707-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240826015741.80707-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Alex Shi <alexs@kernel.org> Cc: Hu Haowen <2023002089@link.tyut.edu.cn> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Yanteng Si <siyanteng@loongson.cn> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm: count the number of partially mapped anonymous THPs per sizeBarry Song2024-09-101-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a THP is added to the deferred_list due to partially mapped, its partial pages are unused, leading to wasted memory and potentially increasing memory reclamation pressure. Detailing the specifics of how unmapping occurs is quite difficult and not that useful, so we adopt a simple approach: each time a THP enters the deferred_list, we increment the count by 1; whenever it leaves for any reason, we decrement the count by 1. Link: https://lkml.kernel.org/r/20240824010441.21308-3-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuai Yuan <yuanshuai@oppo.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm: count the number of anonymous THPs per sizeBarry Song2024-09-101-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: count the number of anonymous THPs per size", v4. Knowing the number of transparent anon THPs in the system is crucial for performance analysis. It helps in understanding the ratio and distribution of THPs versus small folios throughout the system. Additionally, partial unmapping by userspace can lead to significant waste of THPs over time and increase memory reclamation pressure. We need this information for comprehensive system tuning. This patch (of 2): Let's track for each anonymous THP size, how many of them are currently allocated. We'll track the complete lifespan of an anon THP, starting when it becomes an anon THP ("large anon folio") (->mapping gets set), until it gets freed (->mapping gets cleared). Introduce a new "nr_anon" counter per THP size and adjust the corresponding counter in the following cases: * We allocate a new THP and call folio_add_new_anon_rmap() to map it the first time and turn it into an anon THP. * We split an anon THP into multiple smaller ones. * We migrate an anon THP, when we prepare the destination. * We free an anon THP back to the buddy. Note that AnonPages in /proc/meminfo currently tracks the total number of *mapped* anonymous *pages*, and therefore has slightly different semantics. In the future, we might also want to track "nr_anon_mapped" for each THP size, which might be helpful when comparing it to the number of allocated anon THPs (long-term pinning, stuck in swapcache, memory leaks, ...). Further note that for now, we only track anon THPs after they got their ->mapping set, for example via folio_add_new_anon_rmap(). If we would allocate some in the swapcache, they will only show up in the statistics for now after they have been mapped to user space the first time, where we call folio_add_new_anon_rmap(). [akpm@linux-foundation.org: documentation fixups, per David] Link: https://lkml.kernel.org/r/3e8add35-e26b-443b-8a04-1078f4bc78f6@redhat.com Link: https://lkml.kernel.org/r/20240824010441.21308-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240824010441.21308-2-21cnbao@gmail.com Signed-off-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Chris Li <chrisl@kernel.org> Cc: Chuanhua Han <hanchuanhua@oppo.com> Cc: Kairui Song <kasong@tencent.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Shuai Yuan <yuanshuai@oppo.com> Cc: Usama Arif <usamaarif642@gmail.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | Documentation/cgroup-v2: clarify that zswap.writeback is ignored if zswap is ↵Mike Yuan2024-09-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | disabled As discussed in [1], zswap-related settings natually lose their effect when zswap is disabled, specifically zswap.writeback here. Be explicit about this behavior. [1] https://lore.kernel.org/linux-kernel/CAKEwX=Mhbwhh-=xxCU-RjMXS_n=RpV3Gtznb2m_3JgL+jzz++g@mail.gmail.com/ [akpm@linux-foundation.org: fix/simplify text] Link: https://lkml.kernel.org/r/20240823162506.12117-3-me@yhndnzj.com Signed-off-by: Mike Yuan <me@yhndnzj.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Michal Koutný <mkoutny@suse.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm,memcg: provide per-cgroup counters for NUMA balancing operationsKaiyang Zhao2024-09-041-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ability to observe the demotion and promotion decisions made by the kernel on a per-cgroup basis is important for monitoring and tuning containerized workloads on machines equipped with tiered memory. Different containers in the system may experience drastically different memory tiering actions that cannot be distinguished from the global counters alone. For example, a container running a workload that has a much hotter memory accesses will likely see more promotions and fewer demotions, potentially depriving a colocated container of top tier memory to such an extent that its performance degrades unacceptably. For another example, some containers may exhibit longer periods between data reuse, causing much more numa_hint_faults than numa_pages_migrated. In this case, tuning hot_threshold_ms may be appropriate, but the signal can easily be lost if only global counters are available. In the long term, we hope to introduce per-cgroup control of promotion and demotion actions to implement memory placement policies in tiering. This patch set adds seven counters to memory.stat in a cgroup: numa_pages_migrated, numa_pte_updates, numa_hint_faults, pgdemote_kswapd, pgdemote_khugepaged, pgdemote_direct and pgpromote_success. pgdemote_* and pgpromote_success are also available in memory.numa_stat. count_memcg_events_mm() is added to count multiple event occurrences at once, and get_mem_cgroup_from_folio() is added because we need to get a reference to the memcg of a folio before it's migrated to track numa_pages_migrated. The accounting of PGDEMOTE_* is moved to shrink_inactive_list() before being changed to per-cgroup. [kaiyang2@cs.cmu.edu: add documentation of the memcg counters in cgroup-v2.rst] Link: https://lkml.kernel.org/r/20240814235122.252309-1-kaiyang2@cs.cmu.edu Link: https://lkml.kernel.org/r/20240814174227.30639-1-kaiyang2@cs.cmu.edu Signed-off-by: Kaiyang Zhao <kaiyang2@cs.cmu.edu> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Wei Xu <weixugc@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | docs: move numa=fake description to kernel-parameters.txtMike Rapoport (Microsoft)2024-09-041-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NUMA emulation can be now enabled on arm64 and riscv in addition to x86. Move description of numa=fake parameters from x86 documentation of admin-guide/kernel-parameters.txt Link: https://lkml.kernel.org/r/20240807064110.1003856-27-rppt@kernel.org Suggested-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Tested-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> [arm64 + CXL via QEMU] Acked-by: Dan Williams <dan.j.williams@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: David S. Miller <davem@davemloft.net> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Huacai Chen <chenhuacai@kernel.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jiaxun Yang <jiaxun.yang@flygoat.com> Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Rafael J. Wysocki <rafael@kernel.org> Cc: Rob Herring (Arm) <robh@kernel.org> Cc: Samuel Holland <samuel.holland@sifive.com> Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | memcg: initiate deprecation of pressure_levelShakeel Butt2024-09-021-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pressure_level in memcg v1 provides memory pressure notifications to the user space. At the moment it provides notifications for three levels of memory pressure i.e. low, medium and critical, which are defined based on internal memory reclaim implementation details. More specifically the ratio of scanned and reclaimed pages during a memory reclaim. However this is not robust as there are workloads with mostly unreclaimable user memory or kernel memory. For v2, the users can use PSI for memory pressure status of the system or the cgroup. Let's start the deprecation process for pressure_level and add warnings to gather the info on how the current users are using this interface and how they can be used to PSI. Link: https://lkml.kernel.org/r/20240814220021.3208384-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | memcg: initiate deprecation of oom_controlShakeel Butt2024-09-021-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The oom_control provides functionality to disable memcg oom-killer, notifications on oom-kill and reading the stats regarding oom-kills. This interface was mainly introduced to provide functionality for userspace oom-killers. However it is not robust enough and only supports OOM handling in the page fault path. For v2, the users can use the combination of memory.events notifications, memory.high and PSI to provide userspace OOM-killing functionality. Actually LMKD in Android and OOMd in systemd and Meta infrastructure already use PSI in combination with other stats to implement userspace OOM-killing. Let's start the deprecation process for v1 and gather the info on how the current users are using this interface and work on providing a more robust functionality in v2. Link: https://lkml.kernel.org/r/20240814220021.3208384-4-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | memcg: initiate deprecation of v1 soft limitShakeel Butt2024-09-021-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Memcg v1 provides soft limit functionality for the best effort memory sharing between multiple workloads on a system. It is usually triggered through kswapd and at the moment does not reclaim kernel memory. Memcg v2 provides more straightforward best effort (memory.low) and hard protection (memory.min) functionalities. Let's initiate the deprecation of soft limit from v1 and gather if v2 needs something more to move the existing v1 users to v2 regarding soft limit. Link: https://lkml.kernel.org/r/20240814220021.3208384-3-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | memcg: initiate deprecation of v1 tcp accountingShakeel Butt2024-09-021-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "memcg: initiate deprecation of v1 features", v2. Start the deprecation process of the memcg v1 features which we discussed during LSFMMBPF 2024 [1]. For now add the warnings to collect the information on how the current users are using these features. Next we will work on providing better alternatives in v2 (if needed) and fully deprecate these features. Link: https://lwn.net/Articles/974575 [1] This patch (of 4): Memcg v1 provides opt-in TCP memory accounting feature. However it is mostly unused due to its performance impact on the network traffic. In v2, the TCP memory is accounted in the regular memory usage and is transparent to the users but they can observe the TCP memory usage through memcg stats. Let's initiate the deprecation process of memcg v1's tcp accounting functionality and add warnings to gather if there are any users and if there are, collect how they are using it and plan to provide them better alternative in v2. Link: https://lkml.kernel.org/r/20240814220021.3208384-1-shakeel.butt@linux.dev Link: https://lkml.kernel.org/r/20240814220021.3208384-2-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm: override mTHP "enabled" defaults at kernel cmdlineRyan Roberts2024-09-022-7/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add thp_anon= cmdline parameter to allow specifying the default enablement of each supported anon THP size. The parameter accepts the following format and can be provided multiple times to configure each size: thp_anon=<size>,<size>[KMG]:<value>;<size>-<size>[KMG]:<value> An example: thp_anon=16K-64K:always;128K,512K:inherit;256K:madvise;1M-2M:never See Documentation/admin-guide/mm/transhuge.rst for more details. Configuring the defaults at boot time is useful to allow early user space to take advantage of mTHP before its been configured through sysfs. [v-songbaohua@oppo.com: use get_oder() and check size is is_power_of_2] Link: https://lkml.kernel.org/r/20240814224635.43272-1-21cnbao@gmail.com [ryan.roberts@arm.com: some minor cleanup according to David's comments] Link: https://lkml.kernel.org/r/20240820105244.62703-1-21cnbao@gmail.com Link: https://lkml.kernel.org/r/20240814020247.67297-1-21cnbao@gmail.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Co-developed-by: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Barry Song <v-songbaohua@oppo.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Lance Yang <ioworker0@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | | | mm, memcg: cg2 memory{.swap,}.peak write handlersDavid Finkel2024-09-021-8/+14
| |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm, memcg: cg2 memory{.swap,}.peak write handlers", v7. This patch (of 2): Other mechanisms for querying the peak memory usage of either a process or v1 memory cgroup allow for resetting the high watermark. Restore parity with those mechanisms, but with a less racy API. For example: - Any write to memory.max_usage_in_bytes in a cgroup v1 mount resets the high watermark. - writing "5" to the clear_refs pseudo-file in a processes's proc directory resets the peak RSS. This change is an evolution of a previous patch, which mostly copied the cgroup v1 behavior, however, there were concerns about races/ownership issues with a global reset, so instead this change makes the reset filedescriptor-local. Writing any non-empty string to the memory.peak and memory.swap.peak pseudo-files reset the high watermark to the current usage for subsequent reads through that same FD. Notably, following Johannes's suggestion, this implementation moves the O(FDs that have written) behavior onto the FD write(2) path. Instead, on the page-allocation path, we simply add one additional watermark to conditionally bump per-hierarchy level in the page-counter. Additionally, this takes Longman's suggestion of nesting the page-charging-path checks for the two watermarks to reduce the number of common-case comparisons. This behavior is particularly useful for work scheduling systems that need to track memory usage of worker processes/cgroups per-work-item. Since memory can't be squeezed like CPU can (the OOM-killer has opinions), these systems need to track the peak memory usage to compute system/container fullness when binpacking workitems. Most notably, Vimeo's use-case involves a system that's doing global binpacking across many Kubernetes pods/containers, and while we can use PSI for some local decisions about overload, we strive to avoid packing workloads too tightly in the first place. To facilitate this, we track the peak memory usage. However, since we run with long-lived workers (to amortize startup costs) we need a way to track the high watermark while a work-item is executing. Polling runs the risk of missing short spikes that last for timescales below the polling interval, and peak memory tracking at the cgroup level is otherwise perfect for this use-case. As this data is used to ensure that binpacked work ends up with sufficient headroom, this use-case mostly avoids the inaccuracies surrounding reclaimable memory. Link: https://lkml.kernel.org/r/20240730231304.761942-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-1-davidf@vimeo.com Link: https://lkml.kernel.org/r/20240729143743.34236-2-davidf@vimeo.com Signed-off-by: David Finkel <davidf@vimeo.com> Suggested-by: Johannes Weiner <hannes@cmpxchg.org> Suggested-by: Waiman Long <longman@redhat.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Michal Koutný <mkoutny@suse.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Shakeel Butt <shakeel.butt@linux.dev> Cc: Shuah Khan <shuah@kernel.org> Cc: Zefan Li <lizefan.x@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | | | Merge tag 'ext4_for_linus-6.12-rc1' of ↵Linus Torvalds2024-09-211-10/+0
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4 Pull ext4 updates from Ted Ts'o: "Lots of cleanups and bug fixes this cycle, primarily in the block allocation, extent management, fast commit, and journalling" * tag 'ext4_for_linus-6.12-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (93 commits) ext4: convert EXT4_B2C(sbi->s_stripe) users to EXT4_NUM_B2C ext4: check stripe size compatibility on remount as well ext4: fix i_data_sem unlock order in ext4_ind_migrate() ext4: remove the special buffer dirty handling in do_journal_get_write_access ext4: fix a potential assertion failure due to improperly dirtied buffer ext4: hoist ext4_block_write_begin and replace the __block_write_begin ext4: persist the new uptodate buffers in ext4_journalled_zero_new_buffers ext4: dax: keep orphan list before truncate overflow allocated blocks ext4: fix error message when rejecting the default hash ext4: save unnecessary indentation in ext4_ext_create_new_leaf() ext4: make some fast commit functions reuse extents path ext4: refactor ext4_swap_extents() to reuse extents path ext4: get rid of ppath in convert_initialized_extent() ext4: get rid of ppath in ext4_ext_handle_unwritten_extents() ext4: get rid of ppath in ext4_ext_convert_to_initialized() ext4: get rid of ppath in ext4_convert_unwritten_extents_endio() ext4: get rid of ppath in ext4_split_convert_extents() ext4: get rid of ppath in ext4_split_extent() ext4: get rid of ppath in ext4_force_split_extent_at() ext4: get rid of ppath in ext4_split_extent_at() ...
| * | | | | | Documentation: ext4.rst: remove obsolete descriptions of noacl/nouser_xattr ↵Stefan Tauner2024-08-271-10/+0
| |/ / / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | options These have been deprecated for a decade[1] and removed two years ago[2]. 1: f70486055ee351158bd6999f3965ad378b52c694 2: 2d544ec923dbe5fbed64a7f43dccf527218380bc Signed-off-by: Stefan Tauner <stefan.tauner@gmx.at> Link: https://patch.msgid.link/20240728003433.2566649-1-stefan.tauner@gmx.at Signed-off-by: Theodore Ts'o <tytso@mit.edu>
* | | | | | Merge tag 'platform-drivers-x86-v6.12-1' of ↵Linus Torvalds2024-09-191-0/+59
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86 Pull x86 platform drivers updates from Hans de Goede: - asus-wmi: Add support for vivobook fan profiles - dell-laptop: Add knobs to change battery charge settings - lg-laptop: Add operation region support - intel-uncore-freq: Add support for efficiency latency control - intel/ifs: Add SBAF test support - intel/pmc: Ignore all LTRs during suspend - platform/surface: Support for arm64 based Surface devices - wmi: Pass event data directly to legacy notify handlers - x86/platform/geode: switch GPIO buttons and LEDs to software properties - bunch of small cleanups, fixes, hw-id additions, etc. * tag 'platform-drivers-x86-v6.12-1' of git://git.kernel.org/pub/scm/linux/kernel/git/pdx86/platform-drivers-x86: (65 commits) MAINTAINERS: adjust file entry in INTEL MID PLATFORM platform/x86: x86-android-tablets: Adjust Xiaomi Pad 2 bottom bezel touch buttons LED platform/mellanox: mlxbf-pmc: fix lockdep warning platform/x86/amd: pmf: Add quirk for TUF Gaming A14 platform/x86: touchscreen_dmi: add nanote-next quirk platform/x86: asus-wmi: don't fail if platform_profile already registered platform/x86: asus-wmi: add debug print in more key places platform/x86: intel_scu_wdt: Move intel_scu_wdt.h to x86 subfolder platform/x86: intel_scu_ipc: Move intel_scu_ipc.h out of arch/x86/include/asm MAINTAINERS: Add Intel MID section platform/x86: panasonic-laptop: Add support for programmable buttons platform/olpc: Remove redundant null pointer checks in olpc_ec_setup_debugfs() platform/x86: intel/pmc: Ignore all LTRs during suspend platform/x86: wmi: Call both legacy and WMI driver notify handlers platform/x86: wmi: Merge get_event_data() with wmi_get_notify_data() platform/x86: wmi: Remove wmi_get_event_data() platform/x86: wmi: Pass event data directly to legacy notify handlers platform/x86: thinkpad_acpi: Fix uninitialized symbol 's' warning platform/x86: x86-android-tablets: Fix spelling in the comments platform/x86: ideapad-laptop: Make the scope_guard() clear of its scope ...
| * | | | | | Merge tag 'hwmon-for-v6.11-rc7' into review-hansHans de Goede2024-09-053-11/+10
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge "hwmon fixes for v6.11-rc7" into review-hans to bring in commit a54da9df75cd ("hwmon: (hp-wmi-sensors) Check if WMI event data exists"). This is a dependency for a set of WMI event data refactoring changes.
| * | | | | | Documentation: admin-guide: pm: Add efficiency vs. latency tradeoff to ↵Tero Kristo2024-09-041-0/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | uncore documentation Added documentation about the functionality of efficiency vs. latency tradeoff control in intel Xeon processors, and how this is configured via sysfs. Signed-off-by: Tero Kristo <tero.kristo@linux.intel.com> Reviewed-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Link: https://lore.kernel.org/r/20240828153657.1296410-2-tero.kristo@linux.intel.com Signed-off-by: Hans de Goede <hdegoede@redhat.com>
* | | | | | | Merge tag 'iommu-updates-v6.12' of ↵Linus Torvalds2024-09-181-6/+11
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux Pull iommu updates from Joerg Roedel: "Core changes: - Allow ATS on VF when parent device is identity mapped - Optimize unmap path on ARM io-pagetable implementation - Use of_property_present() ARM-SMMU changes: - SMMUv2: - Devicetree binding updates for Qualcomm MMU-500 implementations - Extend workarounds for broken Qualcomm hypervisor to avoid touching features that are not available (e.g. 16KiB page support, reserved context banks) - SMMUv3: - Support for NVIDIA's custom virtual command queue hardware - Fix Stage-2 stall configuration and extend tests to cover this area - A bunch of driver cleanups, including simplification of the master rbtree code - Minor cleanups and fixes across both drivers Intel VT-d changes: - Retire si_domain and convert to use static identity domain - Batched IOTLB/dev-IOTLB invalidation - Small code refactoring and cleanups AMD-Vi changes: - Cleanup and refactoring of io-pagetable code - Add parameter to limit the used io-pagesizes - Other cleanups and fixes" * tag 'iommu-updates-v6.12' of git://git.kernel.org/pub/scm/linux/kernel/git/iommu/linux: (77 commits) dt-bindings: arm-smmu: Add compatible for QCS8300 SoC iommu/amd: Test for PAGING domains before freeing a domain iommu/amd: Fix argument order in amd_iommu_dev_flush_pasid_all() iommu/amd: Add kernel parameters to limit V1 page-sizes iommu/arm-smmu-v3: Reorganize struct arm_smmu_ctx_desc_cfg iommu/arm-smmu-v3: Add types for each level of the CD table iommu/arm-smmu-v3: Shrink the cdtab l1_desc array iommu/arm-smmu-v3: Do not use devm for the cd table allocations iommu/arm-smmu-v3: Remove strtab_base/cfg iommu/arm-smmu-v3: Reorganize struct arm_smmu_strtab_cfg iommu/arm-smmu-v3: Add types for each level of the 2 level stream table iommu/arm-smmu-v3: Add arm_smmu_strtab_l1/2_idx() iommu/arm-smmu-qcom: apply num_context_bank fixes for SDM630 / SDM660 iommu/arm-smmu-v3: Use the new rb tree helpers dt-bindings: arm-smmu: document the support on SA8255p iommu/tegra241-cmdqv: Do not allocate vcmdq until dma_set_mask_and_coherent iommu/tegra241-cmdqv: Drop static at local variable iommu/tegra241-cmdqv: Fix ioremap() error handling in probe() iommu/amd: Do not set the D bit on AMD v2 table entries iommu/amd: Correct the reported page sizes from the V1 table ...