| Commit message (Collapse) | Author | Files | Lines |
|
Tracing tools like perf and trace-cmd read the /sys/kernel/tracing/events/*/*/format
files to know how to parse the data and also how to print it. For the
"print fmt" portion of that file, if anything uses an enum that is not
exported to the tracing system, user space will not be able to parse it.
The GFP flags use to be defines, and defines get translated in the print
fmt sections. But now they are converted to use enums, which is not.
The mm_page_alloc trace event format use to have:
print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s",
REC->pfn != -1UL ? (((struct page *)vmemmap_base) + (REC->pfn)) : ((void
*)0), REC->pfn != -1UL ? REC->pfn : 0, REC->order, REC->migratetype,
(REC->gfp_flags) ? __print_flags(REC->gfp_flags, "|", {( unsigned
long)(((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) |
(( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) |
(( gfp_t)0x40000u) | (( gfp_t)0x80000u) | (( gfp_t)0x2000u)) & ~((
gfp_t)(0x400u|0x800u))) | (( gfp_t)0x400u)), "GFP_TRANSHUGE"}, {( unsigned
long)((((((( gfp_t)(0x400u|0x800u)) | (( gfp_t)0x40u) | (( gfp_t)0x80u) |
(( gfp_t)0x100000u)) | (( gfp_t)0x02u)) | (( gfp_t)0x08u) | (( gfp_t)0)) ...
Where the GFP values are shown and not their names. But after the GFP
flags were converted to use enums, it has:
print fmt: "page=%p pfn=0x%lx order=%d migratetype=%d gfp_flags=%s",
REC->pfn != -1UL ? (vmemmap + (REC->pfn)) : ((void *)0), REC->pfn != -1UL
? REC->pfn : 0, REC->order, REC->migratetype, (REC->gfp_flags) ?
__print_flags(REC->gfp_flags, "|", {( unsigned long)((((((((
gfp_t)(((((1UL))) << (___GFP_DIRECT_RECLAIM_BIT))|((((1UL))) <<
(___GFP_KSWAPD_RECLAIM_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_IO_BIT)))
| (( gfp_t)((((1UL))) << (___GFP_FS_BIT))) | (( gfp_t)((((1UL))) <<
(___GFP_HARDWALL_BIT)))) | (( gfp_t)((((1UL))) << (___GFP_HIGHMEM_BIT))))
| (( gfp_t)((((1UL))) << (___GFP_MOVABLE_BIT))) | (( gfp_t)0)) | ((
gfp_t)((((1UL))) << (___GFP_COMP_BIT))) ...
Where the enums names like ___GFP_KSWAPD_RECLAIM_BIT are shown and not their
values. User space has no way to convert these names to their values and
the output will fail to parse. What is shown is now:
mm_page_alloc: page=0xffffffff981685f3 pfn=0x1d1ac1 order=0 migratetype=1 gfp_flags=0x140cca
The TRACE_DEFINE_ENUM() macro was created to handle enums in the print fmt
files. This causes them to be replaced at boot up with the numbers, so
that user space tooling can parse it. By using this macro, the output is
back to the human readable:
mm_page_alloc: page=0xffffffff981685f3 pfn=0x122233 order=0 migratetype=1 gfp_flags=GFP_HIGHUSER_MOVABLE|__GFP_COMP
Cc: stable@vger.kernel.org
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Veronika Molnarova <vmolnaro@redhat.com>
Cc: Suren Baghdasaryan <surenb@google.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: https://lore.kernel.org/20250116214438.749504792@goodmis.org
Reported-by: Michael Petlan <mpetlan@redhat.com>
Closes: https://lore.kernel.org/all/87be5f7c-1a0-dad-daa0-54e342efaea7@redhat.com/
Fixes: 772dd0342727c ("mm: enumerate all gfp flags")
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|
RING_CMD_CCTL read index should be UC on iGPU parts due to L3 caching
structure. Having this as WB blocks ULLS from being enabled. Change to
UC to unblock ULLS on iGPU.
v2:
- Drop internal communications commnet, bspec is updated
Cc: Balasubramani Vivekanandan <balasubramani.vivekanandan@intel.com>
Cc: Michal Mrozek <michal.mrozek@intel.com>
Cc: Paulo Zanoni <paulo.r.zanoni@intel.com>
Cc: José Roberto de Souza <jose.souza@intel.com>
Cc: stable@vger.kernel.org
Fixes: 328e089bfb37 ("drm/xe: Leverage ComputeCS read L3 caching")
Signed-off-by: Matthew Brost <matthew.brost@intel.com>
Acked-by: Michal Mrozek <michal.mrozek@intel.com>
Reviewed-by: Stuart Summers <stuart.summers@intel.com>
Reviewed-by: Matt Roper <matthew.d.roper@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250114002507.114087-1-matthew.brost@intel.com
(cherry picked from commit 758debf35b9cda5450e40996991a6e4b222899bd)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
In order to allow serialize() to be used from noinstr code, make it
__always_inline.
Fixes: 0ef8047b737d ("x86/static-call: provide a way to do very early static-call updates")
Closes: https://lore.kernel.org/oe-kbuild-all/202412181756.aJvzih2K-lkp@intel.com/
Reported-by: kernel test robot <lkp@intel.com>
Signed-off-by: Juergen Gross <jgross@suse.com>
Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de>
Link: https://lore.kernel.org/r/20241218100918.22167-1-jgross@suse.com
|
|
Currently imx8mp_blk_ctrl_remove() will continue the for loop
until an out-of-bounds exception occurs.
pstate: 60000005 (nZCv daif -PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : dev_pm_domain_detach+0x8/0x48
lr : imx8mp_blk_ctrl_shutdown+0x58/0x90
sp : ffffffc084f8bbf0
x29: ffffffc084f8bbf0 x28: ffffff80daf32ac0 x27: 0000000000000000
x26: ffffffc081658d78 x25: 0000000000000001 x24: ffffffc08201b028
x23: ffffff80d0db9490 x22: ffffffc082340a78 x21: 00000000000005b0
x20: ffffff80d19bc180 x19: 000000000000000a x18: ffffffffffffffff
x17: ffffffc080a39e08 x16: ffffffc080a39c98 x15: 4f435f464f006c72
x14: 0000000000000004 x13: ffffff80d0172110 x12: 0000000000000000
x11: ffffff80d0537740 x10: ffffff80d05376c0 x9 : ffffffc0808ed2d8
x8 : ffffffc084f8bab0 x7 : 0000000000000000 x6 : 0000000000000000
x5 : ffffff80d19b9420 x4 : fffffffe03466e60 x3 : 0000000080800077
x2 : 0000000000000000 x1 : 0000000000000001 x0 : 0000000000000000
Call trace:
dev_pm_domain_detach+0x8/0x48
platform_shutdown+0x2c/0x48
device_shutdown+0x158/0x268
kernel_restart_prepare+0x40/0x58
kernel_kexec+0x58/0xe8
__do_sys_reboot+0x198/0x258
__arm64_sys_reboot+0x2c/0x40
invoke_syscall+0x5c/0x138
el0_svc_common.constprop.0+0x48/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x38/0xc8
el0t_64_sync_handler+0x120/0x130
el0t_64_sync+0x190/0x198
Code: 8128c2d0 ffffffc0 aa1e03e9 d503201f
Fixes: 556f5cf9568a ("soc: imx: add i.MX8MP HSIO blk-ctrl")
Cc: stable@vger.kernel.org
Signed-off-by: Xiaolei Wang <xiaolei.wang@windriver.com>
Reviewed-by: Lucas Stach <l.stach@pengutronix.de>
Reviewed-by: Fabio Estevam <festevam@gmail.com>
Reviewed-by: Frank Li <Frank.Li@nxp.com>
Link: https://lore.kernel.org/r/20250115014118.4086729-1-xiaolei.wang@windriver.com
Signed-off-by: Ulf Hansson <ulf.hansson@linaro.org>
|
|
Li Li reports that casting away callback type may cause issues
for CFI. Let's generate a small wrapper for each callback,
to make sure compiler sees the anticipated types.
Reported-by: Li Li <dualli@chromium.org>
Link: https://lore.kernel.org/CANBPYPjQVqmzZ4J=rVQX87a9iuwmaetULwbK_5_3YWk2eGzkaA@mail.gmail.com
Fixes: 170aafe35cb9 ("netdev: support binding dma-buf to netdevice")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Mina Almasry <almasrymina@google.com>
Link: https://patch.msgid.link/20250115161436.648646-1-kuba@kernel.org
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Consider a scenario where a CPU transitions from CPUHP_ONLINE to halfway
through a CPU hotunplug down to CPUHP_HRTIMERS_PREPARE, and then back to
CPUHP_ONLINE:
Since hrtimers_prepare_cpu() does not run, cpu_base.hres_active remains set
to 1 throughout. However, during a CPU unplug operation, the tick and the
clockevents are shut down at CPUHP_AP_TICK_DYING. On return to the online
state, for instance CFS incorrectly assumes that the hrtick is already
active, and the chance of the clockevent device to transition to oneshot
mode is also lost forever for the CPU, unless it goes back to a lower state
than CPUHP_HRTIMERS_PREPARE once.
This round-trip reveals another issue; cpu_base.online is not set to 1
after the transition, which appears as a WARN_ON_ONCE in enqueue_hrtimer().
Aside of that, the bulk of the per CPU state is not reset either, which
means there are dangling pointers in the worst case.
Address this by adding a corresponding startup() callback, which resets the
stale per CPU state and sets the online flag.
[ tglx: Make the new callback unconditionally available, remove the online
modification in the prepare() callback and clear the remaining
state in the starting callback instead of the prepare callback ]
Fixes: 5c0930ccaad5 ("hrtimers: Push pending hrtimers away from outgoing CPU earlier")
Signed-off-by: Koichiro Den <koichiro.den@canonical.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220134421.3809834-1-koichiro.den@canonical.com
|
|
The group's ignore flag is:
_ read under the group's lock (idle entry, remote expiry)
_ turned on/off under the group's lock (idle entry, remote expiry)
_ turned on locklessly on idle exit
When idle entry or remote expiry clear the "ignore" flag of a group, the
operation must be synchronized against other concurrent idle entry or
remote expiry to make sure the related group timer is never missed. To
enforce this synchronization, both "ignore" clear and read are
performed under the group lock.
On the contrary, whether idle entry or remote expiry manage to observe
the "ignore" flag turned on by a CPU exiting idle is a matter of
optimization. If that flag set is missed or cleared concurrently, the
worst outcome is a migrator wasting time remotely handling a "ghost"
timer. This is why the ignore flag can be set locklessly.
Unfortunately, the related lockless accesses are bare and miss
appropriate annotations. KCSAN rightfully complains:
BUG: KCSAN: data-race in __tmigr_cpu_activate / print_report
write to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 0:
__tmigr_cpu_activate
tmigr_cpu_activate
timer_clear_idle
tick_nohz_restart_sched_tick
tick_nohz_idle_exit
do_idle
cpu_startup_entry
kernel_init
do_initcalls
clear_bss
reserve_bios_regions
common_startup_64
read to 0xffff88842fc28004 of 1 bytes by task 0 on cpu 1:
print_report
kcsan_report_known_origin
kcsan_setup_watchpoint
tmigr_next_groupevt
tmigr_update_events
tmigr_inactive_up
__walk_groups+0x50/0x77
walk_groups
__tmigr_cpu_deactivate
tmigr_cpu_deactivate
__get_next_timer_interrupt
timer_base_try_to_set_idle
tick_nohz_stop_tick
tick_nohz_idle_stop_tick
cpuidle_idle_call
do_idle
Although the relevant accesses could be marked as data_race(), the
"ignore" flag being read several times within the same
tmigr_update_events() function is confusing and error prone. Prefer
reading it once in that function and make use of similar/paired accesses
elsewhere with appropriate comments when necessary.
Reported-by: kernel test robot <oliver.sang@intel.com>
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250114231507.21672-4-frederic@kernel.org
Closes: https://lore.kernel.org/oe-lkp/202501031612.62e0c498-lkp@intel.com
|
|
Commit 2522c84db513 ("timers/migration: Fix another race between hotplug
and idle entry/exit") fixed yet another race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is yet another situation that remains unhandled:
[GRP0:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \ \
0 1 2..7
idle idle idle
0) The system is fully idle.
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 1
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 1
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. CPU 1 hasn't yet propagated its activation up to GRP1:0.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched and the pre-initialized groupmask of GRP0:0 is also visible.
As a result tmigr_active_up() is called to GRP1:0 with GRP0:0 as active
and migrator. CPU 0 is returning to __walk_groups() but suffers again
a #VMEXIT.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 1 groupmask = 2
/ \ \
0 1 2..7 8
active active idle !online
5) CPU 1 propagates its activation of GRP0:0 to GRP1:0. This has no
effect since CPU 0 did it already.
[GRP1:0]
migrator = GRP0:0
active = GRP0:0, GRP0:1
groupmask = 1
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = CPU 8
active = CPU 0, CPU1 active = CPU 8
groupmask = 1 groupmask = 2
/ \ \ \
0 1 2..7 8
active active idle active
6) CPU 1 links CPU 8 to its group. CPU 8 boots and goes through
CPUHP_AP_TMIGR_ONLINE which propagates activation.
[GRP2:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
7) CPU 64 is booting. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP1:1, GRP0:2 and the new top GRP2:0 connected to
GRP1:1 and GRP1:0. CPU 1 hasn't yet propagated its activation up to
GRP2:0.
[GRP2:0]
migrator = 0 (!!!)
active = NONE
groupmask = 1
/ \
[GRP1:0] [GRP1:1]
migrator = GRP0:0 migrator = TMIGR_NONE
active = GRP0:0, GRP0:1 active = NONE
groupmask = 1 groupmask = 2
/ \
[GRP0:0] [GRP0:1] [GRP0:2]
migrator = CPU 0 migrator = CPU 8 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = CPU 8 active = NONE
groupmask = 1 groupmask = 2 groupmask = 0
/ \ \ \
0 1 2..7 8 64
active active idle active !online
8) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP2:0 is visible and
fetched but the pre-initialized groupmask of GRP1:0 is not because no
ordering made its initialization visible. As a result tmigr_active_up()
may be called to GRP2:0 with a "0" child's groumask. Leaving the timers
ignored for ever when the system is fully idle.
The race is highly theoretical and perhaps impossible in practice but
the groupmask of the child is not the only concern here as the whole
initialization of the child is not guaranteed to be visible to any
tree walker racing against hotplug (idle entry/exit, remote handling,
etc...). Although the current code layout seem to be resilient to such
hazards, this doesn't tell much about the future.
Fix this with enforcing address dependency between group initialization
and the write/read to the group's parent's pointer. Fortunately that
doesn't involve any barrier addition in the fast paths.
Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-3-frederic@kernel.org
|
|
Commit 10a0e6f3d3db ("timers/migration: Move hierarchy setup into
cpuhotplug prepare callback") fixed a race between idle exit and CPU
hotplug up leading to a wrong "0" value migrator assigned to the top
level. However there is still a situation that remains unhandled:
[GRP0:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 0
/ \ \
0 1 2..7
idle idle idle
0) The system is fully idle.
[GRP0:0]
migrator = CPU 0
active = CPU 0
groupmask = 0
/ \ \
0 1 2..7
active idle idle
1) CPU 0 is activating. It has done the cmpxchg on the top's ->migr_state
but it hasn't yet returned to __walk_groups().
[GRP0:0]
migrator = CPU 0
active = CPU 0, CPU 1
groupmask = 0
/ \ \
0 1 2..7
active active idle
2) CPU 1 is activating. CPU 0 stays the migrator (still stuck in
__walk_groups(), delayed by #VMEXIT for example).
[GRP1:0]
migrator = TMIGR_NONE
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
3) CPU 8 is preparing to boot. CPUHP_TMIGR_PREPARE is being ran by CPU 1
which has created the GRP0:1 and the new top GRP1:0 connected to GRP0:1
and GRP0:0. The groupmask of GRP0:0 is now 2. CPU 1 hasn't yet
propagated its activation up to GRP1:0.
[GRP1:0]
migrator = 0 (!!!)
active = NONE
groupmask = 0
/ \
[GRP0:0] [GRP0:1]
migrator = CPU 0 migrator = TMIGR_NONE
active = CPU 0, CPU1 active = NONE
groupmask = 2 groupmask = 1
/ \ \
0 1 2..7 8
active active idle !online
4) CPU 0 finally resumed after its #VMEXIT. It's in __walk_groups()
returning from tmigr_cpu_active(). The new top GRP1:0 is visible and
fetched but the freshly updated groupmask of GRP0:0 may not be visible
due to lack of ordering! As a result tmigr_active_up() is called to
GRP0:0 with a child's groupmask of "0". This buggy "0" groupmask then
becomes the migrator for GRP1:0 forever. As a result, timers on a fully
idle system get ignored.
One possible fix would be to define TMIGR_NONE as "0" so that such a
race would have no effect. And after all TMIGR_NONE doesn't need to be
anything else. However this would leave an uncomfortable state machine
where gears happen not to break by chance but are vulnerable to future
modifications.
Keep TMIGR_NONE as is instead and pre-initialize to "1" the groupmask of
any newly created top level. This groupmask is guaranteed to be visible
upon fetching the corresponding group for the 1st time:
_ By the upcoming CPU thanks to CPU hotplug synchronization between the
control CPU (BP) and the booting one (AP).
_ By the control CPU since the groupmask and parent pointers are
initialized locally.
_ By all CPUs belonging to the same group than the control CPU because
they must wait for it to ever become idle before needing to walk to
the new top. The cmpcxhg() on ->migr_state then makes sure its
groupmask is visible.
With this pre-initialization, it is guaranteed that if a future top level
is linked to an old one, it is walked through with a valid groupmask.
Fixes: 10a0e6f3d3db ("timers/migration: Move hierarchy setup into cpuhotplug prepare callback")
Signed-off-by: Frederic Weisbecker <frederic@kernel.org>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20250114231507.21672-2-frederic@kernel.org
|
|
According to RFC4303, section "3.3.3. Sequence Number Generation",
the first packet sent using a given SA will contain a sequence
number of 1.
This is applicable to both ESN and non-ESN mode, which was not covered
in commit mentioned in Fixes line.
Fixes: 3d42c8cc67a8 ("net/mlx5e: Ensure that IPsec sequence packet number starts from 1")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
All packet offloads SAs have reqid in it to make sure they have
corresponding policy. While it is not strictly needed for transparent
mode, it is extremely important in tunnel mode. In that mode, policy and
SAs have different match criteria.
Policy catches the whole subnet addresses, and SA catches the tunnel gateways
addresses. The source address of such tunnel is not known during egress packet
traversal in flow steering as it is added only after successful encryption.
As reqid is required for packet offload and it is unique for every SA,
we can safely rely on it only.
The output below shows the configured egress policy and SA by strongswan:
[leonro@vm ~]$ sudo ip x s
src 192.169.101.2 dst 192.169.101.1
proto esp spi 0xc88b7652 reqid 1 mode tunnel
replay-window 0 flag af-unspec esn
aead rfc4106(gcm(aes)) 0xe406a01083986e14d116488549094710e9c57bc6 128
anti-replay esn context:
seq-hi 0x0, seq 0x0, oseq-hi 0x0, oseq 0x0
replay_window 1, bitmap-length 1
00000000
crypto offload parameters: dev eth2 dir out mode packet
[leonro@064 ~]$ sudo ip x p
src 192.170.0.0/16 dst 192.170.0.0/16
dir out priority 383615 ptype main
tmpl src 192.169.101.2 dst 192.169.101.1
proto esp spi 0xc88b7652 reqid 1 mode tunnel
crypto offload parameters: dev eth2 mode packet
Fixes: b3beba1fb404 ("net/mlx5e: Allow policies with reqid 0, to support IKE policy holes")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Attempt to enable IPsec packet offload in tunnel mode in debug kernel
generates the following kernel panic, which is happening due to two
issues:
1. In SA add section, the should be _bh() variant when marking SA mode.
2. There is not needed flush_workqueue in SA delete routine. It is not
needed as at this stage as it is removed from SADB and the running work
will be canceled later in SA free.
=====================================================
WARNING: SOFTIRQ-safe -> SOFTIRQ-unsafe lock order detected
6.12.0+ #4 Not tainted
-----------------------------------------------------
charon/1337 [HC0[0]:SC0[4]:HE1:SE0] is trying to acquire:
ffff88810f365020 (&xa->xa_lock#24){+.+.}-{3:3}, at: mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
and this task is already holding:
ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30
which would create a new lock dependency:
(&x->lock){+.-.}-{3:3} -> (&xa->xa_lock#24){+.+.}-{3:3}
but this new dependency connects a SOFTIRQ-irq-safe lock:
(&x->lock){+.-.}-{3:3}
... which became SOFTIRQ-irq-safe at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
xfrm_timer_handler+0x91/0xd70
__hrtimer_run_queues+0x1dd/0xa60
hrtimer_run_softirq+0x146/0x2e0
handle_softirqs+0x266/0x860
irq_exit_rcu+0x115/0x1a0
sysvec_apic_timer_interrupt+0x6e/0x90
asm_sysvec_apic_timer_interrupt+0x16/0x20
default_idle+0x13/0x20
default_idle_call+0x67/0xa0
do_idle+0x2da/0x320
cpu_startup_entry+0x50/0x60
start_secondary+0x213/0x2a0
common_startup_64+0x129/0x138
to a SOFTIRQ-irq-unsafe lock:
(&xa->xa_lock#24){+.+.}-{3:3}
... which became SOFTIRQ-irq-unsafe at:
...
lock_acquire+0x1be/0x520
_raw_spin_lock+0x2c/0x40
xa_set_mark+0x70/0x110
mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
xfrm_dev_state_add+0x3bb/0xd70
xfrm_add_sa+0x2451/0x4a90
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
other info that might help us debug this:
Possible interrupt unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&xa->xa_lock#24);
local_irq_disable();
lock(&x->lock);
lock(&xa->xa_lock#24);
<Interrupt>
lock(&x->lock);
*** DEADLOCK ***
2 locks held by charon/1337:
#0: ffffffff87f8f858 (&net->xfrm.xfrm_cfg_mutex){+.+.}-{4:4}, at: xfrm_netlink_rcv+0x5e/0x90
#1: ffff88813e0f0d48 (&x->lock){+.-.}-{3:3}, at: xfrm_state_delete+0x16/0x30
the dependencies between SOFTIRQ-irq-safe lock and the holding lock:
-> (&x->lock){+.-.}-{3:3} ops: 29 {
HARDIRQ-ON-W at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
xfrm_alloc_spi+0xc0/0xe60
xfrm_alloc_userspi+0x5f6/0xbc0
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
IN-SOFTIRQ-W at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
xfrm_timer_handler+0x91/0xd70
__hrtimer_run_queues+0x1dd/0xa60
hrtimer_run_softirq+0x146/0x2e0
handle_softirqs+0x266/0x860
irq_exit_rcu+0x115/0x1a0
sysvec_apic_timer_interrupt+0x6e/0x90
asm_sysvec_apic_timer_interrupt+0x16/0x20
default_idle+0x13/0x20
default_idle_call+0x67/0xa0
do_idle+0x2da/0x320
cpu_startup_entry+0x50/0x60
start_secondary+0x213/0x2a0
common_startup_64+0x129/0x138
INITIAL USE at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
xfrm_alloc_spi+0xc0/0xe60
xfrm_alloc_userspi+0x5f6/0xbc0
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
}
... key at: [<ffffffff87f9cd20>] __key.18+0x0/0x40
the dependencies between the lock to be acquired
and SOFTIRQ-irq-unsafe lock:
-> (&xa->xa_lock#24){+.+.}-{3:3} ops: 9 {
HARDIRQ-ON-W at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
xfrm_dev_state_add+0x3bb/0xd70
xfrm_add_sa+0x2451/0x4a90
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
SOFTIRQ-ON-W at:
lock_acquire+0x1be/0x520
_raw_spin_lock+0x2c/0x40
xa_set_mark+0x70/0x110
mlx5e_xfrm_add_state+0xe48/0x2290 [mlx5_core]
xfrm_dev_state_add+0x3bb/0xd70
xfrm_add_sa+0x2451/0x4a90
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
INITIAL USE at:
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
mlx5e_xfrm_add_state+0xc5b/0x2290 [mlx5_core]
xfrm_dev_state_add+0x3bb/0xd70
xfrm_add_sa+0x2451/0x4a90
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
}
... key at: [<ffffffffa078ff60>] __key.48+0x0/0xfffffffffff210a0 [mlx5_core]
... acquired at:
__lock_acquire+0x30a0/0x5040
lock_acquire+0x1be/0x520
_raw_spin_lock_bh+0x34/0x40
mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
xfrm_dev_state_delete+0x90/0x160
__xfrm_state_delete+0x662/0xae0
xfrm_state_delete+0x1e/0x30
xfrm_del_sa+0x1c2/0x340
xfrm_user_rcv_msg+0x493/0x880
netlink_rcv_skb+0x12e/0x380
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
netlink_sendmsg+0x745/0xbe0
__sock_sendmsg+0xc5/0x190
__sys_sendto+0x1fe/0x2c0
__x64_sys_sendto+0xdc/0x1b0
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
stack backtrace:
CPU: 7 UID: 0 PID: 1337 Comm: charon Not tainted 6.12.0+ #4
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0x74/0xd0
check_irq_usage+0x12e8/0x1d90
? print_shortest_lock_dependencies_backwards+0x1b0/0x1b0
? check_chain_key+0x1bb/0x4c0
? __lockdep_reset_lock+0x180/0x180
? check_path.constprop.0+0x24/0x50
? mark_lock+0x108/0x2fb0
? print_circular_bug+0x9b0/0x9b0
? mark_lock+0x108/0x2fb0
? print_usage_bug.part.0+0x670/0x670
? check_prev_add+0x1c4/0x2310
check_prev_add+0x1c4/0x2310
__lock_acquire+0x30a0/0x5040
? lockdep_set_lock_cmp_fn+0x190/0x190
? lockdep_set_lock_cmp_fn+0x190/0x190
lock_acquire+0x1be/0x520
? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
? lockdep_hardirqs_on_prepare+0x400/0x400
? __xfrm_state_delete+0x5f0/0xae0
? lock_downgrade+0x6b0/0x6b0
_raw_spin_lock_bh+0x34/0x40
? mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
mlx5e_xfrm_del_state+0xca/0x1e0 [mlx5_core]
xfrm_dev_state_delete+0x90/0x160
__xfrm_state_delete+0x662/0xae0
xfrm_state_delete+0x1e/0x30
xfrm_del_sa+0x1c2/0x340
? xfrm_get_sa+0x250/0x250
? check_chain_key+0x1bb/0x4c0
xfrm_user_rcv_msg+0x493/0x880
? copy_sec_ctx+0x270/0x270
? check_chain_key+0x1bb/0x4c0
? lockdep_set_lock_cmp_fn+0x190/0x190
? lockdep_set_lock_cmp_fn+0x190/0x190
netlink_rcv_skb+0x12e/0x380
? copy_sec_ctx+0x270/0x270
? netlink_ack+0xd90/0xd90
? netlink_deliver_tap+0xcd/0xb60
xfrm_netlink_rcv+0x6d/0x90
netlink_unicast+0x42f/0x740
? netlink_attachskb+0x730/0x730
? lock_acquire+0x1be/0x520
netlink_sendmsg+0x745/0xbe0
? netlink_unicast+0x740/0x740
? __might_fault+0xbb/0x170
? netlink_unicast+0x740/0x740
__sock_sendmsg+0xc5/0x190
? fdget+0x163/0x1d0
__sys_sendto+0x1fe/0x2c0
? __x64_sys_getpeername+0xb0/0xb0
? do_user_addr_fault+0x856/0xe30
? lock_acquire+0x1be/0x520
? __task_pid_nr_ns+0x117/0x410
? lock_downgrade+0x6b0/0x6b0
__x64_sys_sendto+0xdc/0x1b0
? lockdep_hardirqs_on_prepare+0x284/0x400
do_syscall_64+0x6d/0x140
entry_SYSCALL_64_after_hwframe+0x4b/0x53
RIP: 0033:0x7f7d31291ba4
Code: 7d e8 89 4d d4 e8 4c 42 f7 ff 44 8b 4d d0 4c 8b 45 c8 89 c3 44 8b 55 d4 8b 7d e8 b8 2c 00 00 00 48 8b 55 d8 48 8b 75 e0 0f 05 <48> 3d 00 f0 ff ff 77 34 89 df 48 89 45 e8 e8 99 42 f7 ff 48 8b 45
RSP: 002b:00007f7d2ccd94f0 EFLAGS: 00000297 ORIG_RAX: 000000000000002c
RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f7d31291ba4
RDX: 0000000000000028 RSI: 00007f7d2ccd96a0 RDI: 000000000000000a
RBP: 00007f7d2ccd9530 R08: 00007f7d2ccd9598 R09: 000000000000000c
R10: 0000000000000000 R11: 0000000000000297 R12: 0000000000000028
R13: 00007f7d2ccd9598 R14: 00007f7d2ccd96a0 R15: 00000000000000e1
</TASK>
Fixes: 4c24272b4e2b ("net/mlx5e: Listen to ARP events to update IPsec L2 headers in tunnel mode")
Signed-off-by: Leon Romanovsky <leonro@nvidia.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Clear the port select structure on error so no stale values left after
definers are destroyed. That's because the mlx5_lag_destroy_definers()
always try to destroy all lag definers in the tt_map, so in the flow
below lag definers get double-destroyed and cause kernel crash:
mlx5_lag_port_sel_create()
mlx5_lag_create_definers()
mlx5_lag_create_definer() <- Failed on tt 1
mlx5_lag_destroy_definers() <- definers[tt=0] gets destroyed
mlx5_lag_port_sel_create()
mlx5_lag_create_definers()
mlx5_lag_create_definer() <- Failed on tt 0
mlx5_lag_destroy_definers() <- definers[tt=0] gets double-destroyed
Unable to handle kernel NULL pointer dereference at virtual address 0000000000000008
Mem abort info:
ESR = 0x0000000096000005
EC = 0x25: DABT (current EL), IL = 32 bits
SET = 0, FnV = 0
EA = 0, S1PTW = 0
FSC = 0x05: level 1 translation fault
Data abort info:
ISV = 0, ISS = 0x00000005, ISS2 = 0x00000000
CM = 0, WnR = 0, TnD = 0, TagAccess = 0
GCS = 0, Overlay = 0, DirtyBit = 0, Xs = 0
user pgtable: 64k pages, 48-bit VAs, pgdp=0000000112ce2e00
[0000000000000008] pgd=0000000000000000, p4d=0000000000000000, pud=0000000000000000
Internal error: Oops: 0000000096000005 [#1] PREEMPT SMP
Modules linked in: iptable_raw bonding ip_gre ip6_gre gre ip6_tunnel tunnel6 geneve ip6_udp_tunnel udp_tunnel ipip tunnel4 ip_tunnel rdma_ucm(OE) rdma_cm(OE) iw_cm(OE) ib_ipoib(OE) ib_cm(OE) ib_umad(OE) mlx5_ib(OE) ib_uverbs(OE) mlx5_fwctl(OE) fwctl(OE) mlx5_core(OE) mlxdevm(OE) ib_core(OE) mlxfw(OE) memtrack(OE) mlx_compat(OE) openvswitch nsh nf_conncount psample xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo xt_addrtype iptable_filter iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 br_netfilter bridge stp llc netconsole overlay efi_pstore sch_fq_codel zram ip_tables crct10dif_ce qemu_fw_cfg fuse ipv6 crc_ccitt [last unloaded: mlx_compat(OE)]
CPU: 3 UID: 0 PID: 217 Comm: kworker/u53:2 Tainted: G OE 6.11.0+ #2
Tainted: [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
Workqueue: mlx5_lag mlx5_do_bond_work [mlx5_core]
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
lr : mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
sp : ffff800085fafb00
x29: ffff800085fafb00 x28: ffff0000da0c8000 x27: 0000000000000000
x26: ffff0000da0c8000 x25: ffff0000da0c8000 x24: ffff0000da0c8000
x23: ffff0000c31f81a0 x22: 0400000000000000 x21: ffff0000da0c8000
x20: 0000000000000000 x19: 0000000000000001 x18: 0000000000000000
x17: 0000000000000000 x16: 0000000000000000 x15: 0000ffff8b0c9350
x14: 0000000000000000 x13: ffff800081390d18 x12: ffff800081dc3cc0
x11: 0000000000000001 x10: 0000000000000b10 x9 : ffff80007ab7304c
x8 : ffff0000d00711f0 x7 : 0000000000000004 x6 : 0000000000000190
x5 : ffff00027edb3010 x4 : 0000000000000000 x3 : 0000000000000000
x2 : ffff0000d39b8000 x1 : ffff0000d39b8000 x0 : 0400000000000000
Call trace:
mlx5_del_flow_rules+0x24/0x2c0 [mlx5_core]
mlx5_lag_destroy_definer+0x54/0x100 [mlx5_core]
mlx5_lag_destroy_definers+0xa0/0x108 [mlx5_core]
mlx5_lag_port_sel_create+0x2d4/0x6f8 [mlx5_core]
mlx5_activate_lag+0x60c/0x6f8 [mlx5_core]
mlx5_do_bond_work+0x284/0x5c8 [mlx5_core]
process_one_work+0x170/0x3e0
worker_thread+0x2d8/0x3e0
kthread+0x11c/0x128
ret_from_fork+0x10/0x20
Code: a9025bf5 aa0003f6 a90363f7 f90023f9 (f9400400)
---[ end trace 0000000000000000 ]---
Fixes: dc48516ec7d3 ("net/mlx5: Lag, add support to create definers for LAG")
Signed-off-by: Mark Zhang <markzhang@nvidia.com>
Reviewed-by: Leon Romanovsky <leonro@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
If failed to add SF, error handling doesn't delete the SF from the
SF table. But the hw resources are deleted. So when unload driver,
hw resources will be deleted again. Firmware will report syndrome
0x68def3 which means "SF is not allocated can not deallocate".
Fix it by delete SF from SF table if failed to add SF.
Fixes: 2597ee190b4e ("net/mlx5: Call mlx5_sf_id_erase() once in mlx5_sf_dealloc()")
Signed-off-by: Chris Mi <cmi@nvidia.com>
Reviewed-by: Shay Drori <shayd@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
Fix a lockdep warning [1] observed during the write combining test.
The warning indicates a potential nested lock scenario that could lead
to a deadlock.
However, this is a false positive alarm because the SF lock and its
parent lock are distinct ones.
The lockdep confusion arises because the locks belong to the same object
class (i.e., struct mlx5_core_dev).
To resolve this, the code has been refactored to avoid taking both
locks. Instead, only the parent lock is acquired.
[1]
raw_ethernet_bw/2118 is trying to acquire lock:
[ 213.619032] ffff88811dd75e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.620270]
[ 213.620270] but task is already holding lock:
[ 213.620943] ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3}, at:
mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[ 213.622045]
[ 213.622045] other info that might help us debug this:
[ 213.622778] Possible unsafe locking scenario:
[ 213.622778]
[ 213.623465] CPU0
[ 213.623815] ----
[ 213.624148] lock(&dev->wc_state_lock);
[ 213.624615] lock(&dev->wc_state_lock);
[ 213.625071]
[ 213.625071] *** DEADLOCK ***
[ 213.625071]
[ 213.625805] May be due to missing lock nesting notation
[ 213.625805]
[ 213.626522] 4 locks held by raw_ethernet_bw/2118:
[ 213.627019] #0: ffff88813f80d578 (&uverbs_dev->disassociate_srcu){.+.+}-{0:0},
at: ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[ 213.628088] #1: ffff88810fb23930 (&file->hw_destroy_rwsem){.+.+}-{3:3},
at: ib_init_ucontext+0x2d/0xf0 [ib_uverbs]
[ 213.629094] #2: ffff88810fb23878 (&file->ucontext_lock){+.+.}-{3:3},
at: ib_init_ucontext+0x49/0xf0 [ib_uverbs]
[ 213.630106] #3: ffff88810b585e08 (&dev->wc_state_lock){+.+.}-{3:3},
at: mlx5_wc_support_get+0x10c/0x210 [mlx5_core]
[ 213.631185]
[ 213.631185] stack backtrace:
[ 213.631718] CPU: 1 UID: 0 PID: 2118 Comm: raw_ethernet_bw Not tainted
6.12.0-rc7_internal_net_next_mlx5_89a0ad0 #1
[ 213.632722] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 213.633785] Call Trace:
[ 213.634099]
[ 213.634393] dump_stack_lvl+0x7e/0xc0
[ 213.634806] print_deadlock_bug+0x278/0x3c0
[ 213.635265] __lock_acquire+0x15f4/0x2c40
[ 213.635712] lock_acquire+0xcd/0x2d0
[ 213.636120] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.636722] ? mlx5_ib_enable_lb+0x24/0xa0 [mlx5_ib]
[ 213.637277] __mutex_lock+0x81/0xda0
[ 213.637697] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.638305] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.638902] ? rcu_read_lock_sched_held+0x3f/0x70
[ 213.639400] ? mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.640016] mlx5_wc_support_get+0x18c/0x210 [mlx5_core]
[ 213.640615] set_ucontext_resp+0x68/0x2b0 [mlx5_ib]
[ 213.641144] ? debug_mutex_init+0x33/0x40
[ 213.641586] mlx5_ib_alloc_ucontext+0x18e/0x7b0 [mlx5_ib]
[ 213.642145] ib_init_ucontext+0xa0/0xf0 [ib_uverbs]
[ 213.642679] ib_uverbs_handler_UVERBS_METHOD_GET_CONTEXT+0x95/0xc0
[ib_uverbs]
[ 213.643426] ? _copy_from_user+0x46/0x80
[ 213.643878] ib_uverbs_cmd_verbs+0xa6b/0xc80 [ib_uverbs]
[ 213.644426] ? ib_uverbs_handler_UVERBS_METHOD_INVOKE_WRITE+0x130/0x130
[ib_uverbs]
[ 213.645213] ? __lock_acquire+0xa99/0x2c40
[ 213.645675] ? lock_acquire+0xcd/0x2d0
[ 213.646101] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[ 213.646625] ? reacquire_held_locks+0xcf/0x1f0
[ 213.647102] ? do_user_addr_fault+0x45d/0x770
[ 213.647586] ib_uverbs_ioctl+0xe0/0x170 [ib_uverbs]
[ 213.648102] ? ib_uverbs_ioctl+0xc4/0x170 [ib_uverbs]
[ 213.648632] __x64_sys_ioctl+0x4d3/0xaa0
[ 213.649060] ? do_user_addr_fault+0x4a8/0x770
[ 213.649528] do_syscall_64+0x6d/0x140
[ 213.649947] entry_SYSCALL_64_after_hwframe+0x4b/0x53
[ 213.650478] RIP: 0033:0x7fa179b0737b
[ 213.650893] Code: ff ff ff 85 c0 79 9b 49 c7 c4 ff ff ff ff 5b 5d 4c
89 e0 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa b8
10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d
7d 2a 0f 00 f7 d8 64 89 01 48
[ 213.652619] RSP: 002b:00007ffd2e6d46e8 EFLAGS: 00000246 ORIG_RAX:
0000000000000010
[ 213.653390] RAX: ffffffffffffffda RBX: 00007ffd2e6d47f8 RCX:
00007fa179b0737b
[ 213.654084] RDX: 00007ffd2e6d47e0 RSI: 00000000c0181b01 RDI:
0000000000000003
[ 213.654767] RBP: 00007ffd2e6d47c0 R08: 00007fa1799be010 R09:
0000000000000002
[ 213.655453] R10: 00007ffd2e6d4960 R11: 0000000000000246 R12:
00007ffd2e6d487c
[ 213.656170] R13: 0000000000000027 R14: 0000000000000001 R15:
00007ffd2e6d4f70
Fixes: d98995b4bf98 ("net/mlx5: Reimplement write combining test")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Michael Guralnik <michaelgur@nvidia.com>
Reviewed-by: Larysa Zaremba <larysa.zaremba@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
User added steering rules at RDMA_TX were being added to the first prio,
which is the counters prio.
Fix that so that they are correctly added to the BYPASS_PRIO instead.
Fixes: 24670b1a3166 ("net/mlx5: Add support for RDMA TX steering")
Signed-off-by: Patrisious Haddad <phaddad@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Reviewed-by: Jacob Keller <jacob.e.keller@intel.com>
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
|
syz reports an out of bounds read:
==================================================================
BUG: KASAN: slab-out-of-bounds in ocfs2_match fs/ocfs2/dir.c:334
[inline]
BUG: KASAN: slab-out-of-bounds in ocfs2_search_dirblock+0x283/0x6e0
fs/ocfs2/dir.c:367
Read of size 1 at addr ffff88804d8b9982 by task syz-executor.2/14802
CPU: 0 UID: 0 PID: 14802 Comm: syz-executor.2 Not tainted 6.13.0-rc4 #2
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1
04/01/2014
Sched_ext: serialise (enabled+all), task: runnable_at=-10ms
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:94 [inline]
dump_stack_lvl+0x229/0x350 lib/dump_stack.c:120
print_address_description mm/kasan/report.c:378 [inline]
print_report+0x164/0x530 mm/kasan/report.c:489
kasan_report+0x147/0x180 mm/kasan/report.c:602
ocfs2_match fs/ocfs2/dir.c:334 [inline]
ocfs2_search_dirblock+0x283/0x6e0 fs/ocfs2/dir.c:367
ocfs2_find_entry_id fs/ocfs2/dir.c:414 [inline]
ocfs2_find_entry+0x1143/0x2db0 fs/ocfs2/dir.c:1078
ocfs2_find_files_on_disk+0x18e/0x530 fs/ocfs2/dir.c:1981
ocfs2_lookup_ino_from_name+0xb6/0x110 fs/ocfs2/dir.c:2003
ocfs2_lookup+0x30a/0xd40 fs/ocfs2/namei.c:122
lookup_open fs/namei.c:3627 [inline]
open_last_lookups fs/namei.c:3748 [inline]
path_openat+0x145a/0x3870 fs/namei.c:3984
do_filp_open+0xe9/0x1c0 fs/namei.c:4014
do_sys_openat2+0x135/0x1d0 fs/open.c:1402
do_sys_open fs/open.c:1417 [inline]
__do_sys_openat fs/open.c:1433 [inline]
__se_sys_openat fs/open.c:1428 [inline]
__x64_sys_openat+0x15d/0x1c0 fs/open.c:1428
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf6/0x210 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f01076903ad
Code: c3 e8 a7 2b 00 00 0f 1f 80 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89
f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01
f0 ff ff 73 01 c3 48 c7 c1 b0 ff ff ff f7 d8 64 89 01 48
RSP: 002b:00007f01084acfc8 EFLAGS: 00000246 ORIG_RAX: 0000000000000101
RAX: ffffffffffffffda RBX: 00007f01077cbf80 RCX: 00007f01076903ad
RDX: 0000000000105042 RSI: 0000000020000080 RDI: ffffffffffffff9c
RBP: 00007f01077cbf80 R08: 0000000000000000 R09: 0000000000000000
R10: 00000000000001ff R11: 0000000000000246 R12: 0000000000000000
R13: 00007f01077cbf80 R14: 00007f010764fc90 R15: 00007f010848d000
</TASK>
==================================================================
And a general protection fault in ocfs2_prepare_dir_for_insert:
==================================================================
loop0: detected capacity change from 0 to 32768
JBD2: Ignoring recovery information on journal
ocfs2: Mounting device (7,0) on (node local, slot 0) with ordered data
mode.
Oops: general protection fault, probably for non-canonical address
0xdffffc0000000001: 0000 [#1] PREEMPT SMP KASAN NOPTI
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 UID: 0 PID: 5096 Comm: syz-executor792 Not tainted
6.11.0-rc4-syzkaller-00002-gb0da640826ba #0
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
1.16.3-debian-1.16.3-2~bpo12+1 04/01/2014
RIP: 0010:ocfs2_find_dir_space_id fs/ocfs2/dir.c:3406 [inline]
RIP: 0010:ocfs2_prepare_dir_for_insert+0x3309/0x5c70 fs/ocfs2/dir.c:4280
Code: 00 00 e8 2a 25 13 fe e9 ba 06 00 00 e8 20 25 13 fe e9 4f 01 00 00
e8 16 25 13 fe 49 8d 7f 08 49 8d 5f 09 48 89 f8 48 c1 e8 03 <42> 0f b6
04 20 84 c0 0f 85 bd 23 00 00 48 89 d8 48 c1 e8 03 42 0f
RSP: 0018:ffffc9000af9f020 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000009 RCX: ffff88801e27a440
RDX: 0000000000000000 RSI: 0000000000000400 RDI: 0000000000000008
RBP: ffffc9000af9f830 R08: ffffffff8380395b R09: ffffffff838090a7
R10: 0000000000000002 R11: ffff88801e27a440 R12: dffffc0000000000
R13: ffff88803c660878 R14: f700000000000088 R15: 0000000000000000
FS: 000055555a677380(0000) GS:ffff888020800000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000560bce569178 CR3: 000000001de5a000 CR4: 0000000000350ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
ocfs2_mknod+0xcaf/0x2b40 fs/ocfs2/namei.c:292
vfs_mknod+0x36d/0x3b0 fs/namei.c:4088
do_mknodat+0x3ec/0x5b0
__do_sys_mknodat fs/namei.c:4166 [inline]
__se_sys_mknodat fs/namei.c:4163 [inline]
__x64_sys_mknodat+0xa7/0xc0 fs/namei.c:4163
do_syscall_x64 arch/x86/entry/common.c:52 [inline]
do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83
entry_SYSCALL_64_after_hwframe+0x77/0x7f
RIP: 0033:0x7f2dafda3a99
Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 17 00 00 90 48 89
f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8
64 89 01 48
RSP: 002b:00007ffe336a6658 EFLAGS: 00000246 ORIG_RAX:
0000000000000103
RAX: ffffffffffffffda RBX: 0000000000000000 RCX:
00007f2dafda3a99
RDX: 00000000000021c0 RSI: 0000000020000040 RDI:
00000000ffffff9c
RBP: 00007f2dafe1b5f0 R08: 0000000000004480 R09:
000055555a6784c0
R10: 0000000000000103 R11: 0000000000000246 R12:
00007ffe336a6680
R13: 00007ffe336a68a8 R14: 431bde82d7b634db R15:
00007f2dafdec03b
</TASK>
==================================================================
The two reports are all caused invalid negative i_size of dir inode. For
ocfs2, dir_inode can't be negative or zero.
Here add a check in which is called by ocfs2_check_dir_for_entry(). It
fixes the second report as ocfs2_check_dir_for_entry() must be called
before ocfs2_prepare_dir_for_insert(). Also set a up limit for dir with
OCFS2_INLINE_DATA_FL. The i_size can't be great than blocksize.
Link: https://lkml.kernel.org/r/20250106140640.92260-1-glass.su@suse.com
Reported-by: Jiacheng Xu <stitch@zju.edu.cn>
Link: https://lore.kernel.org/ocfs2-devel/17a04f01.1ae74.19436d003fc.Coremail.stitch@zju.edu.cn/T/#u
Reported-by: syzbot+5a64828fcc4c2ad9b04f@syzkaller.appspotmail.com
Link: https://lore.kernel.org/all/0000000000005894f3062018caf1@google.com/T/
Signed-off-by: Su Yue <glass.su@suse.com>
Reviewed-by: Heming Zhao <heming.zhao@suse.com>
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Cc: Mark Fasheh <mark@fasheh.com>
Cc: Joel Becker <jlbec@evilplan.org>
Cc: Junxiao Bi <junxiao.bi@oracle.com>
Cc: Changwei Ge <gechangwei@live.cn>
Cc: Jun Piao <piaojun@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Map old gmail + name to my current full name and email.
Link: https://lkml.kernel.org/r/xbfkmvmp4wyxrvlan57bjnul5icrwfyt67vnhhw2cyr5rzbnee@mfvihhd6s7l5
Signed-off-by: Ethan Carter Edwards <ethan@ethancedwards.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
In zswap_cpu_comp_prepare(), allocations are made and assigned to various
members of acomp_ctx under acomp_ctx->mutex. However, allocations may
recurse into zswap through reclaim, trying to acquire the same mutex and
deadlocking.
Move the allocations before the mutex critical section. Only the
initialization of acomp_ctx needs to be done with the mutex held.
Link: https://lkml.kernel.org/r/20250113214458.2123410-1-yosryahmed@google.com
Fixes: 12dcb0ef5406 ("mm: zswap: properly synchronize freeing resources during CPU hotunplug")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
syzkaller reported such a BUG_ON():
------------[ cut here ]------------
kernel BUG at mm/khugepaged.c:1835!
Internal error: Oops - BUG: 00000000f2000800 [#1] SMP
...
CPU: 6 UID: 0 PID: 8009 Comm: syz.15.106 Kdump: loaded Tainted: G W 6.13.0-rc6 #22
Tainted: [W]=WARN
Hardware name: QEMU KVM Virtual Machine, BIOS 0.0.0 02/06/2015
pstate: 00400005 (nzcv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : collapse_file+0xa44/0x1400
lr : collapse_file+0x88/0x1400
sp : ffff80008afe3a60
...
Call trace:
collapse_file+0xa44/0x1400 (P)
hpage_collapse_scan_file+0x278/0x400
madvise_collapse+0x1bc/0x678
madvise_vma_behavior+0x32c/0x448
madvise_walk_vmas.constprop.0+0xbc/0x140
do_madvise.part.0+0xdc/0x2c8
__arm64_sys_madvise+0x68/0x88
invoke_syscall+0x50/0x120
el0_svc_common.constprop.0+0xc8/0xf0
do_el0_svc+0x24/0x38
el0_svc+0x34/0x128
el0t_64_sync_handler+0xc8/0xd0
el0t_64_sync+0x190/0x198
This indicates that the pgoff is unaligned. After analysis, I confirm the
vma is mapped to /dev/zero. Such a vma certainly has vm_file, but it is
set to anonymous by mmap_zero(). So even if it's mmapped by 2m-unaligned,
it can pass the check in thp_vma_allowable_order() as it is an
anonymous-mmap, but then be collapsed as a file-mmap.
It seems the problem has existed for a long time, but actually, since we
have khugepaged_max_ptes_none check before, we will skip collapse it as it
is /dev/zero and so has no present page. But commit d8ea7cc8547c limit
the check for only khugepaged, so the BUG_ON() can be triggered by
madvise_collapse().
Add vma_is_anonymous() check to make such vma be processed by
hpage_collapse_scan_pmd().
Link: https://lkml.kernel.org/r/20250111034511.2223353-1-liushixin2@huawei.com
Fixes: d8ea7cc8547c ("mm/khugepaged: add flag to predicate khugepaged-only behavior")
Signed-off-by: Liu Shixin <liushixin2@huawei.com>
Reviewed-by: Yang Shi <yang@os.amperecomputing.com>
Acked-by: David Hildenbrand <david@redhat.com>
Cc: Chengming Zhou <chengming.zhou@linux.dev>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Cc: Mattew Wilcox <willy@infradead.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Nanyong Sun <sunnanyong@huawei.com>
Cc: Qi Zheng <zhengqi.arch@bytedance.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
Fixes an issue where the use of an unsigned data type in
`shmem_parse_opt_casefold()` caused incorrect evaluation of negative
conditions.
Link: https://lkml.kernel.org/r/20250111-unsignedcompare1601569-v3-1-c861b4221831@gmail.com
Fixes: 58e55efd6c72 ("tmpfs: Add casefold lookup support")
Reviewed-by: André Almeida <andrealmeid@igalia.com>
Reviewed-by: Gabriel Krisman Bertazi <gabriel@krisman.be>
Signed-off-by: Karan Sanghavi <karansanghvi98@gmail.com>
Cc: Christian Brauner <brauner@kernel.org>
Cc: Hugh Dickens <hughd@google.com>
Cc: Shuah khan <skhan@linuxfoundation.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
When memory allocation profiling is disabled, there is no need to swap
allocation tags during migration. Skip it to avoid unnecessary overhead.
Once I added these checks, the overhead of the mode when memory profiling
is enabled but turned off went down by about 50%.
Link: https://lkml.kernel.org/r/20241226211639.1357704-2-surenb@google.com
Fixes: e0a955bf7f61 ("mm/codetag: add pgalloc_tag_copy()")
Signed-off-by: Suren Baghdasaryan <surenb@google.com>
Cc: David Wang <00107082@163.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Yu Zhao <yuzhao@google.com>
Cc: Zhenhua Huang <quic_zhenhuah@quicinc.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
adjust_managed_page_count
In the kernel, the zone's lowmem_reserve and _watermark, and the global
variable 'totalreserve_pages' depend on the value of managed_pages, but
after running adjust_managed_page_count, these values aren't updated,
which causes some problems.
For example, in a system with six 1GB large pages, we found that the value
of protection in zoneinfo (zone->lowmem_reserve), is not right. Its value
seems to be calculated from the initial managed_pages, but after the
managed_pages changed, was not updated. Only after reading the file
/proc/sys/vm/lowmem_reserve_ratio, updates happen.
read file /proc/sys/vm/lowmem_reserve_ratio:
lowmem_reserve_ratio_sysctl_handler
----setup_per_zone_lowmem_reserve
--------calculate_totalreserve_pages
protection changed after reading file:
[root@test ~]# cat /proc/zoneinfo | grep protection
protection: (0, 2719, 57360, 0)
protection: (0, 0, 54640, 0)
protection: (0, 0, 0, 0)
protection: (0, 0, 0, 0)
[root@test ~]# cat /proc/sys/vm/lowmem_reserve_ratio
256 256 32 0
[root@test ~]# cat /proc/zoneinfo | grep protection
protection: (0, 2735, 63524, 0)
protection: (0, 0, 60788, 0)
protection: (0, 0, 0, 0)
protection: (0, 0, 0, 0)
lowmem_reserve increased also makes the totalreserve_pages increased,
which causes a decrease in available memory. The one above is just a test
machine, and the increase is not significant. On our online machine, the
reserved memory will increase by several GB due to reading this file. It
is clearly unreasonable to cause a sharp drop in available memory just by
reading a file.
In this patch, we update reserve memory when update managed_pages, The
size of reserved memory becomes stable. But it seems that the _watermark
should also be updated along with the managed_pages. We have not done it
because we are unsure if it is reasonable to set the watermark through the
initial managed_pages. If it is not reasonable, we will propose new
patch.
Link: https://lkml.kernel.org/r/20241225021034.45693-1-15645113830zzh@gmail.com
Signed-off-by: zihan zhou <15645113830zzh@gmail.com>
Signed-off-by: yaowenchao <yaowenchao@jd.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
page_pool_ref_netmem() should work with either netmem representation, but
currently it casts to a page with netmem_to_page(), which will fail with
net iovs. Use netmem_get_pp_ref_count_ref() instead.
Fixes: 8ab79ed50cf1 ("page_pool: devmem support")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: David Wei <dw@davidwei.uk>
Link: https://lore.kernel.org/20250108220644.3528845-2-dw@davidwei.uk
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
When shutting down the server in cifs_put_tcp_session(), cifsd thread
might be reconnecting to multiple DFS targets before it realizes it
should exit the loop, so @server->hostname can't be freed as long as
cifsd thread isn't done. Otherwise the following can happen:
RIP: 0010:__slab_free+0x223/0x3c0
Code: 5e 41 5f c3 cc cc cc cc 4c 89 de 4c 89 cf 44 89 44 24 08 4c 89
1c 24 e8 fb cf 8e 00 44 8b 44 24 08 4c 8b 1c 24 e9 5f fe ff ff <0f>
0b 41 f7 45 08 00 0d 21 00 0f 85 2d ff ff ff e9 1f ff ff ff 80
RSP: 0018:ffffb26180dbfd08 EFLAGS: 00010246
RAX: ffff8ea34728e510 RBX: ffff8ea34728e500 RCX: 0000000000800068
RDX: 0000000000800068 RSI: 0000000000000000 RDI: ffff8ea340042400
RBP: ffffe112041ca380 R08: 0000000000000001 R09: 0000000000000000
R10: 6170732e31303000 R11: 70726f632e786563 R12: ffff8ea34728e500
R13: ffff8ea340042400 R14: ffff8ea34728e500 R15: 0000000000800068
FS: 0000000000000000(0000) GS:ffff8ea66fd80000(0000)
000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffc25376080 CR3: 000000012a2ba001 CR4:
PKRU: 55555554
Call Trace:
<TASK>
? show_trace_log_lvl+0x1c4/0x2df
? show_trace_log_lvl+0x1c4/0x2df
? __reconnect_target_unlocked+0x3e/0x160 [cifs]
? __die_body.cold+0x8/0xd
? die+0x2b/0x50
? do_trap+0xce/0x120
? __slab_free+0x223/0x3c0
? do_error_trap+0x65/0x80
? __slab_free+0x223/0x3c0
? exc_invalid_op+0x4e/0x70
? __slab_free+0x223/0x3c0
? asm_exc_invalid_op+0x16/0x20
? __slab_free+0x223/0x3c0
? extract_hostname+0x5c/0xa0 [cifs]
? extract_hostname+0x5c/0xa0 [cifs]
? __kmalloc+0x4b/0x140
__reconnect_target_unlocked+0x3e/0x160 [cifs]
reconnect_dfs_server+0x145/0x430 [cifs]
cifs_handle_standard+0x1ad/0x1d0 [cifs]
cifs_demultiplex_thread+0x592/0x730 [cifs]
? __pfx_cifs_demultiplex_thread+0x10/0x10 [cifs]
kthread+0xdd/0x100
? __pfx_kthread+0x10/0x10
ret_from_fork+0x29/0x50
</TASK>
Fixes: 7be3248f3139 ("cifs: To match file servers, make sure the server hostname matches")
Reported-by: Jay Shin <jaeshin@redhat.com>
Signed-off-by: Paulo Alcantara (Red Hat) <pc@manguebit.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
|
Fix use of DIV_ROUND_CLOSEST where a possibly negative value is divided
by an unsigned type by casting the unsigned type to the signed type of
the same size (st->r_sense_uohm[channel] has type of u32).
The docs on the DIV_ROUND_CLOSEST macro explain that dividing a negative
value by an unsigned type is undefined behavior. The actual behavior is
that it converts both values to unsigned before doing the division, for
example:
int ret = DIV_ROUND_CLOSEST(-100, 3U);
results in ret == 1431655732 instead of -33.
Fixes: 2b9ea4262ae9 ("hwmon: Add driver for ltc2991")
Signed-off-by: David Lechner <dlechner@baylibre.com>
Link: https://lore.kernel.org/r/20250115-hwmon-ltc2991-fix-div-round-closest-v1-1-b4929667e457@baylibre.com
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
In 4.19, before the switch to linkmode bitmaps, PHY_GBIT_FEATURES
included feature bits for aneg and TP/MII ports.
SUPPORTED_TP | \
SUPPORTED_MII)
SUPPORTED_10baseT_Full)
SUPPORTED_100baseT_Full)
SUPPORTED_1000baseT_Full)
PHY_100BT_FEATURES | \
PHY_DEFAULT_FEATURES)
PHY_1000BT_FEATURES)
Referenced commit expanded PHY_GBIT_FEATURES, silently removing
PHY_DEFAULT_FEATURES. The removed part can be re-added by using
the new PHY_GBIT_FEATURES definition.
Not clear to me is why nobody seems to have noticed this issue.
I stumbled across this when checking what it takes to make
phy_10_100_features_array et al private to phylib.
Fixes: d0939c26c53a ("net: ethernet: xgbe: expand PHY_GBIT_FEAUTRES")
Cc: stable@vger.kernel.org
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Link: https://patch.msgid.link/46521973-7738-4157-9f5e-0bb6f694acba@gmail.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
xpcs_config_2500basex() sets DW_VR_MII_DIG_CTRL1_2G5_EN, but
xpcs_config_aneg_c37_sgmii() never unsets it. So, on a protocol change
from 2500base-x to sgmii, the DW_VR_MII_DIG_CTRL1_2G5_EN bit will remain
set.
Fixes: f27abde3042a ("net: pcs: add 2500BASEX support for Intel mGbE controller")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250114164721.2879380-2-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
w/o inband
On a port with SGMII fixed-link at SPEED_1000, DW_VR_MII_DIG_CTRL1 gets
set to 0x2404. This is incorrect, because bit 2 (DW_VR_MII_DIG_CTRL1_2G5_EN)
is set.
It comes from the previous write to DW_VR_MII_AN_CTRL, because the "val"
variable is reused and is dirty. Actually, its value is 0x4, aka
FIELD_PREP(DW_VR_MII_PCS_MODE_MASK, DW_VR_MII_PCS_MODE_C37_SGMII).
Resolve the issue by clearing "val" to 0 when writing to a new register.
After the fix, the register value is 0x2400.
Prior to the blamed commit, when the read-modify-write was open-coded,
the code saved the content of the DW_VR_MII_DIG_CTRL1 register in the
"ret" variable.
Fixes: ce8d6081fcf4 ("net: pcs: xpcs: add _modify() accessors")
Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk>
Link: https://patch.msgid.link/20250114164721.2879380-1-vladimir.oltean@nxp.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
This backend requests a NACK from the controller driver when it detects
an error. If that request gets ignored from some reason, subsequent
accesses will wrongly be handled OK. To fix this, an error now changes
the state machine, so the backend will report NACK until a STOP
condition has been detected. This make the driver more robust against
controllers which will sadly apply the NACK not to the current byte but
the next one.
Fixes: a8335c64c5f0 ("i2c: add slave testunit driver")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
|
When this controller is a target, the NACK handling had two issues.
First, the return value from the backend was not checked on the initial
WRITE_REQUESTED. So, the driver missed to send a NACK in this case.
Also, the NACK always arrives one byte late on the bus, even in the
WRITE_RECEIVED case. This seems to be a HW issue. We should then not
rely on the backend to correctly NACK the superfluous byte as well. Fix
both issues by introducing a flag which gets set whenever the backend
requests a NACK and keep sending it until we get a STOP condition.
Fixes: de20d1857dd6 ("i2c: rcar: add slave support")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
|
Two characters flipped, fix them.
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
|
When misconfigured, the initial setup of the current mux channel can
fail, too. It must be checked as well.
Fixes: 50a5ba876908 ("i2c: mux: demux-pinctrl: add driver")
Signed-off-by: Wolfram Sang <wsa+renesas@sang-engineering.com>
|
|
This reverts commit 98d1fb94ce75f39febd456d6d3cbbe58b6678795.
The commit uses data nbits instead of addr nbits for dummy phase. This
causes a regression for all boards where spi-tx-bus-width is smaller
than spi-rx-bus-width. It is a common pattern for boards to have
spi-tx-bus-width == 1 and spi-rx-bus-width > 1. The regression causes
all reads with a dummy phase to become unavailable for such boards,
leading to a usually slower 0-dummy-cycle read being selected.
Most controllers' supports_op hooks call spi_mem_default_supports_op().
In spi_mem_default_supports_op(), spi_mem_check_buswidth() is called to
check if the buswidths for the op can actually be supported by the
board's wiring. This wiring information comes from (among other things)
the spi-{tx,rx}-bus-width DT properties. Based on these properties,
SPI_TX_* or SPI_RX_* flags are set by of_spi_parse_dt().
spi_mem_check_buswidth() then uses these flags to make the decision
whether an op can be supported by the board's wiring (in a way,
indirectly checking against spi-{rx,tx}-bus-width).
Now the tricky bit here is that spi_mem_check_buswidth() does:
if (op->dummy.nbytes &&
spi_check_buswidth_req(mem, op->dummy.buswidth, true))
return false;
The true argument to spi_check_buswidth_req() means the op is treated as
a TX op. For a board that has say 1-bit TX and 4-bit RX, a 4-bit dummy
TX is considered as unsupported, and the op gets rejected.
The commit being reverted uses the data buswidth for dummy buswidth. So
for reads, the RX buswidth gets used for the dummy phase, uncovering
this issue. In reality, a dummy phase is neither RX nor TX. As the name
suggests, these are just dummy cycles that send or receive no data, and
thus don't really need to have any buswidth at all.
Ideally, dummy phases should not be checked against the board's wiring
capabilities at all, and should only be sanity-checked for having a sane
buswidth value. Since we are now at rc7 and such a change might
introduce many unexpected bugs, revert the commit for now. It can be
sent out later along with the spi_mem_check_buswidth() fix.
Fixes: 98d1fb94ce75 ("mtd: spi-nor: core: replace dummy buswidth from addr to data")
Reported-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Closes: https://lore.kernel.org/linux-mtd/3342163.44csPzL39Z@steina-w/
Tested-by: Alexander Stein <alexander.stein@ew.tq-group.com>
Reviewed-by: Tudor Ambarus <tudor.ambarus@linaro.org>
Signed-off-by: Pratyush Yadav <pratyush@kernel.org>
Signed-off-by: Miquel Raynal <miquel.raynal@bootlin.com>
|
|
syzbot triggered the warning in posixtimer_send_sigqueue(), which warns
about a non-ignored signal being already queued on the ignored list.
The warning is actually bogus, as the following sequence causes this:
signal($SIG, SIGIGN);
timer_settime(...); // arm periodic timer
timer fires, signal is ignored and queued on ignored list
sigprocmask(SIG_BLOCK, ...); // block the signal
timer_settime(...); // re-arm periodic timer
timer fires, signal is not ignored because it is blocked
---> Warning triggers as signal is on the ignored list
Ideally timer_settime() could remove the signal, but that's racy and
incomplete vs. other scenarios and requires a full reevaluation of the
pending signal list.
Instead of adding more complexity, handle it gracefully by removing the
warning and requeueing the signal to the pending list. That's correct
versus:
1) sig[timed]wait() as that does not check for SIGIGN and only relies on
dequeue_signal() -> posixtimers_deliver_signal() to check whether the
pending signal is still valid.
2) Unblocking of the signal.
- If the unblocking happens before SIGIGN is replaced by a signal
handler, then the timer is rearmed in dequeue_signal(), but
get_signal() will ignore it. The next timer expiry will move it back
to the ignored list.
- If SIGIGN was replaced before unblocking, then the signal will be
delivered and a subsequent expiry will queue a signal on the pending
list again.
There is a related scenario to trigger the complementary warning in the
signal ignored path, which does not expect the signal to be on the pending
list when it is ignored. That can be triggered even before the above change
via:
task1 task2
signal($SIG, SIGIGN);
sigprocmask(SIG_BLOCK, ...);
timer_create(); // Signal target is task2
timer_settime(...); // arm periodic timer
timer fires, signal is not ignored because it is blocked
and queued on the pending list of task2
syscall()
// Sets the pending flag
sigprocmask(SIG_UNBLOCK, ...);
-> preemption, task2 cannot dequeue the signal
timer_settime(...); // re-arm periodic timer
timer fires, signal is ignored
---> Warning triggers as signal is on task2's pending list
and the thread group is not exiting
Consequently, remove that warning too and just keep the signal on the
pending list.
The following attempt to deliver the signal on return to user space of
task2 will ignore the signal and a subsequent expiry will bring it back to
the ignored list, if it did not get blocked or un-ignored before that.
Fixes: df7a996b4dab ("signal: Queue ignored posixtimers on ignore list")
Reported-by: syzbot+3c2e3cc60665d71de2f7@syzkaller.appspotmail.com
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Frederic Weisbecker <frederic@kernel.org>
Link: https://lore.kernel.org/all/87ikqhcnjn.ffs@tglx
|
|
The SQ and CQ ring heads are read twice - once for verifying that it's
within bounds, and once inside the loops copying SQE and CQE entries.
This is technically incorrect, in case the values could get modified
in between verifying them and using them in the copy loop. While this
won't lead to anything truly nefarious, it may cause longer loop times
for the copies than expected.
Read the ring head values once, and use the verified value in the copy
loops.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
It can be a bit hard to tell which parts of io_register_resize_rings()
are operating on shared memory, and which ones are not. And anything
reading or writing to those regions should really use the read/write
once primitives.
Hence add those, ensuring sanity in how this memory is accessed, and
helping document the shared nature of it.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
Normally the kernel would not expect an application to modify any of
the data shared with the kernel during a resize operation, but of
course the kernel cannot always assume good intent on behalf of the
application.
As part of resizing the rings, existing SQEs and CQEs are copied over
to the new storage. Resizing uses the masks in the newly allocated
shared storage to index the arrays, however it's possible that malicious
userspace could modify these after they have been sanity checked.
Use the validated and locally stored CQ and SQ ring sizing for masking
to ensure the values are both stable and valid.
Fixes: 79cfe9e59c2a ("io_uring/register: add IORING_REGISTER_RESIZE_RINGS")
Reported-by: Jann Horn <jannh@google.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|
There's at least one drive (MaxDigitalData OOS14000G) such that if it
receives a large amount of I/O while entering an idle power state will
first exit idle before responding, including causing SMART temperature
requests to be delayed.
This causes the drivetemp request to exceed its timeout of 1 second.
Signed-off-by: Russell Harmon <russ@har.mn>
Link: https://lore.kernel.org/r/20250115131340.3178988-1-russ@har.mn
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
read_domain_devices().
After commit fabb1f813ec0 ("hwmon: (acpi_power_meter) Fix fail to load
module on platform without _PMD method"),
the acpi_power_meter driver fails to load if the platform has _PMD method.
To address this, add a check for successful read_domain_devices().
Tested on Nvidia Grace machine.
Fixes: fabb1f813ec0 ("hwmon: (acpi_power_meter) Fix fail to load module on platform without _PMD method")
Signed-off-by: Kazuhiro Abe <fj1078ii@aa.jp.fujitsu.com>
Link: https://lore.kernel.org/r/20250115073532.3211000-1-fj1078ii@aa.jp.fujitsu.com
[groeck: Dropped unnecessary () from expression]
Signed-off-by: Guenter Roeck <linux@roeck-us.net>
|
|
Add missing VISACTL mux registers required for some OA
config's (e.g. RenderPipeCtrl).
Fixes: cdf02fe1a94a ("drm/xe/oa/uapi: Add/remove OA config perf ops")
Cc: stable@vger.kernel.org
Signed-off-by: Ashutosh Dixit <ashutosh.dixit@intel.com>
Reviewed-by: Umesh Nerlige Ramappa <umesh.nerlige.ramappa@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20250111021539.2920346-1-ashutosh.dixit@intel.com
(cherry picked from commit c26f22dac3449d8a687237cdfc59a6445eb8f75a)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
If ccs_mode is being modified via
/sys/class/drm/cardX/device/tileY/gtY/ccs_mode
the asynchronous reset is triggered and the write returns immediately.
With that some test receive false information about number of CCS engines
or even fail if they proceed without delay after changing the ccs_mode.
Changing the ccs_mode change from async to sync to prevent failures in
tests.
Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Fixes: f3bc5bb4d53d ("drm/xe: Allow userspace to configure CCS mode")
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211111727.1481476-3-maciej.patelczyk@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit 480fb9806e2e073532f7786166287114c696b340)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
Add synchronous version gt reset as there are few places where it
is expected.
Also add a wait helper to wait until gt reset is done.
Signed-off-by: Maciej Patelczyk <maciej.patelczyk@intel.com>
Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com>
Fixes: f3bc5bb4d53d ("drm/xe: Allow userspace to configure CCS mode")
Reviewed-by: Nirmoy Das <nirmoy.das@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241211111727.1481476-2-maciej.patelczyk@intel.com
Signed-off-by: Nirmoy Das <nirmoy.das@intel.com>
(cherry picked from commit 155c77f45f63dd58a37eeb0896b0b140ab785836)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
The guc_mmio_reg interface supports steering, but it is currently not
implemented. This will allow the GuC to control steering of MMIO
registers after save-restore and avoid reading from fused off MCR
register instances.
Fixes: 9c57bc08652a ("drm/xe/lnl: Drop force_probe requirement")
Signed-off-by: Jesus Narvaez <jesus.narvaez@intel.com>
Cc: Matt Roper <matthew.d.roper@intel.com>
Cc: Lucas De Marchi <lucas.demarchi@intel.com>
Cc: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Reviewed-by: Jonathan Cavitt <jonathan.cavitt@intel.com>
Signed-off-by: Daniele Ceraolo Spurio <daniele.ceraolospurio@intel.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20241212190100.3768068-1-jesus.narvaez@intel.com
(cherry picked from commit ee5a1321df90891d59d83b7c9d5b6c5b755d059d)
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
|
|
platform_irqchip_probe() leaks a OF node when irq_init_cb() fails. Fix it
by declaring par_np with the __free(device_node) cleanup construct.
This bug was found by an experimental static analysis tool that I am
developing.
Fixes: f8410e626569 ("irqchip: Add IRQCHIP_PLATFORM_DRIVER_BEGIN/END and IRQCHIP_MATCH helper macros")
Signed-off-by: Joe Hattori <joe@pf.is.s.u-tokyo.ac.jp>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241215033945.3414223-1-joe@pf.is.s.u-tokyo.ac.jp
|
|
Because of patch[1] the graft behaviour changed
So the command:
tcq replace parent 100:1 handle 204:
Is no longer valid and will not delete 100:4 added by command:
tcq replace parent 100:4 handle 204: pfifo_fast
So to maintain the original behaviour, this patch manually deletes 100:4
and grafts 100:1
Note: This change will also work fine without [1]
[1] https://lore.kernel.org/netdev/20250111151455.75480-1-jhs@mojatatu.com/T/#u
Signed-off-by: Victor Nogueira <victor@mojatatu.com>
Reviewed-by: Jamal Hadi Salim <jhs@mojatatu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
Some boards with Allwinner SoCs connect the PMIC's IRQ pin to the SoC's NMI
pin instead of a normal GPIO. Since the power key is connected to the PMIC,
and people expect to wake up a suspended system via this key, the NMI IRQ
controller must stay alive when the system goes into suspend.
Add the SKIP_WAKE flag to prevent the sunxi NMI controller from going to
sleep, so that the power key can wake up those systems.
[ tglx: Fixed up coding style ]
Signed-off-by: Philippe Simons <simons.philippe@gmail.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Link: https://lore.kernel.org/all/20250112123402.388520-1-simons.philippe@gmail.com
|
|
The following call-chain leads to enabling interrupts in a nested interrupt
disabled section:
irq_set_vcpu_affinity()
irq_get_desc_lock()
raw_spin_lock_irqsave() <--- Disable interrupts
its_irq_set_vcpu_affinity()
guard(raw_spinlock_irq) <--- Enables interrupts when leaving the guard()
irq_put_desc_unlock() <--- Warns because interrupts are enabled
This was broken in commit b97e8a2f7130, which replaced the original
raw_spin_[un]lock() pair with guard(raw_spinlock_irq).
Fix the issue by using guard(raw_spinlock).
[ tglx: Massaged change log ]
Fixes: b97e8a2f7130 ("irqchip/gic-v3-its: Fix potential race condition in its_vlpi_prop_update()")
Signed-off-by: Tomas Krcka <krckatom@amazon.de>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Reviewed-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241230150825.62894-1-krckatom@amazon.de
|
|
When a CPU attempts to enter low power mode, it disables the redistributor
and Group 1 interrupts and reinitializes the system registers upon wakeup.
If the transition into low power mode fails, then the CPU_PM framework
invokes the PM notifier callback with CPU_PM_ENTER_FAILED to allow the
drivers to undo the state changes.
The GIC V3 driver ignores CPU_PM_ENTER_FAILED, which leaves the GIC in
disabled state.
Handle CPU_PM_ENTER_FAILED in the same way as CPU_PM_EXIT to restore normal
operation.
[ tglx: Massage change log, add Fixes tag ]
Fixes: 3708d52fc6bb ("irqchip: gic-v3: Implement CPU PM notifier")
Signed-off-by: Yogesh Lal <quic_ylal@quicinc.com>
Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
Acked-by: Marc Zyngier <maz@kernel.org>
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/all/20241220093907.2747601-1-quic_ylal@quicinc.com
|
|
If devm_i2c_new_dummy_device() fails then we were supposed to return an
error code, but instead the function continues and will crash on the next
line. Add the missing return statement.
Fixes: 049723628716 ("drm/bridge: Add ITE IT6263 LVDS to HDMI converter")
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: Biju Das <biju.das.jz@bp.renesas.com>
Reviewed-by: Liu Ying <victor.liu@nxp.com>
Signed-off-by: Liu Ying <victor.liu@nxp.com>
Link: https://patchwork.freedesktop.org/patch/msgid/804a758b-f2e7-4116-b72d-29bc8905beed@stanley.mountain
|