summaryrefslogtreecommitdiffstats
path: root/fs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* bpf: treewide: Align kfunc signatures to prog point-of-viewDaniel Xu2024-06-121-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | Previously, kfunc declarations in bpf_kfuncs.h (and others) used "user facing" types for kfuncs prototypes while the actual kfunc definitions used "kernel facing" types. More specifically: bpf_dynptr vs bpf_dynptr_kern, __sk_buff vs sk_buff, and xdp_md vs xdp_buff. It wasn't an issue before, as the verifier allows aliased types. However, since we are now generating kfunc prototypes in vmlinux.h (in addition to keeping bpf_kfuncs.h around), this conflict creates compilation errors. Fix this conflict by using "user facing" types in kfunc definitions. This results in more casts, but otherwise has no additional runtime cost. Note, similar to 5b268d1ebcdc ("bpf: Have bpf_rdonly_cast() take a const pointer"), we also make kfuncs take const arguments where appropriate in order to make the kfunc more permissive. Signed-off-by: Daniel Xu <dxu@dxuuu.xyz> Link: https://lore.kernel.org/r/b58346a63a0e66bc9b7504da751b526b0b189a67.1718207789.git.dxu@dxuuu.xyz Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Merge tag 'for-6.10-rc2-tag' of ↵Linus Torvalds2024-06-053-0/+57
|\ | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fix from David Sterba: "A fix for fast fsync that needs to handle errors during writes after some COW failure so it does not lead to an inconsistent state" * tag 'for-6.10-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: ensure fast fsync waits for ordered extents after a write failure
| * btrfs: ensure fast fsync waits for ordered extents after a write failureFilipe Manana2024-05-283-0/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a write path in COW mode fails, either before submitting a bio for the new extents or an actual IO error happens, we can end up allowing a fast fsync to log file extent items that point to unwritten extents. This is because dropping the extent maps happens when completing ordered extents, at btrfs_finish_one_ordered(), and the completion of an ordered extent is executed in a work queue. This can result in a fast fsync to start logging file extent items based on existing extent maps before the ordered extents complete, therefore resulting in a log that has file extent items that point to unwritten extents, resulting in a corrupt file if a crash happens after and the log tree is replayed the next time the fs is mounted. This can happen for both direct IO writes and buffered writes. For example consider a direct IO write, in COW mode, that fails at btrfs_dio_submit_io() because btrfs_extract_ordered_extent() returned an error: 1) We call btrfs_finish_ordered_extent() with the 'uptodate' parameter set to false, meaning an error happened; 2) That results in marking the ordered extent with the BTRFS_ORDERED_IOERR flag; 3) btrfs_finish_ordered_extent() queues the completion of the ordered extent - so that btrfs_finish_one_ordered() will be executed later in a work queue. That function will drop extent maps in the range when it's executed, since the extent maps point to unwritten locations (signaled by the BTRFS_ORDERED_IOERR flag); 4) After calling btrfs_finish_ordered_extent() we keep going down the write path and unlock the inode; 5) After that a fast fsync starts and locks the inode; 6) Before the work queue executes btrfs_finish_one_ordered(), the fsync task sees the extent maps that point to the unwritten locations and logs file extent items based on them - it does not know they are unwritten, and the fast fsync path does not wait for ordered extents to complete, which is an intentional behaviour in order to reduce latency. For the buffered write case, here's one example: 1) A fast fsync begins, and it starts by flushing delalloc and waiting for the writeback to complete by calling filemap_fdatawait_range(); 2) Flushing the dellaloc created a new extent map X; 3) During the writeback some IO error happened, and at the end io callback (end_bbio_data_write()) we call btrfs_finish_ordered_extent(), which sets the BTRFS_ORDERED_IOERR flag in the ordered extent and queues its completion; 4) After queuing the ordered extent completion, the end io callback clears the writeback flag from all pages (or folios), and from that moment the fast fsync can proceed; 5) The fast fsync proceeds sees extent map X and logs a file extent item based on extent map X, resulting in a log that points to an unwritten data extent - because the ordered extent completion hasn't run yet, it happens only after the logging. To fix this make btrfs_finish_ordered_extent() set the inode flag BTRFS_INODE_NEEDS_FULL_SYNC in case an error happened for a COW write, so that a fast fsync will wait for ordered extent completion. Note that this issues of using extent maps that point to unwritten locations can not happen for reads, because in read paths we start by locking the extent range and wait for any ordered extents in the range to complete before looking for extent maps. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | bcachefs: Fix trans->locked assertKent Overstreet2024-06-051-0/+1
| | | | | | | | | | | | | | in bch2_move_data_btree, we might start with the trans unlocked from a previous loop iteration - we need a trans_begin() before iter_init(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Rereplicate now moves data off of durability=0 devicesKent Overstreet2024-06-051-1/+14
| | | | | | | | | | | | | | This fixes an issue where setting a device to durability=0 after it's been used makes it impossible to remove. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Fix GFP_KERNEL allocation in break_cycle()Kent Overstreet2024-06-051-0/+1
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | Merge tag '6.10-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6Linus Torvalds2024-06-015-1/+10
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull smb client fixes from Steve French: "Two small smb3 fixes: - Fix socket creation with sfu mount option (spotted by test generic/423) - Minor cleanup: fix missing description in two files" * tag '6.10-rc1-smb3-client-fixes' of git://git.samba.org/sfrench/cifs-2.6: cifs: fix creating sockets when using sfu mount options fs: smb: common: add missing MODULE_DESCRIPTION() macros
| * | cifs: fix creating sockets when using sfu mount optionsSteve French2024-05-313-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When running fstest generic/423 with sfu mount option, it was being skipped due to inability to create sockets: generic/423 [not run] cifs does not support mknod/mkfifo which can also be easily reproduced with their af_unix tool: ./src/af_unix /mnt1/socket-two bind: Operation not permitted Fix sfu mount option to allow creating and reporting sockets. Cc: stable@vger.kernel.org Signed-off-by: Steve French <stfrench@microsoft.com>
| * | fs: smb: common: add missing MODULE_DESCRIPTION() macrosJeff Johnson2024-05-272-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the 'make W=1' warnings: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/smb/common/cifs_arc4.o WARNING: modpost: missing MODULE_DESCRIPTION() in fs/smb/common/cifs_md4.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Steve French <stfrench@microsoft.com>
* | | Merge tag 'xfs-6.10-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2024-06-0111-46/+73
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Chandan Babu: - Fix a livelock by dropping an xfarray sortinfo folio when an error is encountered - During extended attribute operations, Initialize transaction reservation computation based on attribute operation code - Relax symbolic link's ondisk verification code to allow symbolic links with short remote targets - Prevent soft lockups when unmapping file ranges and also during remapping blocks during a reflink operation - Fix compilation warnings when XFS is built with W=1 option * tag 'xfs-6.10-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: Add cond_resched to block unmap range and reflink remap path xfs: don't open-code u64_to_user_ptr xfs: allow symlinks with short remote targets xfs: fix xfs_init_attr_trans not handling explicit operation codes xfs: drop xfarray sortinfo folio on error xfs: Stop using __maybe_unused in xfs_alloc.c xfs: Clear W=1 warning in xfs_iwalk_run_callbacks()
| * | | xfs: Add cond_resched to block unmap range and reflink remap pathRitesh Harjani (IBM)2024-05-272-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An async dio write to a sparse file can generate a lot of extents and when we unlink this file (using rm), the kernel can be busy in umapping and freeing those extents as part of transaction processing. Similarly xfs reflink remapping path can also iterate over a million extent entries in xfs_reflink_remap_blocks(). Since we can busy loop in these two functions, so let's add cond_resched() to avoid softlockup messages like these. watchdog: BUG: soft lockup - CPU#1 stuck for 22s! [kworker/1:0:82435] CPU: 1 PID: 82435 Comm: kworker/1:0 Tainted: G S L 6.9.0-rc5-0-default #1 Workqueue: xfs-inodegc/sda2 xfs_inodegc_worker NIP [c000000000beea10] xfs_extent_busy_trim+0x100/0x290 LR [c000000000bee958] xfs_extent_busy_trim+0x48/0x290 Call Trace: xfs_alloc_get_rec+0x54/0x1b0 (unreliable) xfs_alloc_compute_aligned+0x5c/0x144 xfs_alloc_ag_vextent_size+0x238/0x8d4 xfs_alloc_fix_freelist+0x540/0x694 xfs_free_extent_fix_freelist+0x84/0xe0 __xfs_free_extent+0x74/0x1ec xfs_extent_free_finish_item+0xcc/0x214 xfs_defer_finish_one+0x194/0x388 xfs_defer_finish_noroll+0x1b4/0x5c8 xfs_defer_finish+0x2c/0xc4 xfs_bunmapi_range+0xa4/0x100 xfs_itruncate_extents_flags+0x1b8/0x2f4 xfs_inactive_truncate+0xe0/0x124 xfs_inactive+0x30c/0x3e0 xfs_inodegc_worker+0x140/0x234 process_scheduled_works+0x240/0x57c worker_thread+0x198/0x468 kthread+0x138/0x140 start_kernel_thread+0x14/0x18 run fstests generic/175 at 2024-02-02 04:40:21 [ C17] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679] watchdog: BUG: soft lockup - CPU#17 stuck for 23s! [xfs_io:7679] CPU: 17 PID: 7679 Comm: xfs_io Kdump: loaded Tainted: G X 6.4.0 NIP [c008000005e3ec94] xfs_rmapbt_diff_two_keys+0x54/0xe0 [xfs] LR [c008000005e08798] xfs_btree_get_leaf_keys+0x110/0x1e0 [xfs] Call Trace: 0xc000000014107c00 (unreliable) __xfs_btree_updkeys+0x8c/0x2c0 [xfs] xfs_btree_update_keys+0x150/0x170 [xfs] xfs_btree_lshift+0x534/0x660 [xfs] xfs_btree_make_block_unfull+0x19c/0x240 [xfs] xfs_btree_insrec+0x4e4/0x630 [xfs] xfs_btree_insert+0x104/0x2d0 [xfs] xfs_rmap_insert+0xc4/0x260 [xfs] xfs_rmap_map_shared+0x228/0x630 [xfs] xfs_rmap_finish_one+0x2d4/0x350 [xfs] xfs_rmap_update_finish_item+0x44/0xc0 [xfs] xfs_defer_finish_noroll+0x2e4/0x740 [xfs] __xfs_trans_commit+0x1f4/0x400 [xfs] xfs_reflink_remap_extent+0x2d8/0x650 [xfs] xfs_reflink_remap_blocks+0x154/0x320 [xfs] xfs_file_remap_range+0x138/0x3a0 [xfs] do_clone_file_range+0x11c/0x2f0 vfs_clone_file_range+0x60/0x1c0 ioctl_file_clone+0x78/0x140 sys_ioctl+0x934/0x1270 system_call_exception+0x158/0x320 system_call_vectored_common+0x15c/0x2ec Cc: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Tested-by: Disha Goel<disgoel@linux.ibm.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: don't open-code u64_to_user_ptrDarrick J. Wong2024-05-272-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't open-code what the kernel already provides. Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: allow symlinks with short remote targetsDarrick J. Wong2024-05-271-4/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An internal user complained about log recovery failing on a symlink ("Bad dinode after recovery") with the following (excerpted) format: core.magic = 0x494e core.mode = 0120777 core.version = 3 core.format = 2 (extents) core.nlinkv2 = 1 core.nextents = 1 core.size = 297 core.nblocks = 1 core.naextents = 0 core.forkoff = 0 core.aformat = 2 (extents) u3.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,12,1,0] This is a symbolic link with a 297-byte target stored in a disk block, which is to say this is a symlink with a remote target. The forkoff is 0, which is to say that there's 512 - 176 == 336 bytes in the inode core to store the data fork. Eventually, testing of generic/388 failed with the same inode corruption message during inode recovery. In writing a debugging patch to call xfs_dinode_verify on dirty inode log items when we're committing transactions, I observed that xfs/298 can reproduce the problem quite quickly. xfs/298 creates a symbolic link, adds some extended attributes, then deletes them all. The test failure occurs when the final removexattr also deletes the attr fork because that does not convert the remote symlink back into a shortform symlink. That is how we trip this test. The only reason why xfs/298 only triggers with the debug patch added is that it deletes the symlink, so the final iflush shows the inode as free. I wrote a quick fstest to emulate the behavior of xfs/298, except that it leaves the symlinks on the filesystem after inducing the "corrupt" state. Kernels going back at least as far as 4.18 have written out symlink inodes in this manner and prior to 1eb70f54c445f they did not object to reading them back in. Because we've been writing out inodes this way for quite some time, the only way to fix this is to relax the check for symbolic links. Directories don't have this problem because di_size is bumped to blocksize during the sf->data conversion. Fixes: 1eb70f54c445f ("xfs: validate inode fork size against fork format") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: fix xfs_init_attr_trans not handling explicit operation codesDarrick J. Wong2024-05-273-25/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we were converting the attr code to use an explicit operation code instead of keying off of attr->value being null, we forgot to change the code that initializes the transaction reservation. Split the function into two helpers that handle the !remove and remove cases, then fix both callsites to handle this correctly. Fixes: c27411d4c640 ("xfs: make attr removal an explicit operation") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: drop xfarray sortinfo folio on errorDarrick J. Wong2024-05-271-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Chandan Babu reports the following livelock in xfs/708: run fstests xfs/708 at 2024-05-04 15:35:29 XFS (loop16): EXPERIMENTAL online scrub feature in use. Use at your own risk! XFS (loop5): Mounting V5 Filesystem e96086f0-a2f9-4424-a1d5-c75d53d823be XFS (loop5): Ending clean mount XFS (loop5): Quotacheck needed: Please wait. XFS (loop5): Quotacheck: Done. XFS (loop5): EXPERIMENTAL online scrub feature in use. Use at your own risk! INFO: task xfs_io:143725 blocked for more than 122 seconds. Not tainted 6.9.0-rc4+ #1 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:xfs_io state:D stack:0 pid:143725 tgid:143725 ppid:117661 flags:0x00004006 Call Trace: <TASK> __schedule+0x69c/0x17a0 schedule+0x74/0x1b0 io_schedule+0xc4/0x140 folio_wait_bit_common+0x254/0x650 shmem_undo_range+0x9d5/0xb40 shmem_evict_inode+0x322/0x8f0 evict+0x24e/0x560 __dentry_kill+0x17d/0x4d0 dput+0x263/0x430 __fput+0x2fc/0xaa0 task_work_run+0x132/0x210 get_signal+0x1a8/0x1910 arch_do_signal_or_restart+0x7b/0x2f0 syscall_exit_to_user_mode+0x1c2/0x200 do_syscall_64+0x72/0x170 entry_SYSCALL_64_after_hwframe+0x76/0x7e The shmem code is trying to drop all the folios attached to a shmem file and gets stuck on a locked folio after a bnobt repair. It looks like the process has a signal pending, so I started looking for places where we lock an xfile folio and then deal with a fatal signal. I found a bug in xfarray_sort_scan via code inspection. This function is called to set up the scanning phase of a quicksort operation, which may involve grabbing a locked xfile folio. If we exit the function with an error code, the caller does not call xfarray_sort_scan_done to put the xfile folio. If _sort_scan returns an error code while si->folio is set, we leak the reference and never unlock the folio. Therefore, change xfarray_sort to call _scan_done on exit. This is safe to call multiple times because it sets si->folio to NULL and ignores a NULL si->folio. Also change _sort_scan to use an intermediate variable so that we never pollute si->folio with an errptr. Fixes: 232ea052775f9 ("xfs: enable sorting of xfile-backed arrays") Reported-by: Chandan Babu R <chandanbabu@kernel.org> Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: Stop using __maybe_unused in xfs_alloc.cJohn Garry2024-05-271-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In both xfs_alloc_cur_finish() and xfs_alloc_ag_vextent_exact(), local variable @afg is tagged as __maybe_unused. Otherwise an unused variable warning would be generated for when building with W=1 and CONFIG_XFS_DEBUG unset. In both cases, the variable is unused as it is only referenced in an ASSERT() call, which is compiled out (in this config). It is generally a poor programming style to use __maybe_unused for variables. The ASSERT() call is to verify that agbno of the end of the extent is within bounds for both functions. @afg is used as an intermediate variable to find the AG length. However xfs_verify_agbext() already exists to verify a valid extent range. The arguments for calling xfs_verify_agbext() are already available, so use that instead. An advantage of using xfs_verify_agbext() is that it verifies that both the start and the end of the extent are within the bounds of the AG and catches overflows. Suggested-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * | | xfs: Clear W=1 warning in xfs_iwalk_run_callbacks()John Garry2024-05-271-3/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For CONFIG_XFS_DEBUG unset, xfs_iwalk_run_callbacks() generates the following warning for when building with W=1: fs/xfs/xfs_iwalk.c: In function ‘xfs_iwalk_run_callbacks’: fs/xfs/xfs_iwalk.c:354:42: error: variable ‘irec’ set but not used [-Werror=unused-but-set-variable] 354 | struct xfs_inobt_rec_incore *irec; | ^~~~ cc1: all warnings being treated as errors Drop @irec, as it is only an intermediate variable. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* | | Merge tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefsLinus Torvalds2024-05-3127-686/+706
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull bcachefs fixes from Kent Overstreet: "Assorted odds and ends... - two downgrade fixes - a couple snapshot deletion and repair fixes, thanks to noradtux for finding these and providing the image to debug them - a couple assert fixes - convert to folio helper, from Matthew - some improved error messages - bit of code reorganization (just moving things around); doing this while things are quiet so I'm not rebasing fixes past reorgs - don't return -EROFS on inconsistency error in recovery, this confuses util-linux and has it retry the mount - fix failure to return error on misaligned dio write; reported as an issue with coreutils shred" * tag 'bcachefs-2024-05-30' of https://evilpiepirate.org/git/bcachefs: (21 commits) bcachefs: Fix failure to return error on misaligned dio write bcachefs: Don't return -EROFS from mount on inconsistency error bcachefs: Fix uninitialized var warning bcachefs: Split out sb-errors_format.h bcachefs: Split out journal_seq_blacklist_format.h bcachefs: Split out replicas_format.h bcachefs: Split out disk_groups_format.h bcachefs: split out sb-downgrade_format.h bcachefs: split out sb-members_format.h bcachefs: Better fsck error message for key version bcachefs: btree_gc can now handle unknown btrees bcachefs: add missing MODULE_DESCRIPTION() bcachefs: Fix setting of downgrade recovery passes/errors bcachefs: Run check_key_has_snapshot in snapshot_delete_keys() bcachefs: Refactor delete_dead_snapshots() bcachefs: Fix locking assert bcachefs: Fix lookup_first_inode() when inode_generations are present bcachefs: Plumb bkey into __btree_err() bcachefs: Use copy_folio_from_iter_atomic() bcachefs: Fix sb-downgrade validation ...
| * | | bcachefs: Fix failure to return error on misaligned dio writeKent Overstreet2024-05-291-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | This was reported as an error when running coreutils shred. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Don't return -EROFS from mount on inconsistency errorKent Overstreet2024-05-291-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We were accidentally returning -EROFS during recovery on filesystem inconsistency - since this is what the journal returns on emergency shutdown. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix uninitialized var warningKent Overstreet2024-05-292-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Can't actually be used uninitialized, but gcc was being silly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Split out sb-errors_format.hKent Overstreet2024-05-283-292/+297
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Split out journal_seq_blacklist_format.hKent Overstreet2024-05-282-10/+16
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Split out replicas_format.hKent Overstreet2024-05-282-32/+36
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Split out disk_groups_format.hKent Overstreet2024-05-282-18/+22
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: split out sb-downgrade_format.hKent Overstreet2024-05-282-13/+18
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: split out sb-members_format.hKent Overstreet2024-05-282-101/+111
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Better fsck error message for key versionKent Overstreet2024-05-281-4/+5
| | | | | | | | | | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: btree_gc can now handle unknown btreesKent Overstreet2024-05-285-73/+55
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Compatibility fix - we no longer have a separate table for which order gc walks btrees in, and special case the stripes btree directly. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: add missing MODULE_DESCRIPTION()Jeff Johnson2024-05-281-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in fs/bcachefs/mean_and_variance_test.o Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix setting of downgrade recovery passes/errorsKent Overstreet2024-05-281-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bch2_check_version_downgrade() was setting c->sb.version, which bch2_sb_set_downgrade() expects to be at the previous version; and it shouldn't even have been set directly because c->sb.version is updated by write_super(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Run check_key_has_snapshot in snapshot_delete_keys()Kent Overstreet2024-05-283-24/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | delete_dead_snapshots now runs before the main fsck.c passes which check for keys for invalid snapshots; thus, it needs those checks as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Refactor delete_dead_snapshots()Kent Overstreet2024-05-281-40/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Consolidate per-key work into delete_dead_snapshots_process_key(), so we now walk all keys once, not twice. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix locking assertKent Overstreet2024-05-281-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We now track whether a transaction is locked, and verify that we don't have nodes locked when the transaction isn't locked; reorder relocks to not pop the new assert. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix lookup_first_inode() when inode_generations are presentKent Overstreet2024-05-281-14/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is used for finding the hash seed (which is the same in all versions of an inode in different snapshots): ff an inode has been deleted in a child snapshot we need to iterate until we find a live version. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Plumb bkey into __btree_err()Kent Overstreet2024-05-281-40/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It can be useful to know the exact byte offset within a btree node where an error occured. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Use copy_folio_from_iter_atomic()Matthew Wilcox (Oracle)2024-05-271-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | copy_page_from_iter_atomic() will be removed at some point. Also fixup a comment for folios. Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix sb-downgrade validationKent Overstreet2024-05-261-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Superblock downgrade entries are only two byte aligned, but section sizes are 8 byte aligned, which means we have to be careful about overrun checks; an entry that crosses the end of the section is allowed (and ignored) as long as it has zero errors. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * | | bcachefs: Fix debug assertKent Overstreet2024-05-261-1/+1
| | | | | | | | | | | | | | | | | | | | Reported-by: syzbot+a8074a75b8d73328751e@syzkaller.appspotmail.com Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | | | Revert "vfs: Delete the associated dentry when deleting a file"Linus Torvalds2024-05-291-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 681ce8623567ba7e7333908e9826b77145312dda. We gave it a try, but it turns out the kernel test robot did in fact find performance regressions for it, so we'll have to look at the more involved alternative fixes for Yafang Shao's Elasticsearch load issue. There were several alternatives discussed, they just weren't as simple as this first attempt. The report is of a -7.4% regression of filebench.sum_operations/s, which appears significant enough to trigger my "this patch may get reverted if somebody finds a performance regression on some other load" rule. So it's still the case that we should end up deleting dentries more aggressively - or just be better at pruning them later - but it needs a bit more finesse than this simple thing. Link: https://lore.kernel.org/all/202405291318.4dfbb352-oliver.sang@intel.com/ Cc: Yafang Shao <laoar.shao@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag '9p-for-6.10-rc2' of https://github.com/martinetd/linuxLinus Torvalds2024-05-291-2/+7
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull 9p fixes from Dominique Martinet: "Two fixes headed to stable trees: - a trace event was dumping uninitialized values - a missing lock that was thought to have exclusive access, and it turned out not to" * tag '9p-for-6.10-rc2' of https://github.com/martinetd/linux: 9p: add missing locking around taking dentry fid list net/9p: fix uninit-value in p9_client_rpc()
| * | | | 9p: add missing locking around taking dentry fid listDominique Martinet2024-05-231-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a use-after-free on dentry's d_fsdata fid list when a thread looks up a fid through dentry while another thread unlinks it: UAF thread: refcount_t: addition on 0; use-after-free. p9_fid_get linux/./include/net/9p/client.h:262 v9fs_fid_find+0x236/0x280 linux/fs/9p/fid.c:129 v9fs_fid_lookup_with_uid linux/fs/9p/fid.c:181 v9fs_fid_lookup+0xbf/0xc20 linux/fs/9p/fid.c:314 v9fs_vfs_getattr_dotl+0xf9/0x360 linux/fs/9p/vfs_inode_dotl.c:400 vfs_statx+0xdd/0x4d0 linux/fs/stat.c:248 Freed by: p9_fid_destroy (inlined) p9_client_clunk+0xb0/0xe0 linux/net/9p/client.c:1456 p9_fid_put linux/./include/net/9p/client.h:278 v9fs_dentry_release+0xb5/0x140 linux/fs/9p/vfs_dentry.c:55 v9fs_remove+0x38f/0x620 linux/fs/9p/vfs_inode.c:518 vfs_unlink+0x29a/0x810 linux/fs/namei.c:4335 The problem is that d_fsdata was not accessed under d_lock, because d_release() normally is only called once the dentry is otherwise no longer accessible but since we also call it explicitly in v9fs_remove that lock is required: move the hlist out of the dentry under lock then unref its fids once they are no longer accessible. Fixes: 154372e67d40 ("fs/9p: fix create-unlink-getattr idiom") Cc: stable@vger.kernel.org Reported-by: Meysam Firouzi Reported-by: Amirmohammad Eftekhar Reviewed-by: Christian Schoenebeck <linux_oss@crudebyte.com> Message-ID: <20240521122947.1080227-1-asmadeus@codewreck.org> Signed-off-by: Dominique Martinet <asmadeus@codewreck.org>
* | | | | Merge tag 'vfs-6.10-rc2.fixes' of ↵Linus Torvalds2024-05-2711-13/+28
|\ \ \ \ \ | |_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: - Fix io_uring based write-through after converting cifs to use the netfs library - Fix aio error handling when doing write-through via netfs library - Fix performance regression in iomap when used with non-large folio mappings - Fix signalfd error code - Remove obsolete comment in signalfd code - Fix async request indication in netfs_perform_write() by raising BDP_ASYNC when IOCB_NOWAIT is set - Yield swap device immediately to prevent spurious EBUSY errors - Don't cross a .backup mountpoint from backup volumes in afs to avoid infinite loops - Fix a race between umount and async request completion in 9p after 9p was converted to use the netfs library * tag 'vfs-6.10-rc2.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: netfs, 9p: Fix race between umount and async request completion afs: Don't cross .backup mountpoint from backup volume swap: yield device immediately netfs: Fix setting of BDP_ASYNC from iocb flags signalfd: drop an obsolete comment signalfd: fix error return code iomap: fault in smaller chunks for non-large folio mappings filemap: add helper mapping_max_folio_size() netfs: Fix AIO error handling when doing write-through netfs: Fix io_uring based write-through
| * | | | netfs, 9p: Fix race between umount and async request completionDavid Howells2024-05-274-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a problem in 9p's interaction with netfslib whereby a crash occurs because the 9p_fid structs get forcibly destroyed during client teardown (without paying attention to their refcounts) before netfslib has finished with them. However, it's not a simple case of deferring the clunking that p9_fid_put() does as that requires the p9_client record to still be present. The problem is that netfslib has to unlock pages and clear the IN_PROGRESS flag before destroying the objects involved - including the fid - and, in any case, nothing checks to see if writeback completed barring looking at the page flags. Fix this by keeping a count of outstanding I/O requests (of any type) and waiting for it to quiesce during inode eviction. Reported-by: syzbot+df038d463cca332e8414@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/0000000000005be0aa061846f8d6@google.com/ Reported-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000b86c5e06130da9c6@google.com/ Reported-by: syzbot+1527696d41a634cc1819@syzkaller.appspotmail.com Link: https://lore.kernel.org/all/000000000000041f960618206d7e@google.com/ Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/755891.1716560771@warthog.procyon.org.uk Tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Reviewed-by: Dominique Martinet <asmadeus@codewreck.org> cc: Eric Van Hensbergen <ericvh@kernel.org> cc: Latchesar Ionkov <lucho@ionkov.net> cc: Christian Schoenebeck <linux_oss@crudebyte.com> cc: Jeff Layton <jlayton@kernel.org> cc: Steve French <sfrench@samba.org> cc: Hillf Danton <hdanton@sina.com> cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: netfs@lists.linux.dev cc: linux-fsdevel@vger.kernel.org Reported-and-tested-by: syzbot+d7c7a495a5e466c031b6@syzkaller.appspotmail.com Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | afs: Don't cross .backup mountpoint from backup volumeMarc Dionne2024-05-251-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't cross a mountpoint that explicitly specifies a backup volume (target is <vol>.backup) when starting from a backup volume. It it not uncommon to mount a volume's backup directly in the volume itself. This can cause tools that are not paying attention to get into a loop mounting the volume onto itself as they attempt to traverse the tree, leading to a variety of problems. This doesn't prevent the general case of loops in a sequence of mountpoints, but addresses a common special case in the same way as other afs clients. Reported-by: Jan Henrik Sylvester <jan.henrik.sylvester@uni-hamburg.de> Link: http://lists.infradead.org/pipermail/linux-afs/2024-May/008454.html Reported-by: Markus Suvanto <markus.suvanto@gmail.com> Link: http://lists.infradead.org/pipermail/linux-afs/2024-February/008074.html Signed-off-by: Marc Dionne <marc.dionne@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/768760.1716567475@warthog.procyon.org.uk Reviewed-by: Jeffrey Altman <jaltman@auristor.com> cc: linux-afs@lists.infradead.org Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | netfs: Fix setting of BDP_ASYNC from iocb flagsDavid Howells2024-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix netfs_perform_write() to set BDP_ASYNC if IOCB_NOWAIT is set rather than if IOCB_SYNC is not set. It reflects asynchronicity in the sense of not waiting rather than synchronicity in the sense of not returning until the op is complete. Without this, generic/590 fails on cifs in strict caching mode with a complaint that one of the writes fails with EAGAIN. The test can be distilled down to: mount -t cifs /my/share /mnt -ostuff xfs_io -i -c 'falloc 0 8191M -c fsync -f /mnt/file xfs_io -i -c 'pwrite -b 1M -W 0 8191M' /mnt/file Fixes: c38f4e96e605 ("netfs: Provide func to copy data to pagecache for buffered write") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/316306.1716306586@warthog.procyon.org.uk Reviewed-by: Jens Axboe <axboe@kernel.dk> cc: Jeff Layton <jlayton@kernel.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | signalfd: drop an obsolete commentFedor Pchelkin2024-05-241-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit fbe38120eb1d ("signalfd: convert to ->read_iter()") removed the call to anon_inode_getfd() by splitting fd setup into two parts. Drop the comment referencing the internal details of that function. Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Link: https://lore.kernel.org/r/20240520090819.76342-2-pchelkin@ispras.ru Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | signalfd: fix error return codeFedor Pchelkin2024-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If anon_inode_getfile() fails, return appropriate error code. This looks like a single typo: the similar code changes in timerfd and userfaultfd are okay. Found by Linux Verification Center (linuxtesting.org). Fixes: fbe38120eb1d ("signalfd: convert to ->read_iter()") Signed-off-by: Fedor Pchelkin <pchelkin@ispras.ru> Link: https://lore.kernel.org/r/20240520090819.76342-1-pchelkin@ispras.ru Reviewed-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | iomap: fault in smaller chunks for non-large folio mappingsXu Yang2024-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit (5d8edfb900d5 "iomap: Copy larger chunks from userspace"), iomap will try to copy in larger chunks than PAGE_SIZE. However, if the mapping doesn't support large folio, only one page of maximum 4KB will be created and 4KB data will be writen to pagecache each time. Then, next 4KB will be handled in next iteration. This will cause potential write performance problem. If chunk is 2MB, total 512 pages need to be handled finally. During this period, fault_in_iov_iter_readable() is called to check iov_iter readable validity. Since only 4KB will be handled each time, below address space will be checked over and over again: start end - buf, buf+2MB buf+4KB, buf+2MB buf+8KB, buf+2MB ... buf+2044KB buf+2MB Obviously the checking size is wrong since only 4KB will be handled each time. So this will get a correct chunk to let iomap work well in non-large folio case. With this change, the write speed will be stable. Tested on ARM64 device. Before: - dd if=/dev/zero of=/dev/sda bs=400K count=10485 (334 MB/s) - dd if=/dev/zero of=/dev/sda bs=800K count=5242 (278 MB/s) - dd if=/dev/zero of=/dev/sda bs=1600K count=2621 (204 MB/s) - dd if=/dev/zero of=/dev/sda bs=2200K count=1906 (170 MB/s) - dd if=/dev/zero of=/dev/sda bs=3000K count=1398 (150 MB/s) - dd if=/dev/zero of=/dev/sda bs=4500K count=932 (139 MB/s) After: - dd if=/dev/zero of=/dev/sda bs=400K count=10485 (339 MB/s) - dd if=/dev/zero of=/dev/sda bs=800K count=5242 (330 MB/s) - dd if=/dev/zero of=/dev/sda bs=1600K count=2621 (332 MB/s) - dd if=/dev/zero of=/dev/sda bs=2200K count=1906 (333 MB/s) - dd if=/dev/zero of=/dev/sda bs=3000K count=1398 (333 MB/s) - dd if=/dev/zero of=/dev/sda bs=4500K count=932 (333 MB/s) Fixes: 5d8edfb900d5 ("iomap: Copy larger chunks from userspace") Cc: stable@vger.kernel.org Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Link: https://lore.kernel.org/r/20240521114939.2541461-2-xu.yang_2@nxp.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | | netfs: Fix AIO error handling when doing write-throughDavid Howells2024-05-241-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an error occurs whilst we're doing an AIO write in write-through mode, we may end up calling ->ki_complete() *and* returning an error from ->write_iter(). This can result in either a UAF (the ->ki_complete() func pointer may get overwritten, for example) or a refcount underflow in io_submit() as ->ki_complete is called twice. Fix this by making netfs_end_writethrough() - and thus netfs_perform_write() - unconditionally return -EIOCBQUEUED if we're doing an AIO write and wait for completion if we're not. Fixes: 288ace2f57c9 ("netfs: New writeback implementation") Signed-off-by: David Howells <dhowells@redhat.com> Link: https://lore.kernel.org/r/295052.1716298587@warthog.procyon.org.uk cc: Jeff Layton <jlayton@kernel.org> cc: Enzo Matsumiya <ematsumiya@suse.de> cc: netfs@lists.linux.dev cc: v9fs@lists.linux.dev cc: linux-afs@lists.infradead.org cc: linux-cifs@vger.kernel.org cc: linux-fsdevel@vger.kernel.org Signed-off-by: Christian Brauner <brauner@kernel.org>