summaryrefslogtreecommitdiffstats
path: root/block (follow)
Commit message (Collapse)AuthorAgeFilesLines
* block: avoid to reuse `hctx` not removed from cpuhp callback listMing Lei2024-12-181-1/+10
| | | | | | | | | | | | | If the 'hctx' isn't removed from cpuhp callback list, we can't reuse it, otherwise use-after-free may be triggered. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412172217.b906db7c-lkp@intel.com Tested-by: kernel test robot <oliver.sang@intel.com> Fixes: 22465bbac53c ("blk-mq: move cpuhp callback registering out of q->sysfs_lock") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241218101617.3275704-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Revert "block: Fix potential deadlock while freezing queue and ↵Ming Lei2024-12-183-26/+23
| | | | | | | | | | | | | | | | | | | acquiring sysfs_lock" This reverts commit be26ba96421ab0a8fa2055ccf7db7832a13c44d2. Commit be26ba96421a ("block: Fix potential deadlock while freezing queue and acquiring sysfs_loc") actually reverts commit 22465bbac53c ("blk-mq: move cpuhp callback registering out of q->sysfs_lock"), and causes the original resctrl lockdep warning. So revert it and we need to fix the issue in another way. Cc: Nilay Shroff <nilay@linux.ibm.com> Fixes: be26ba96421a ("block: Fix potential deadlock while freezing queue and acquiring sysfs_loc") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241218101617.3275704-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block/bdev: use helper for max block size checkLuis Chamberlain2024-12-181-2/+1
| | | | | | | | | | | | | | We already have a helper for checking the limits on the block size both low and high, just use that. No functional changes. Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Luis Chamberlain <mcgrof@kernel.org> Reviewed-by: Keith Busch <kbusch@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241218020212.3657139-2-mcgrof@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Fix potential deadlock while freezing queue and acquiring sysfs_lockNilay Shroff2024-12-133-23/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For storing a value to a queue attribute, the queue_attr_store function first freezes the queue (->q_usage_counter(io)) and then acquire ->sysfs_lock. This seems not correct as the usual ordering should be to acquire ->sysfs_lock before freezing the queue. This incorrect ordering causes the following lockdep splat which we are able to reproduce always simply by accessing /sys/kernel/debug file using ls command: [ 57.597146] WARNING: possible circular locking dependency detected [ 57.597154] 6.12.0-10553-gb86545e02e8c #20 Tainted: G W [ 57.597162] ------------------------------------------------------ [ 57.597168] ls/4605 is trying to acquire lock: [ 57.597176] c00000003eb56710 (&mm->mmap_lock){++++}-{4:4}, at: __might_fault+0x58/0xc0 [ 57.597200] but task is already holding lock: [ 57.597207] c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 [ 57.597226] which lock already depends on the new lock. [ 57.597233] the existing dependency chain (in reverse order) is: [ 57.597241] -> #5 (&sb->s_type->i_mutex_key#3){++++}-{4:4}: [ 57.597255] down_write+0x6c/0x18c [ 57.597264] start_creating+0xb4/0x24c [ 57.597274] debugfs_create_dir+0x2c/0x1e8 [ 57.597283] blk_register_queue+0xec/0x294 [ 57.597292] add_disk_fwnode+0x2e4/0x548 [ 57.597302] brd_alloc+0x2c8/0x338 [ 57.597309] brd_init+0x100/0x178 [ 57.597317] do_one_initcall+0x88/0x3e4 [ 57.597326] kernel_init_freeable+0x3cc/0x6e0 [ 57.597334] kernel_init+0x34/0x1cc [ 57.597342] ret_from_kernel_user_thread+0x14/0x1c [ 57.597350] -> #4 (&q->debugfs_mutex){+.+.}-{4:4}: [ 57.597362] __mutex_lock+0xfc/0x12a0 [ 57.597370] blk_register_queue+0xd4/0x294 [ 57.597379] add_disk_fwnode+0x2e4/0x548 [ 57.597388] brd_alloc+0x2c8/0x338 [ 57.597395] brd_init+0x100/0x178 [ 57.597402] do_one_initcall+0x88/0x3e4 [ 57.597410] kernel_init_freeable+0x3cc/0x6e0 [ 57.597418] kernel_init+0x34/0x1cc [ 57.597426] ret_from_kernel_user_thread+0x14/0x1c [ 57.597434] -> #3 (&q->sysfs_lock){+.+.}-{4:4}: [ 57.597446] __mutex_lock+0xfc/0x12a0 [ 57.597454] queue_attr_store+0x9c/0x110 [ 57.597462] sysfs_kf_write+0x70/0xb0 [ 57.597471] kernfs_fop_write_iter+0x1b0/0x2ac [ 57.597480] vfs_write+0x3dc/0x6e8 [ 57.597488] ksys_write+0x84/0x140 [ 57.597495] system_call_exception+0x130/0x360 [ 57.597504] system_call_common+0x160/0x2c4 [ 57.597516] -> #2 (&q->q_usage_counter(io)#21){++++}-{0:0}: [ 57.597530] __submit_bio+0x5ec/0x828 [ 57.597538] submit_bio_noacct_nocheck+0x1e4/0x4f0 [ 57.597547] iomap_readahead+0x2a0/0x448 [ 57.597556] xfs_vm_readahead+0x28/0x3c [ 57.597564] read_pages+0x88/0x41c [ 57.597571] page_cache_ra_unbounded+0x1ac/0x2d8 [ 57.597580] filemap_get_pages+0x188/0x984 [ 57.597588] filemap_read+0x13c/0x4bc [ 57.597596] xfs_file_buffered_read+0x88/0x17c [ 57.597605] xfs_file_read_iter+0xac/0x158 [ 57.597614] vfs_read+0x2d4/0x3b4 [ 57.597622] ksys_read+0x84/0x144 [ 57.597629] system_call_exception+0x130/0x360 [ 57.597637] system_call_common+0x160/0x2c4 [ 57.597647] -> #1 (mapping.invalidate_lock#2){++++}-{4:4}: [ 57.597661] down_read+0x6c/0x220 [ 57.597669] filemap_fault+0x870/0x100c [ 57.597677] xfs_filemap_fault+0xc4/0x18c [ 57.597684] __do_fault+0x64/0x164 [ 57.597693] __handle_mm_fault+0x1274/0x1dac [ 57.597702] handle_mm_fault+0x248/0x484 [ 57.597711] ___do_page_fault+0x428/0xc0c [ 57.597719] hash__do_page_fault+0x30/0x68 [ 57.597727] do_hash_fault+0x90/0x35c [ 57.597736] data_access_common_virt+0x210/0x220 [ 57.597745] _copy_from_user+0xf8/0x19c [ 57.597754] sel_write_load+0x178/0xd54 [ 57.597762] vfs_write+0x108/0x6e8 [ 57.597769] ksys_write+0x84/0x140 [ 57.597777] system_call_exception+0x130/0x360 [ 57.597785] system_call_common+0x160/0x2c4 [ 57.597794] -> #0 (&mm->mmap_lock){++++}-{4:4}: [ 57.597806] __lock_acquire+0x17cc/0x2330 [ 57.597814] lock_acquire+0x138/0x400 [ 57.597822] __might_fault+0x7c/0xc0 [ 57.597830] filldir64+0xe8/0x390 [ 57.597839] dcache_readdir+0x80/0x2d4 [ 57.597846] iterate_dir+0xd8/0x1d4 [ 57.597855] sys_getdents64+0x88/0x2d4 [ 57.597864] system_call_exception+0x130/0x360 [ 57.597872] system_call_common+0x160/0x2c4 [ 57.597881] other info that might help us debug this: [ 57.597888] Chain exists of: &mm->mmap_lock --> &q->debugfs_mutex --> &sb->s_type->i_mutex_key#3 [ 57.597905] Possible unsafe locking scenario: [ 57.597911] CPU0 CPU1 [ 57.597917] ---- ---- [ 57.597922] rlock(&sb->s_type->i_mutex_key#3); [ 57.597932] lock(&q->debugfs_mutex); [ 57.597940] lock(&sb->s_type->i_mutex_key#3); [ 57.597950] rlock(&mm->mmap_lock); [ 57.597958] *** DEADLOCK *** [ 57.597965] 2 locks held by ls/4605: [ 57.597971] #0: c0000000137c12f8 (&f->f_pos_lock){+.+.}-{4:4}, at: fdget_pos+0xcc/0x154 [ 57.597989] #1: c0000018e27c6810 (&sb->s_type->i_mutex_key#3){++++}-{4:4}, at: iterate_dir+0x94/0x1d4 Prevent the above lockdep warning by acquiring ->sysfs_lock before freezing the queue while storing a queue attribute in queue_attr_store function. Later, we also found[1] another function __blk_mq_update_nr_ hw_queues where we first freeze queue and then acquire the ->sysfs_lock. So we've also updated lock ordering in __blk_mq_update_nr_hw_queues function and ensured that in all code paths we follow the correct lock ordering i.e. acquire ->sysfs_lock before freezing the queue. [1] https://lore.kernel.org/all/CAFj5m9Ke8+EHKQBs_Nk6hqd=LGXtk4mUxZUN5==ZcCjnZSBwHw@mail.gmail.com/ Reported-by: kjain@linux.ibm.com Fixes: af2814149883 ("block: freeze the queue in queue_attr_store") Tested-by: kjain@linux.ibm.com Cc: hch@lst.de Cc: axboe@kernel.dk Cc: ritesh.list@gmail.com Cc: ming.lei@redhat.com Cc: gjoyce@linux.ibm.com Signed-off-by: Nilay Shroff <nilay@linux.ibm.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241210144222.1066229-1-nilay@linux.ibm.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Fix queue_iostats_passthrough_show()Bart Van Assche2024-12-131-1/+1
| | | | | | | | | | | | | | | | Make queue_iostats_passthrough_show() report 0/1 in sysfs instead of 0/4. This patch fixes the following sparse warning: block/blk-sysfs.c:266:31: warning: incorrect type in argument 1 (different base types) block/blk-sysfs.c:266:31: expected unsigned long var block/blk-sysfs.c:266:31: got restricted blk_flags_t Cc: Keith Busch <kbusch@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Fixes: 110234da18ab ("block: enable passthrough command statistics") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241212212941.1268662-4-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: Clean up blk_mq_requeue_work()Bart Van Assche2024-12-131-6/+4
| | | | | | | | | | | | | | | Move a statement that occurs in both branches of an if-statement in front of the if-statement. Fix a typo in a source code comment. No functionality has been changed. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-3-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* mq-deadline: Remove a local variableBart Van Assche2024-12-131-4/+1
| | | | | | | | | | | | | | | Since commit fde02699c242 ("block: mq-deadline: Remove support for zone write locking"), the local variable 'insert_before' is assigned once and is used once. Hence remove this local variable. Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Cc: Christoph Hellwig <hch@lst.de> Signed-off-by: Bart Van Assche <bvanassche@acm.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Nitesh Shetty <nj.shetty@samsung.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241212212941.1268662-2-bvanassche@acm.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-iocost: Avoid using clamp() on inuse in __propagate_weights()Nathan Chancellor2024-12-121-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a recent change to clamp() and its variants [1] that increases the coverage of the check that high is greater than low because it can be done through inlining, certain build configurations (such as s390 defconfig) fail to build with clang with: block/blk-iocost.c:1101:11: error: call to '__compiletime_assert_557' declared with 'error' attribute: clamp() low limit 1 greater than high limit active 1101 | inuse = clamp_t(u32, inuse, 1, active); | ^ include/linux/minmax.h:218:36: note: expanded from macro 'clamp_t' 218 | #define clamp_t(type, val, lo, hi) __careful_clamp(type, val, lo, hi) | ^ include/linux/minmax.h:195:2: note: expanded from macro '__careful_clamp' 195 | __clamp_once(type, val, lo, hi, __UNIQUE_ID(v_), __UNIQUE_ID(l_), __UNIQUE_ID(h_)) | ^ include/linux/minmax.h:188:2: note: expanded from macro '__clamp_once' 188 | BUILD_BUG_ON_MSG(statically_true(ulo > uhi), \ | ^ __propagate_weights() is called with an active value of zero in ioc_check_iocgs(), which results in the high value being less than the low value, which is undefined because the value returned depends on the order of the comparisons. The purpose of this expression is to ensure inuse is not more than active and at least 1. This could be written more simply with a ternary expression that uses min(inuse, active) as the condition so that the value of that condition can be used if it is not zero and one if it is. Do this conversion to resolve the error and add a comment to deter people from turning this back into clamp(). Fixes: 7caa47151ab2 ("blkcg: implement blk-iocost") Link: https://lore.kernel.org/r/34d53778977747f19cce2abb287bb3e6@AcuMS.aculab.com/ [1] Suggested-by: David Laight <david.laight@aculab.com> Reported-by: Linux Kernel Functional Testing <lkft@linaro.org> Closes: https://lore.kernel.org/llvm/CA+G9fYsD7mw13wredcZn0L-KBA3yeoVSTuxnss-AEWMN3ha0cA@mail.gmail.com/ Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202412120322.3GfVe3vF-lkp@intel.com/ Signed-off-by: Nathan Chancellor <nathan@kernel.org> Acked-by: Tejun Heo <tj@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Make bio_iov_bvec_set() accept pointer to const iov_iterJohn Garry2024-12-122-2/+2
| | | | | | | | | | | Make bio_iov_bvec_set() accept a pointer to const iov_iter, which means that we can drop the undesirable casting to struct iov_iter pointer in blk_rq_map_user_bvec(). Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241202115727.2320401-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: get wp_offset by bdev_offset_from_zone_startLongPing Wei2024-12-101-1/+1
| | | | | | | | | | | Call bdev_offset_from_zone_start() instead of open-coding it. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Signed-off-by: LongPing Wei <weilongping@oppo.com> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241107020439.1644577-1-weilongping@oppo.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-cgroup: Fix UAF in blkcg_unpin_online()Tejun Heo2024-12-101-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To walk up, it uses blkcg_parent(blkcg) but it was calling that after blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the following UAF: ================================================================== BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270 Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117 CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022 Workqueue: cgwb_release cgwb_release_workfn Call Trace: <TASK> dump_stack_lvl+0x27/0x80 print_report+0x151/0x710 kasan_report+0xc0/0x100 blkcg_unpin_online+0x15a/0x270 cgwb_release_workfn+0x194/0x480 process_scheduled_works+0x71b/0xe20 worker_thread+0x82a/0xbd0 kthread+0x242/0x2c0 ret_from_fork+0x33/0x70 ret_from_fork_asm+0x1a/0x30 </TASK> ... Freed by task 1944: kasan_save_track+0x2b/0x70 kasan_save_free_info+0x3c/0x50 __kasan_slab_free+0x33/0x50 kfree+0x10c/0x330 css_free_rwork_fn+0xe6/0xb30 process_scheduled_works+0x71b/0xe20 worker_thread+0x82a/0xbd0 kthread+0x242/0x2c0 ret_from_fork+0x33/0x70 ret_from_fork_asm+0x1a/0x30 Note that the UAF is not easy to trigger as the free path is indirected behind a couple RCU grace periods and a work item execution. I could only trigger it with artifical msleep() injected in blkcg_unpin_online(). Fix it by reading the parent pointer before destroying the blkcg's blkg's. Signed-off-by: Tejun Heo <tj@kernel.org> Reported-by: Abagail ren <renzezhongucas@gmail.com> Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org> Fixes: 4308a434e5e0 ("blkcg: don't offline parent blkcg first") Cc: stable@vger.kernel.org # v5.7+ Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Prevent potential deadlocks in zone write plug error recoveryDamien Le Moal2024-12-101-247/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Zone write plugging for handling writes to zones of a zoned block device always execute a zone report whenever a write BIO to a zone fails. The intent of this is to ensure that the tracking of a zone write pointer is always correct to ensure that the alignment to a zone write pointer of write BIOs can be checked on submission and that we can always correctly emulate zone append operations using regular write BIOs. However, this error recovery scheme introduces a potential deadlock if a device queue freeze is initiated while BIOs are still plugged in a zone write plug and one of these write operation fails. In such case, the disk zone write plug error recovery work is scheduled and executes a report zone. This in turn can result in a request allocation in the underlying driver to issue the report zones command to the device. But with the device queue freeze already started, this allocation will block, preventing the report zone execution and the continuation of the processing of the plugged BIOs. As plugged BIOs hold a queue usage reference, the queue freeze itself will never complete, resulting in a deadlock. Avoid this problem by completely removing from the zone write plugging code the use of report zones operations after a failed write operation, instead relying on the device user to either execute a report zones, reset the zone, finish the zone, or give up writing to the device (which is a fairly common pattern for file systems which degrade to read-only after write failures). This is not an unreasonnable requirement as all well-behaved applications, FSes and device mapper already use report zones to recover from write errors whenever possible by comparing the current position of a zone write pointer with what their assumption about the position is. The changes to remove the automatic error recovery are as follows: - Completely remove the error recovery work and its associated resources (zone write plug list head, disk error list, and disk zone_wplugs_work work struct). This also removes the functions disk_zone_wplug_set_error() and disk_zone_wplug_clear_error(). - Change the BLK_ZONE_WPLUG_ERROR zone write plug flag into BLK_ZONE_WPLUG_NEED_WP_UPDATE. This new flag is set for a zone write plug whenever a write opration targetting the zone of the zone write plug fails. This flag indicates that the zone write pointer offset is not reliable and that it must be updated when the next report zone, reset zone, finish zone or disk revalidation is executed. - Modify blk_zone_write_plug_bio_endio() to set the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag for the target zone of a failed write BIO. - Modify the function disk_zone_wplug_set_wp_offset() to clear this new flag, thus implementing recovery of a correct write pointer offset with the reset (all) zone and finish zone operations. - Modify blkdev_report_zones() to always use the disk_report_zones_cb() callback so that disk_zone_wplug_sync_wp_offset() can be called for any zone marked with the BLK_ZONE_WPLUG_NEED_WP_UPDATE flag. This implements recovery of a correct write pointer offset for zone write plugs marked with BLK_ZONE_WPLUG_NEED_WP_UPDATE and within the range of the report zones operation executed by the user. - Modify blk_revalidate_seq_zone() to call disk_zone_wplug_sync_wp_offset() for all sequential write required zones when a zoned block device is revalidated, thus always resolving any inconsistency between the write pointer offset of zone write plugs and the actual write pointer position of sequential zones. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-5-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* dm: Fix dm-zoned-reclaim zone write pointer alignmentDamien Le Moal2024-12-101-23/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The zone reclaim processing of the dm-zoned device mapper uses blkdev_issue_zeroout() to align the write pointer of a zone being used for reclaiming another zone, to write the valid data blocks from the zone being reclaimed at the same position relative to the zone start in the reclaim target zone. The first call to blkdev_issue_zeroout() will try to use hardware offload using a REQ_OP_WRITE_ZEROES operation if the device reports a non-zero max_write_zeroes_sectors queue limit. If this operation fails because of the lack of hardware support, blkdev_issue_zeroout() falls back to using a regular write operation with the zero-page as buffer. Currently, such REQ_OP_WRITE_ZEROES failure is automatically handled by the block layer zone write plugging code which will execute a report zones operation to ensure that the write pointer of the target zone of the failed operation has not changed and to "rewind" the zone write pointer offset of the target zone as it was advanced when the write zero operation was submitted. So the REQ_OP_WRITE_ZEROES failure does not cause any issue and blkdev_issue_zeroout() works as expected. However, since the automatic recovery of zone write pointers by the zone write plugging code can potentially cause deadlocks with queue freeze operations, a different recovery must be implemented in preparation for the removal of zone write plugging report zones based recovery. Do this by introducing the new function blk_zone_issue_zeroout(). This function first calls blkdev_issue_zeroout() with the flag BLKDEV_ZERO_NOFALLBACK to intercept failures on the first execution which attempt to use the device hardware offload with the REQ_OP_WRITE_ZEROES operation. If this attempt fails, a report zone operation is issued to restore the zone write pointer offset of the target zone to the correct position and blkdev_issue_zeroout() is called again without the BLKDEV_ZERO_NOFALLBACK flag. The report zones operation performing this recovery is implemented using the helper function disk_zone_sync_wp_offset() which calls the gendisk report_zones file operation with the callback disk_report_zones_cb(). This callback updates the target write pointer offset of the target zone using the new function disk_zone_wplug_sync_wp_offset(). dmz_reclaim_align_wp() is modified to change its call to blkdev_issue_zeroout() to a call to blk_zone_issue_zeroout() without any other change needed as the two functions are functionnally equivalent. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Acked-by: Mike Snitzer <snitzer@kernel.org> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-4-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Ignore REQ_NOWAIT for zone reset and zone finish operationsDamien Le Moal2024-12-101-0/+9
| | | | | | | | | | | | | | | | | | There are currently any issuer of REQ_OP_ZONE_RESET and REQ_OP_ZONE_FINISH operations that set REQ_NOWAIT. However, as we cannot handle this flag correctly due to the potential request allocation failure that may happen in blk_mq_submit_bio() after blk_zone_plug_bio() has handled the zone write plug write pointer updates for the targeted zones, modify blk_zone_wplug_handle_reset_or_finish() to warn if this flag is set and ignore it. Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* block: Use a zone write plug BIO work for REQ_NOWAIT BIOsDamien Le Moal2024-12-101-20/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For zoned block devices, a write BIO issued to a zone that has no on-going writes will be prepared for execution and allowed to execute immediately by blk_zone_wplug_handle_write() (called from blk_zone_plug_bio()). However, if this BIO specifies REQ_NOWAIT, the allocation of a request for its execution in blk_mq_submit_bio() may fail after blk_zone_plug_bio() completed, marking the target zone of the BIO as plugged. When this BIO is retried later on, it will be blocked as the zone write plug of the target zone is in a plugged state without any on-going write operation (completion of write operations trigger unplugging of the next write BIOs for a zone). This leads to a BIO that is stuck in a zone write plug and never completes, which results in various issues such as hung tasks. Avoid this problem by always executing REQ_NOWAIT write BIOs using the BIO work of a zone write plug. This ensure that we never block the BIO issuer and can thus safely ignore the REQ_NOWAIT flag when executing the BIO from the zone write plug BIO work. Since such BIO may be the first write BIO issued to a zone with no on-going write, modify disk_zone_wplug_add_bio() to schedule the zone write plug BIO work if the write plug is not already marked with the BLK_ZONE_WPLUG_PLUGGED flag. This scheduling is otherwise not necessary as the completion of the on-going write for the zone will schedule the execution of the next plugged BIOs. blk_zone_wplug_handle_write() is also fixed to better handle zone write plug allocation failures for REQ_NOWAIT BIOs by failing a write BIO using bio_wouldblock_error() instead of bio_io_error(). Reported-by: Bart Van Assche <bvanassche@acm.org> Fixes: dd291d77cc90 ("block: Introduce zone write plugging") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241209122357.47838-2-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: move cpuhp callback registering out of q->sysfs_lockMing Lei2024-12-061-11/+92
| | | | | | | | | | | | | | | | | | | | | | | | | Registering and unregistering cpuhp callback requires global cpu hotplug lock, which is used everywhere. Meantime q->sysfs_lock is used in block layer almost everywhere. It is easy to trigger lockdep warning[1] by connecting the two locks. Fix the warning by moving blk-mq's cpuhp callback registering out of q->sysfs_lock. Add one dedicated global lock for covering registering & unregistering hctx's cpuhp, and it is safe to do so because hctx is guaranteed to be live if our request_queue is live. [1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/ Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Reported-by: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-mq: register cpuhp callback after hctx is added to xarray tableMing Lei2024-12-061-8/+7
| | | | | | | | | | | | | | | We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the callback should be registered after this 'hctx' is added to xarray table. Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Cc: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'block-6.13-20242901' of git://git.kernel.dk/linuxLinus Torvalds2024-12-0110-77/+204
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more block updates from Jens Axboe: - NVMe pull request via Keith: - Use correct srcu list traversal (Breno) - Scatter-gather support for metadata (Keith) - Fabrics shutdown race condition fix (Nilay) - Persistent reservations updates (Guixin) - Add the required bits for MD atomic write support for raid0/1/10 - Correct return value for unknown opcode in ublk - Fix deadlock with zone revalidation - Fix for the io priority request vs bio cleanups - Use the correct unsigned int type for various limit helpers - Fix for a race in loop - Cleanup blk_rq_prep_clone() to prevent uninit-value warning and make it easier for actual humans to read - Fix potential UAF when iterating tags - A few fixes for bfq-iosched UAF issues - Fix for brd discard not decrementing the allocated page count - Various little fixes and cleanups * tag 'block-6.13-20242901' of git://git.kernel.dk/linux: (36 commits) brd: decrease the number of allocated pages which discarded block, bfq: fix bfqq uaf in bfq_limit_depth() block: Don't allow an atomic write be truncated in blkdev_write_iter() mq-deadline: don't call req_get_ioprio from the I/O completion handler block: Prevent potential deadlock in blk_revalidate_disk_zones() block: Remove extra part pointer NULLify in blk_rq_init() nvme: tuning pr code by using defined structs and macros nvme: introduce change ptpl and iekey definition block: return bool from get_disk_ro and bdev_read_only block: remove a duplicate definition for bdev_read_only block: return bool from blk_rq_aligned block: return unsigned int from blk_lim_dma_alignment_and_pad block: return unsigned int from queue_dma_alignment block: return unsigned int from bdev_io_opt block: req->bio is always set in the merge code block: don't bother checking the data direction for merges block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactor Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()" md/raid10: Atomic write support md/raid1: Atomic write support ...
| * block, bfq: fix bfqq uaf in bfq_limit_depth()Yu Kuai2024-11-291-13/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Set new allocated bfqq to bic or remove freed bfqq from bic are both protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq from bic without the lock, this can lead to UAF if the io_context is shared by multiple tasks. For example, test bfq with io_uring can trigger following UAF in v6.6: ================================================================== BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50 Call Trace: <TASK> dump_stack_lvl+0x47/0x80 print_address_description.constprop.0+0x66/0x300 print_report+0x3e/0x70 kasan_report+0xb4/0xf0 bfqq_group+0x15/0x50 bfqq_request_over_limit+0x130/0x9a0 bfq_limit_depth+0x1b5/0x480 __blk_mq_alloc_requests+0x2b5/0xa00 blk_mq_get_new_requests+0x11d/0x1d0 blk_mq_submit_bio+0x286/0xb00 submit_bio_noacct_nocheck+0x331/0x400 __block_write_full_folio+0x3d0/0x640 writepage_cb+0x3b/0xc0 write_cache_pages+0x254/0x6c0 write_cache_pages+0x254/0x6c0 do_writepages+0x192/0x310 filemap_fdatawrite_wbc+0x95/0xc0 __filemap_fdatawrite_range+0x99/0xd0 filemap_write_and_wait_range.part.0+0x4d/0xa0 blkdev_read_iter+0xef/0x1e0 io_read+0x1b6/0x8a0 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork_asm+0x1b/0x30 </TASK> Allocated by task 808602: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_slab_alloc+0x83/0x90 kmem_cache_alloc_node+0x1b1/0x6d0 bfq_get_queue+0x138/0xfa0 bfq_get_bfqq_handle_split+0xe3/0x2c0 bfq_init_rq+0x196/0xbb0 bfq_insert_request.isra.0+0xb5/0x480 bfq_insert_requests+0x156/0x180 blk_mq_insert_request+0x15d/0x440 blk_mq_submit_bio+0x8a4/0xb00 submit_bio_noacct_nocheck+0x331/0x400 __blkdev_direct_IO_async+0x2dd/0x330 blkdev_write_iter+0x39a/0x450 io_write+0x22a/0x840 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1b/0x30 Freed by task 808589: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x27/0x40 __kasan_slab_free+0x126/0x1b0 kmem_cache_free+0x10c/0x750 bfq_put_queue+0x2dd/0x770 __bfq_insert_request.isra.0+0x155/0x7a0 bfq_insert_request.isra.0+0x122/0x480 bfq_insert_requests+0x156/0x180 blk_mq_dispatch_plug_list+0x528/0x7e0 blk_mq_flush_plug_list.part.0+0xe5/0x590 __blk_flush_plug+0x3b/0x90 blk_finish_plug+0x40/0x60 do_writepages+0x19d/0x310 filemap_fdatawrite_wbc+0x95/0xc0 __filemap_fdatawrite_range+0x99/0xd0 filemap_write_and_wait_range.part.0+0x4d/0xa0 blkdev_read_iter+0xef/0x1e0 io_read+0x1b6/0x8a0 io_issue_sqe+0x87/0x300 io_wq_submit_work+0xeb/0x390 io_worker_handle_work+0x24d/0x550 io_wq_worker+0x27f/0x6c0 ret_from_fork+0x2d/0x50 ret_from_fork_asm+0x1b/0x30 Fix the problem by protecting bic_to_bfqq() with bfqd->lock. CC: Jan Kara <jack@suse.cz> Fixes: 76f1df88bbc2 ("bfq: Limit number of requests consumed by each cgroup") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241129091509.2227136-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Don't allow an atomic write be truncated in blkdev_write_iter()John Garry2024-11-271-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | A write which goes past the end of the bdev in blkdev_write_iter() will be truncated. Truncating cannot tolerated for an atomic write, so error that condition. Fixes: caf336f81b3a ("block: Add fops atomic write support") Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241127092318.632790-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * mq-deadline: don't call req_get_ioprio from the I/O completion handlerChristoph Hellwig2024-11-261-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | req_get_ioprio looks at req->bio to find the I/O priority, which is not set when completing bios that the driver fully iterated through. Stash away the dd_per_prio in the elevator private data instead of looking it up again to optimize the code a bit while fixing the regression from removing the per-request ioprio value. Fixes: 6975c1a486a4 ("block: remove the ioprio field from struct request") Reported-by: Chris Bainbridge <chris.bainbridge@gmail.com> Reported-by: Sam Protsenko <semen.protsenko@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: Chris Bainbridge <chris.bainbridge@gmail.com> Tested-by: Sam Protsenko <semen.protsenko@linaro.org> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241126102136.619067-1-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Prevent potential deadlock in blk_revalidate_disk_zones()Damien Le Moal2024-11-261-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function blk_revalidate_disk_zones() calls the function disk_update_zone_resources() after freezing the device queue. In turn, disk_update_zone_resources() calls queue_limits_start_update() which takes a queue limits mutex lock, resulting in the ordering: q->q_usage_counter check -> q->limits_lock. However, the usual ordering is to always take a queue limit lock before freezing the queue to commit the limits updates, e.g., the code pattern: lim = queue_limits_start_update(q); ... blk_mq_freeze_queue(q); ret = queue_limits_commit_update(q, &lim); blk_mq_unfreeze_queue(q); Thus, blk_revalidate_disk_zones() introduces a potential circular locking dependency deadlock that lockdep sometimes catches with the splat: [ 51.934109] ====================================================== [ 51.935916] WARNING: possible circular locking dependency detected [ 51.937561] 6.12.0+ #2107 Not tainted [ 51.938648] ------------------------------------------------------ [ 51.940351] kworker/u16:4/157 is trying to acquire lock: [ 51.941805] ffff9fff0aa0bea8 (&q->limits_lock){+.+.}-{4:4}, at: disk_update_zone_resources+0x86/0x170 [ 51.944314] but task is already holding lock: [ 51.945688] ffff9fff0aa0b890 (&q->q_usage_counter(queue)#3){++++}-{0:0}, at: blk_revalidate_disk_zones+0x15f/0x340 [ 51.948527] which lock already depends on the new lock. [ 51.951296] the existing dependency chain (in reverse order) is: [ 51.953708] -> #1 (&q->q_usage_counter(queue)#3){++++}-{0:0}: [ 51.956131] blk_queue_enter+0x1c9/0x1e0 [ 51.957290] blk_mq_alloc_request+0x187/0x2a0 [ 51.958365] scsi_execute_cmd+0x78/0x490 [scsi_mod] [ 51.959514] read_capacity_16+0x111/0x410 [sd_mod] [ 51.960693] sd_revalidate_disk.isra.0+0x872/0x3240 [sd_mod] [ 51.962004] sd_probe+0x2d7/0x520 [sd_mod] [ 51.962993] really_probe+0xd5/0x330 [ 51.963898] __driver_probe_device+0x78/0x110 [ 51.964925] driver_probe_device+0x1f/0xa0 [ 51.965916] __driver_attach_async_helper+0x60/0xe0 [ 51.967017] async_run_entry_fn+0x2e/0x140 [ 51.968004] process_one_work+0x21f/0x5a0 [ 51.968987] worker_thread+0x1dc/0x3c0 [ 51.969868] kthread+0xe0/0x110 [ 51.970377] ret_from_fork+0x31/0x50 [ 51.970983] ret_from_fork_asm+0x11/0x20 [ 51.971587] -> #0 (&q->limits_lock){+.+.}-{4:4}: [ 51.972479] __lock_acquire+0x1337/0x2130 [ 51.973133] lock_acquire+0xc5/0x2d0 [ 51.973691] __mutex_lock+0xda/0xcf0 [ 51.974300] disk_update_zone_resources+0x86/0x170 [ 51.975032] blk_revalidate_disk_zones+0x16c/0x340 [ 51.975740] sd_zbc_revalidate_zones+0x73/0x160 [sd_mod] [ 51.976524] sd_revalidate_disk.isra.0+0x465/0x3240 [sd_mod] [ 51.977824] sd_probe+0x2d7/0x520 [sd_mod] [ 51.978917] really_probe+0xd5/0x330 [ 51.979915] __driver_probe_device+0x78/0x110 [ 51.981047] driver_probe_device+0x1f/0xa0 [ 51.982143] __driver_attach_async_helper+0x60/0xe0 [ 51.983282] async_run_entry_fn+0x2e/0x140 [ 51.984319] process_one_work+0x21f/0x5a0 [ 51.985873] worker_thread+0x1dc/0x3c0 [ 51.987289] kthread+0xe0/0x110 [ 51.988546] ret_from_fork+0x31/0x50 [ 51.989926] ret_from_fork_asm+0x11/0x20 [ 51.991376] other info that might help us debug this: [ 51.994127] Possible unsafe locking scenario: [ 51.995651] CPU0 CPU1 [ 51.996694] ---- ---- [ 51.997716] lock(&q->q_usage_counter(queue)#3); [ 51.998817] lock(&q->limits_lock); [ 52.000043] lock(&q->q_usage_counter(queue)#3); [ 52.001638] lock(&q->limits_lock); [ 52.002485] *** DEADLOCK *** Prevent this issue by moving the calls to blk_mq_freeze_queue() and blk_mq_unfreeze_queue() around the call to queue_limits_commit_update() in disk_update_zone_resources(). In case of revalidation failure, the call to disk_free_zone_resources() in blk_revalidate_disk_zones() is still done with the queue frozen as before. Fixes: 843283e96e5a ("block: Fake max open zones limit when there is no limit") Cc: stable@vger.kernel.org Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241126104705.183996-1-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Remove extra part pointer NULLify in blk_rq_init()John Garry2024-11-251-1/+0
| | | | | | | | | | | | | | | | | | The rq->part pointer is already NULLified in the memset() call, so - like for other pointers in rq - don't re-NULLify. Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241125100258.4172774-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: req->bio is always set in the merge codeChristoph Hellwig2024-11-201-22/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As smatch, which is a lot smarter than me noticed. So remove the checks for it, and condense these checks a bit including the comments stating the obvious. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241119161157.1328171-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: don't bother checking the data direction for mergesChristoph Hellwig2024-11-201-7/+0
| | | | | | | | | | | | | | | | | | | | | | Because it already is encoded in the opcode. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241119161157.1328171-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: blk-mq: fix uninit-value in blk_rq_prep_clone and refactorSuraj Sonawane2024-11-201-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix an issue detected by the `smatch` tool: block/blk-mq.c:3314 blk_rq_prep_clone() error: uninitialized symbol 'bio'. This patch refactors `blk_rq_prep_clone()` to improve code readability and ensure safety by addressing potential misuse of the `bio` variable: - Move the bio_put(bio); call to the bio_ctr error handling block, which is the only place where it can be triggered. - Move the bio variable into the __rq_for_each_bio loop scope. This change removes the need to set bio to NULL at the loop's end. discussion on why bio remains uninitialized: https://lore.kernel.org/lkml/20241004141037.43277-1-surajsonawane0215@gmail.com Summary of above discussion: - I pointed out that `bio` can remain uninitialized if the allocation with `bio_alloc_clone` fails. - Keith Busch explained that `bio` is initialized to `NULL` when `bio_alloc_clone()` fails, preventing uninitialized usage. - John Garry questioned whether `rq_src->bio` being `NULL` could leave `bio` uninitialized. Keith clarified that in such cases, `bio` is not referenced, so it does not need initialization. - Christoph Hellwig recommended code improvements: - move the bio_put to the bio_ctr error handling, which is the only case where it can happen - move the bio variable into the __rq_for_each_bio scope, which also removed the need to zero it at the end of the loop These changes enhance code clarity, address static analysis tool warnings, and make the function more maintainable. thread of previous version patch discussion: https://lore.kernel.org/lkml/20241004100842.9052-1-surajsonawane0215@gmail.com Signed-off-by: Suraj Sonawane <surajsonawane0215@gmail.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241119164412.37609-1-surajsonawane0215@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * Revert "block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()"Zach Wade2024-11-202-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit bc3b1e9e7c50e1de0f573eea3871db61dd4787de. The bic is associated with sync_bfqq, and bfq_release_process_ref cannot be put into bfq_put_cooperator. kasan report: [ 400.347277] ================================================================== [ 400.347287] BUG: KASAN: slab-use-after-free in bic_set_bfqq+0x200/0x230 [ 400.347420] Read of size 8 at addr ffff88881cab7d60 by task dockerd/5800 [ 400.347430] [ 400.347436] CPU: 24 UID: 0 PID: 5800 Comm: dockerd Kdump: loaded Tainted: G E 6.12.0 #32 [ 400.347450] Tainted: [E]=UNSIGNED_MODULE [ 400.347454] Hardware name: VMware, Inc. VMware20,1/440BX Desktop Reference Platform, BIOS VMW201.00V.20192059.B64.2207280713 07/28/2022 [ 400.347460] Call Trace: [ 400.347464] <TASK> [ 400.347468] dump_stack_lvl+0x5d/0x80 [ 400.347490] print_report+0x174/0x505 [ 400.347521] kasan_report+0xe0/0x160 [ 400.347541] bic_set_bfqq+0x200/0x230 [ 400.347549] bfq_bic_update_cgroup+0x419/0x740 [ 400.347560] bfq_bio_merge+0x133/0x320 [ 400.347584] blk_mq_submit_bio+0x1761/0x1e20 [ 400.347625] __submit_bio+0x28b/0x7b0 [ 400.347664] submit_bio_noacct_nocheck+0x6b2/0xd30 [ 400.347690] iomap_readahead+0x50c/0x680 [ 400.347731] read_pages+0x17f/0x9c0 [ 400.347785] page_cache_ra_unbounded+0x366/0x4a0 [ 400.347795] filemap_fault+0x83d/0x2340 [ 400.347819] __xfs_filemap_fault+0x11a/0x7d0 [xfs] [ 400.349256] __do_fault+0xf1/0x610 [ 400.349270] do_fault+0x977/0x11a0 [ 400.349281] __handle_mm_fault+0x5d1/0x850 [ 400.349314] handle_mm_fault+0x1f8/0x560 [ 400.349324] do_user_addr_fault+0x324/0x970 [ 400.349337] exc_page_fault+0x76/0xf0 [ 400.349350] asm_exc_page_fault+0x26/0x30 [ 400.349360] RIP: 0033:0x55a480d77375 [ 400.349384] Code: cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc cc 49 3b 66 10 0f 86 ae 02 00 00 55 48 89 e5 48 83 ec 58 48 8b 10 <83> 7a 10 00 0f 84 27 02 00 00 44 0f b6 42 28 44 0f b6 4a 29 41 80 [ 400.349392] RSP: 002b:00007f18c37fd8b8 EFLAGS: 00010216 [ 400.349401] RAX: 00007f18c37fd9d0 RBX: 0000000000000000 RCX: 0000000000000000 [ 400.349407] RDX: 000055a484407d38 RSI: 000000c000e8b0c0 RDI: 0000000000000000 [ 400.349412] RBP: 00007f18c37fd910 R08: 000055a484017f60 R09: 000055a484066f80 [ 400.349417] R10: 0000000000194000 R11: 0000000000000005 R12: 0000000000000008 [ 400.349422] R13: 0000000000000000 R14: 000000c000476a80 R15: 0000000000000000 [ 400.349430] </TASK> [ 400.349452] [ 400.349454] Allocated by task 5800: [ 400.349459] kasan_save_stack+0x30/0x50 [ 400.349469] kasan_save_track+0x14/0x30 [ 400.349475] __kasan_slab_alloc+0x89/0x90 [ 400.349482] kmem_cache_alloc_node_noprof+0xdc/0x2a0 [ 400.349492] bfq_get_queue+0x1ef/0x1100 [ 400.349502] __bfq_get_bfqq_handle_split+0x11a/0x510 [ 400.349511] bfq_insert_requests+0xf55/0x9030 [ 400.349519] blk_mq_flush_plug_list+0x446/0x14c0 [ 400.349527] __blk_flush_plug+0x27c/0x4e0 [ 400.349534] blk_finish_plug+0x52/0xa0 [ 400.349540] _xfs_buf_ioapply+0x739/0xc30 [xfs] [ 400.350246] __xfs_buf_submit+0x1b2/0x640 [xfs] [ 400.350967] xfs_buf_read_map+0x306/0xa20 [xfs] [ 400.351672] xfs_trans_read_buf_map+0x285/0x7d0 [xfs] [ 400.352386] xfs_imap_to_bp+0x107/0x270 [xfs] [ 400.353077] xfs_iget+0x70d/0x1eb0 [xfs] [ 400.353786] xfs_lookup+0x2ca/0x3a0 [xfs] [ 400.354506] xfs_vn_lookup+0x14e/0x1a0 [xfs] [ 400.355197] __lookup_slow+0x19c/0x340 [ 400.355204] lookup_one_unlocked+0xfc/0x120 [ 400.355211] ovl_lookup_single+0x1b3/0xcf0 [overlay] [ 400.355255] ovl_lookup_layer+0x316/0x490 [overlay] [ 400.355295] ovl_lookup+0x844/0x1fd0 [overlay] [ 400.355351] lookup_one_qstr_excl+0xef/0x150 [ 400.355357] do_unlinkat+0x22a/0x620 [ 400.355366] __x64_sys_unlinkat+0x109/0x1e0 [ 400.355375] do_syscall_64+0x82/0x160 [ 400.355384] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 400.355393] [ 400.355395] Freed by task 5800: [ 400.355400] kasan_save_stack+0x30/0x50 [ 400.355407] kasan_save_track+0x14/0x30 [ 400.355413] kasan_save_free_info+0x3b/0x70 [ 400.355422] __kasan_slab_free+0x4f/0x70 [ 400.355429] kmem_cache_free+0x176/0x520 [ 400.355438] bfq_put_queue+0x67e/0x980 [ 400.355447] bfq_bic_update_cgroup+0x407/0x740 [ 400.355454] bfq_bio_merge+0x133/0x320 [ 400.355460] blk_mq_submit_bio+0x1761/0x1e20 [ 400.355467] __submit_bio+0x28b/0x7b0 [ 400.355473] submit_bio_noacct_nocheck+0x6b2/0xd30 [ 400.355480] iomap_readahead+0x50c/0x680 [ 400.355490] read_pages+0x17f/0x9c0 [ 400.355498] page_cache_ra_unbounded+0x366/0x4a0 [ 400.355505] filemap_fault+0x83d/0x2340 [ 400.355514] __xfs_filemap_fault+0x11a/0x7d0 [xfs] [ 400.356204] __do_fault+0xf1/0x610 [ 400.356213] do_fault+0x977/0x11a0 [ 400.356221] __handle_mm_fault+0x5d1/0x850 [ 400.356230] handle_mm_fault+0x1f8/0x560 [ 400.356238] do_user_addr_fault+0x324/0x970 [ 400.356248] exc_page_fault+0x76/0xf0 [ 400.356258] asm_exc_page_fault+0x26/0x30 [ 400.356266] [ 400.356269] The buggy address belongs to the object at ffff88881cab7bc0 which belongs to the cache bfq_queue of size 576 [ 400.356276] The buggy address is located 416 bytes inside of freed 576-byte region [ffff88881cab7bc0, ffff88881cab7e00) [ 400.356285] [ 400.356287] The buggy address belongs to the physical page: [ 400.356292] page: refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88881cab0b00 pfn:0x81cab0 [ 400.356300] head: order:3 mapcount:0 entire_mapcount:0 nr_pages_mapped:0 pincount:0 [ 400.356323] flags: 0x50000000000040(head|node=1|zone=2) [ 400.356331] page_type: f5(slab) [ 400.356340] raw: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000 [ 400.356347] raw: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000 [ 400.356354] head: 0050000000000040 ffff88880a00c280 dead000000000122 0000000000000000 [ 400.356359] head: ffff88881cab0b00 00000000802e0025 00000001f5000000 0000000000000000 [ 400.356365] head: 0050000000000003 ffffea002072ac01 ffffffffffffffff 0000000000000000 [ 400.356370] head: 0000000000000008 0000000000000000 00000000ffffffff 0000000000000000 [ 400.356378] page dumped because: kasan: bad access detected [ 400.356381] [ 400.356383] Memory state around the buggy address: [ 400.356387] ffff88881cab7c00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356392] ffff88881cab7c80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356397] >ffff88881cab7d00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356400] ^ [ 400.356405] ffff88881cab7d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 400.356409] ffff88881cab7e00: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc [ 400.356413] ================================================================== Cc: stable@vger.kernel.org Fixes: bc3b1e9e7c50 ("block, bfq: merge bfq_release_process_ref() into bfq_put_cooperator()") Signed-off-by: Zach Wade <zachwade.k@gmail.com> Cc: Ding Hui <dinghui@sangfor.com.cn> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Link: https://lore.kernel.org/r/20241119153410.2546-1-zachwade.k@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Support atomic writes limits for stacked devicesJohn Garry2024-11-191-0/+115
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow stacked devices to support atomic writes by aggregating the minimum capability of all bottom devices. Flag BLK_FEAT_ATOMIC_WRITES_STACKED is set for stacked devices which have been enabled to support atomic writes. Some things to note on the implementation: - For simplicity, all bottom devices must have same atomic write boundary value (if any) - The atomic write boundary must be a power-of-2 already, but this restriction could be relaxed. Furthermore, it is now required that the chunk sectors for a top device must be aligned with this boundary. - If a bottom device atomic write unit min/max are not aligned with the top device chunk sectors, the top device atomic write unit min/max are reduced to a value which works for the chunk sectors. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Add extra checks in blk_validate_atomic_write_limits()John Garry2024-11-191-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | It is so far expected that the limits passed are valid. In future atomic writes will be supported for stacked block devices, and calculating the limits there will be complicated, so add extra sanity checks to ensure that the values are always valid. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Martin K. Petersen <martin.petersen@oracle.com> Link: https://lore.kernel.org/r/20241118105018.1870052-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Drop granularity check in queue_limit_discard_alignment()John Garry2024-11-191-2/+0
| | | | | | | | | | | | | | | | | | | | lim->discard_granularity is always at least SECTOR_SIZE, so drop the pointless check for granularity less than SECTOR_SIZE. Signed-off-by: John Garry <john.g.garry@oracle.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241112092144.4059847-1-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix uaf for flush rq while iterating tagsYu Kuai2024-11-192-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared in del_gendisk by commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk"), hence for disk like scsi, following blk_mq_destroy_queue() will not clear flush rq from tags->rqs[] as well, cause following uaf that is found by our syzkaller for v6.6: ================================================================== BUG: KASAN: slab-use-after-free in blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 Read of size 4 at addr ffff88811c969c20 by task kworker/1:2H/224909 CPU: 1 PID: 224909 Comm: kworker/1:2H Not tainted 6.6.0-ga836a5060850 #32 Workqueue: kblockd blk_mq_timeout_work Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 print_report+0x3e/0x70 mm/kasan/report.c:475 kasan_report+0xb8/0xf0 mm/kasan/report.c:588 blk_mq_find_and_get_req+0x16e/0x1a0 block/blk-mq-tag.c:261 bt_iter block/blk-mq-tag.c:288 [inline] __sbitmap_for_each_set include/linux/sbitmap.h:295 [inline] sbitmap_for_each_set include/linux/sbitmap.h:316 [inline] bt_for_each+0x455/0x790 block/blk-mq-tag.c:325 blk_mq_queue_tag_busy_iter+0x320/0x740 block/blk-mq-tag.c:534 blk_mq_timeout_work+0x1a3/0x7b0 block/blk-mq.c:1673 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 process_scheduled_works kernel/workqueue.c:2704 [inline] worker_thread+0x804/0xe40 kernel/workqueue.c:2785 kthread+0x346/0x450 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 Allocated by task 942: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc mm/kasan/common.c:383 [inline] __kasan_kmalloc+0xaa/0xb0 mm/kasan/common.c:380 kasan_kmalloc include/linux/kasan.h:198 [inline] __do_kmalloc_node mm/slab_common.c:1007 [inline] __kmalloc_node+0x69/0x170 mm/slab_common.c:1014 kmalloc_node include/linux/slab.h:620 [inline] kzalloc_node include/linux/slab.h:732 [inline] blk_alloc_flush_queue+0x144/0x2f0 block/blk-flush.c:499 blk_mq_alloc_hctx+0x601/0x940 block/blk-mq.c:3788 blk_mq_alloc_and_init_hctx+0x27f/0x330 block/blk-mq.c:4261 blk_mq_realloc_hw_ctxs+0x488/0x5e0 block/blk-mq.c:4294 blk_mq_init_allocated_queue+0x188/0x860 block/blk-mq.c:4350 blk_mq_init_queue_data block/blk-mq.c:4166 [inline] blk_mq_init_queue+0x8d/0x100 block/blk-mq.c:4176 scsi_alloc_sdev+0x843/0xd50 drivers/scsi/scsi_scan.c:335 scsi_probe_and_add_lun+0x77c/0xde0 drivers/scsi/scsi_scan.c:1189 __scsi_scan_target+0x1fc/0x5a0 drivers/scsi/scsi_scan.c:1727 scsi_scan_channel drivers/scsi/scsi_scan.c:1815 [inline] scsi_scan_channel+0x14b/0x1e0 drivers/scsi/scsi_scan.c:1791 scsi_scan_host_selected+0x2fe/0x400 drivers/scsi/scsi_scan.c:1844 scsi_scan+0x3a0/0x3f0 drivers/scsi/scsi_sysfs.c:151 store_scan+0x2a/0x60 drivers/scsi/scsi_sysfs.c:191 dev_attr_store+0x5c/0x90 drivers/base/core.c:2388 sysfs_kf_write+0x11c/0x170 fs/sysfs/file.c:136 kernfs_fop_write_iter+0x3fc/0x610 fs/kernfs/file.c:338 call_write_iter include/linux/fs.h:2083 [inline] new_sync_write+0x1b4/0x2d0 fs/read_write.c:493 vfs_write+0x76c/0xb00 fs/read_write.c:586 ksys_write+0x127/0x250 fs/read_write.c:639 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Freed by task 244687: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1815 [inline] slab_free_freelist_hook mm/slub.c:1841 [inline] slab_free mm/slub.c:3807 [inline] __kmem_cache_free+0xe4/0x520 mm/slub.c:3820 blk_free_flush_queue+0x40/0x60 block/blk-flush.c:520 blk_mq_hw_sysfs_release+0x4a/0x170 block/blk-mq-sysfs.c:37 kobject_cleanup+0x136/0x410 lib/kobject.c:689 kobject_release lib/kobject.c:720 [inline] kref_put include/linux/kref.h:65 [inline] kobject_put+0x119/0x140 lib/kobject.c:737 blk_mq_release+0x24f/0x3f0 block/blk-mq.c:4144 blk_free_queue block/blk-core.c:298 [inline] blk_put_queue+0xe2/0x180 block/blk-core.c:314 blkg_free_workfn+0x376/0x6e0 block/blk-cgroup.c:144 process_one_work+0x7c4/0x1450 kernel/workqueue.c:2631 process_scheduled_works kernel/workqueue.c:2704 [inline] worker_thread+0x804/0xe40 kernel/workqueue.c:2785 kthread+0x346/0x450 kernel/kthread.c:388 ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:293 Other than blk_mq_clear_flush_rq_mapping(), the flag is only used in blk_register_queue() from initialization path, hence it's safe not to clear the flag in del_gendisk. And since QUEUE_FLAG_REGISTERED already make sure that queue should only be registered once, there is no need to test the flag as well. Fixes: 6cfeadbff3f8 ("blk-mq: don't clear flush_rq from tags->rqs[]") Depends-on: commit aec89dc5d421 ("block: keep q_usage_counter in atomic mode after del_gendisk") Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241104110005.1412161-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * blk-settings: round down io_opt to physical_block_sizeMikulas Patocka2024-11-181-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was a bug report [1] where the user got a warning alignment inconsistency. The user has optimal I/O 16776704 (0xFFFE00) and physical block size 4096. Note that the optimal I/O size may be set by the DMA engines or SCSI controllers and they have no knowledge about the disks attached to them, so the situation with optimal I/O not aligned to physical block size may happen. This commit makes blk_validate_limits round down optimal I/O size to the physical block size of the block device. Closes: https://lore.kernel.org/dm-devel/1426ad71-79b4-4062-b2bf-84278be66a5d@redhat.com/T/ [1] Signed-off-by: Mikulas Patocka <mpatocka@redhat.com> Fixes: a23634644afc ("block: take io_opt and io_min into account for max_sectors") Cc: stable@vger.kernel.org # v6.11+ Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/3dc0014b-9690-dc38-81c9-4a316a2d4fb2@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-6.13/block-20241118' of git://git.kernel.dk/linuxLinus Torvalds2024-11-1925-440/+760
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block updates from Jens Axboe: - NVMe updates via Keith: - Use uring_cmd helper (Pavel) - Host Memory Buffer allocation enhancements (Christoph) - Target persistent reservation support (Guixin) - Persistent reservation tracing (Guixen) - NVMe 2.1 specification support (Keith) - Rotational Meta Support (Matias, Wang, Keith) - Volatile cache detection enhancment (Guixen) - MD updates via Song: - Maintainers update - raid5 sync IO fix - Enhance handling of faulty and blocked devices - raid5-ppl atomic improvement - md-bitmap fix - Support for manually defining embedded partition tables - Zone append fixes and cleanups - Stop sending the queued requests in the plug list to the driver ->queue_rqs() handle in reverse order. - Zoned write plug cleanups - Cleanups disk stats tracking and add support for disk stats for passthrough IO - Add preparatory support for file system atomic writes - Add lockdep support for queue freezing. Already found a bunch of issues, and some fixes for that are in here. More will be coming. - Fix race between queue stopping/quiescing and IO queueing - ublk recovery improvements - Fix ublk mmap for 64k pages - Various fixes and cleanups * tag 'for-6.13/block-20241118' of git://git.kernel.dk/linux: (118 commits) MAINTAINERS: Update git tree for mdraid subsystem block: make struct rq_list available for !CONFIG_BLOCK block/genhd: use seq_put_decimal_ull for diskstats decimal values block: don't reorder requests in blk_mq_add_to_batch block: don't reorder requests in blk_add_rq_to_plug block: add a rq_list type block: remove rq_list_move virtio_blk: reverse request order in virtio_queue_rqs nvme-pci: reverse request order in nvme_queue_rqs btrfs: validate queue limits block: export blk_validate_limits nvmet: add tracing of reservation commands nvme: parse reservation commands's action and rtype to string nvmet: report ns's vwc not present md/raid5: Increase r5conf.cache_name size block: remove the ioprio field from struct request block: remove the write_hint field from struct request nvme: check ns's volatile write cache not present nvme: add rotational support nvme: use command set independent id ns if available ...
| * block/genhd: use seq_put_decimal_ull for diskstats decimal valuesDavid Wang2024-11-131-34/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | seq_printf is costly. For each block device, 19 decimal values are yielded in /proc/diskstats via seq_printf. On a system with 16 logical block devices, profiling for open/read/close sequences shows seq_printf took ~75% samples of diskstats_show: diskstats_show(92.626% 2269372/2450040) seq_printf(76.026% 1725313/2269372) vsnprintf(99.163% 1710866/1725313) format_decode(26.597% 455040/1710866) number(19.554% 334542/1710866) memcpy_orig(4.183% 71570/1710866) ... srso_return_thunk(0.009% 148/1725313) part_stat_read_all(8.030% 182236/2269372) One million rounds of open/read/close /proc/diskstats takes: real 0m37.687s user 0m0.264s sys 0m32.911s On average, each sequence tooks ~0.032ms With this patch, most decimal values are yield via seq_put_decimal_ull, performance is significantly improved: real 0m20.792s user 0m0.316s sys 0m20.463s On average, each sequence tooks ~0.020ms, a ~37.5% improvement. Signed-off-by: David Wang <00107082@163.com> Link: https://lore.kernel.org/r/20241108054500.4251-1-00107082@163.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: don't reorder requests in blk_add_rq_to_plugChristoph Hellwig2024-11-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Add requests to the tail of the list instead of the front so that they are queued up in submission order. Remove the re-reordering in blk_mq_dispatch_plug_list, virtio_queue_rqs and nvme_queue_rqs now that the list is ordered as expected. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-6-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: add a rq_list typeChristoph Hellwig2024-11-134-26/+24
| | | | | | | | | | | | | | | | | | | | | | Replace the semi-open coded request list helpers with a proper rq_list type that mirrors the bio_list and has head and tail pointers. Besides better type safety this actually allows to insert at the tail of the list, which will be useful soon. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241113152050.157179-5-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: export blk_validate_limitsChristoph Hellwig2024-11-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While block drivers do the validation as part of committing them to the queue, users that use the limit outside of a block device context have to validate the limits and fill in the calculated values as well. So far btrfs is the only user of queue limits without a block device, and it has gotten away with that more or less by accident. But with commit 559218d43ec9 ("block: pre-calculate max_zone_append_sectors") this became fatal for setups that have small max zone append size, as it won't be limited now. Export blk_validate_limits so that it can be called directly from btrfs. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241113084541.34315-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove the ioprio field from struct requestChristoph Hellwig2024-11-122-8/+5
| | | | | | | | | | | | | | | | | | | | | | The request ioprio is only initialized from the first attached bio, so requests without a bio already never set it. Directly use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove the write_hint field from struct requestChristoph Hellwig2024-11-122-8/+10
| | | | | | | | | | | | | | | | | | | | The write_hint is only used for read/write requests, which must have a bio attached to them. Just use the bio field instead. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20241112170050.1612998-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: pre-calculate max_zone_append_sectorsChristoph Hellwig2024-11-114-31/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | max_zone_append_sectors differs from all other queue limits in that the final value used is not stored in the queue_limits but needs to be obtained using queue_limits_max_zone_append_sectors helper. This not only adds (tiny) extra overhead to the I/O path, but also can be easily forgotten in file system code. Add a new max_hw_zone_append_sectors value to queue_limits which is set by the driver, and calculate max_zone_append_sectors from that and the other inputs in blk_validate_zoned_limits, similar to how max_sectors is calculated to fix this. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20241104073955.112324-3-hch@lst.de Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241108154657.845768-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: lift bio_is_zone_append to bio.hChristoph Hellwig2024-11-111-9/+0
| | | | | | | | | | | | | | | | | | | | | | | | Make bio_is_zone_append globally available, because file systems need to use to check for a zone append bio in their end_io handlers to deal with the block layer emulation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-4-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: fix bio_split_rw_at to take zone_write_granularity into accountChristoph Hellwig2024-11-111-1/+9
| | | | | | | | | | | | | | | | | | | | | | Otherwise it can create unaligned writes on zoned devices. Fixes: a805a4fa4fa3 ("block: introduce zone_write_granularity limit") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241104062647.91160-3-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: take chunk_sectors into account in bio_split_write_zeroesChristoph Hellwig2024-11-111-12/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For zoned devices, write zeroes must be split at the zone boundary which is represented as chunk_sectors. For other uses like the internally RAIDed NVMe devices it is probably at least useful. Enhance get_max_io_size to know about write zeroes and use it in bio_split_write_zeroes. Also add a comment about the seemingly nonsensical zero max_write_zeroes limit. Fixes: 885fa13f6559 ("block: implement splitting of REQ_OP_WRITE_ZEROES bios") Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Damien Le Moal <dlemoal@kernel.org> Link: https://lore.kernel.org/r/20241104062647.91160-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Handle bio_split() errors in bio_submit_split()John Garry2024-11-111-5/+10
| | | | | | | | | | | | | | | | | | | | | | bio_split() may error, so check this. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-4-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Error an attempt to split an atomic write in bio_split()John Garry2024-11-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | This is disallowed. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Hannes Reinecke <hare@suse.de> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-3-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Rework bio_split() return valueJohn Garry2024-11-112-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of returning an inconclusive value of NULL for an error in calling bio_split(), return a ERR_PTR() always. Also remove the BUG_ON() calls, and WARN_ON_ONCE() instead. Indeed, since almost all callers don't check the return code from bio_split(), we'll crash anyway (for those failures). Fix up the only user which checks bio_split() return code today (directly or indirectly), blk_crypto_fallback_split_bio_if_needed(). The md/bcache code does check the return code in cached_dev_cache_miss() -> bio_next_split() -> bio_split(), but only to see if there was a split, so there would be no change in behaviour here (when returning a ERR_PTR()). Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: John Garry <john.g.garry@oracle.com> Link: https://lore.kernel.org/r/20241111112150.3756529-2-john.g.garry@oracle.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: don't verify IO lock for freeze/unfreeze in elevator_init_mq()Ming Lei2024-11-081-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | elevator_init_mq() is only called at the entry of add_disk_fwnode() when disk IO isn't allowed yet. So not verify io lock(q->io_lockdep_map) for freeze & unfreeze in elevator_init_mq(). Reported-by: Marek Szyprowski <m.szyprowski@samsung.com> Reported-by: Lai Yi <yi1.lai@linux.intel.com> Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-5-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: always verify unfreeze lock on the owner taskMing Lei2024-11-083-10/+57
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") tries to apply lockdep for verifying freeze & unfreeze. However, the verification is only done the outmost freeze and unfreeze. This way is actually not correct because q->mq_freeze_depth still may drop to zero on other task instead of the freeze owner task. Fix this issue by always verifying the last unfreeze lock on the owner task context, and make sure both the outmost freeze & unfreeze are verified in the current task. Fixes: f1be1788a32e ("block: model freeze & enter queue as lock for supporting lockdep") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-4-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: remove blk_freeze_queue()Ming Lei2024-11-082-22/+1
| | | | | | | | | | | | | | | | | | No one use blk_freeze_queue(), so remove it and the obsolete comment. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241031133723.303835-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * block: Add a public bdev_zone_is_seq() helperDamien Le Moal2024-11-071-15/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Turn the private disk_zone_is_conv() function in blk-zoned.c into a public and documented bdev_zone_is_seq() helper with the inverse polarity of the original function, also adding a check for non-zoned devices so that all file systems can use the helper, even with a regular block device. Suggested-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Damien Le Moal <dlemoal@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Link: https://lore.kernel.org/r/20241107064300.227731-3-dlemoal@kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk>