path: root/src/os/bluestore/BlueFS.cc
Commit message | Author | Age | Files | Lines
* Merge pull request #60556 from ↵ | Adam Kupczyk | 13 days | 1 | -31/+29
|\
| |     aclamk/wip-aclamk-bluefs-truncate-allocations-main
| |
| |     os/bluestore: Make truncate() drop unused allocations - addendum
| * os/bluestore: Make truncate() drop unused allocations | Adam Kupczyk | 2024-10-29 | 1 | -31/+29
| |     Review fixes.
| |     Removed overcautious assert.
| |     Improved if .. else style.
| |     Skipped processing extent truncation when seek() goes to end.
| |
| |     Fixes: https://tracker.ceph.com/issues/68385 (addendum)
| |     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | os/bluestore: Fix BlueFS::truncate() | Adam Kupczyk | 2025-01-13 | 1 | -1/+3
| |     `struct bluefs_fnode_t` holds a vector `extents` and a vector
| |     `extents_index`, which is a log2 seek cache.
| |     Until truncate() was modified, extents were never removed from files.
| |     The modified truncate() did not update `extents_index`.
| |
| |     For example, a file with 10 extents, when truncated to 0, ends up with:
| |     0 extents, 10 extents_index entries.
| |     After writing some data to the file:
| |     1 extent, 11 extents_index entries.
| |
| |     Now `bluefs_fnode_t::seek` binary-searches extents_index;
| |     say it locates the seek position at item #3.
| |     It then jumps from extent #0 (which exists) to extent #3, which does not exist.
| |     The worst part is that the code is now broken, as #3 != extents.end().
| |
| |     There are 3 parts to the fix:
| |     1) an assert in `bluefs_fnode_t::seek` to protect against jumping outside extents
| |     2) code in BlueFS::truncate to sync `extents_index` with `extents`
| |     3) relaxing the assert in _replay to give a way out of cases where an incorrect
| |        "offset 12345" (12345 being the file size) was written to the log
| |        instead of "offset 20000" (the space occupied by allocations).
| |
| |     Fixes: https://tracker.ceph.com/issues/69481
| |     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
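The mismatch described in the commit above can be illustrated with a toy model. This is only a sketch, not the Ceph code: `ToyFnode` is a hypothetical stand-in for `bluefs_fnode_t`, and its `extents_index` is a plain list of start offsets rather than the real log2 seek cache. It shows the two ideas named in the fix: truncate() shrinking both vectors together, and seek() asserting that it never lands outside `extents`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for bluefs_fnode_t (not the real structure).
struct ToyFnode {
  std::vector<uint64_t> extents;        // length of each extent
  std::vector<uint64_t> extents_index;  // start offset of each extent

  void append_extent(uint64_t len) {
    uint64_t start =
        extents.empty() ? 0 : extents_index.back() + extents.back();
    extents_index.push_back(start);
    extents.push_back(len);
  }

  // Drop every extent starting at or beyond new_size, keeping both
  // vectors the same length, then trim the last surviving extent.
  void truncate(uint64_t new_size) {
    while (!extents.empty() && extents_index.back() >= new_size) {
      extents.pop_back();
      extents_index.pop_back();  // the step a broken truncate would skip
    }
    if (!extents.empty()) {
      uint64_t start = extents_index.back();
      extents.back() = std::min(extents.back(), new_size - start);
    }
  }

  // Index of the extent containing offset, or extents.size() if none does.
  std::size_t seek(uint64_t offset) const {
    assert(extents_index.size() == extents.size());
    auto it = std::upper_bound(extents_index.begin(), extents_index.end(),
                               offset);
    if (it == extents_index.begin()) return extents.size();  // no extents
    std::size_t i = static_cast<std::size_t>(it - extents_index.begin()) - 1;
    assert(i < extents.size());  // never jump outside extents
    if (offset >= extents_index[i] + extents[i]) return extents.size();
    return i;
  }
};
```

With this model, truncating a 10-extent file to 0 and then appending one extent leaves one entry in each vector, so a later seek cannot reference a stale index entry.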
* | crimson: add missing includes | Max Kellermann | 2024-12-10 | 1 | -0/+7
|/
|     Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
* os/bluestore: Make truncate() drop unused allocations | Adam Kupczyk | 2024-10-08 | 1 | -13/+52
|     truncate() now drops unused allocations.
|     Modified Close() in BlueRocksEnv to call truncate() unconditionally.
|
|     Fixes: https://tracker.ceph.com/issues/68385
|     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* os/bluestore: passing device type name parameter to kernel device | Yite Gu | 2024-08-08 | 1 | -1/+5
|     Signed-off-by: Yite Gu <yitegu0@gmail.com>
* Merge pull request #52489 from ifed01/wip-ifed-alloc2 | Adam Kupczyk | 2024-08-07 | 1 | -1/+2
|\
| |     os/bluestore: introduce hybrid_btree2 allocator
| * os/bluestore: uniform allocator's error handling | Igor Fedotov | 2024-07-11 | 1 | -1/+2
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | os/bluestore: Bluefs, expand api for getting BlockDevice on BD/WAL | Adam Kupczyk | 2024-07-22 | 1 | -0/+7
| |     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | os/bluestore: Add fsck procedure for bdev multi labels | Adam Kupczyk | 2024-07-22 | 1 | -1/+1
| |     fsck can now properly detect collisions between labels and object data / bluefs files.
| |     Additional labels have lower precedence; they never overwrite other data.
| |     If a label collides with object data, the object is moved elsewhere.
| |     If a label collides with a bluefs file, the collision is left unresolved.
| |
| |     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | Merge pull request #57722 from sajibreadd/wip-62500 | Adam Kupczyk | 2024-07-17 | 1 | -0/+9
|\ \
| |/
|/|
| |     os/bluestore: Warning added for slow operations and stalled read
| * Warning added for slow operations and stalled read in BlueStore. User can ↵ | sajibreadd | 2024-06-26 | 1 | -0/+9
| |     control how long the warning persists after the last occurrence, and the
| |     maximum number of operations that is considered as the threshold for the warning.
| |
| |     Fixes: https://tracker.ceph.com/issues/62500
| |     Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
* | Merge pull request #57369 from YiteGu/bluestore-offline-trim | Adam Kupczyk | 2024-07-09 | 1 | -0/+31
|\ \
| | |     tools/bluestore: Add command 'trim' to ceph-bluestore-tool
| * | tools/bluestore: Add command 'trim' to ceph-bluestore-tool | yite.gu | 2024-05-16 | 1 | -0/+31
| | |     Add command 'trim' to ceph-bluestore-tool.
| | |
| | |     Co-authored-by: Igor Fedotov <igor.fedotov@croit.io>
| | |     Signed-off-by: Yite Gu <yitegu0@gmail.com>
* | | Merge pull request #57015 from liangmingyuanneo/wip-bluefs-max-alloc-size | Yuri Weinstein | 2024-06-05 | 1 | -1/+41
|\ \ \
| |_|/
|/| |
| | |     bluefs: bluefs alloc unit should only be shrink
| | |
| | |     Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * | bluefs: bluefs alloc unit should only be shrink | liangmingyuan | 2024-05-24 | 1 | -1/+41
| |/
| |     Changing the alloc unit is already forbidden for bluestore; in addition,
| |     increasing it should be forbidden in bluefs. Otherwise it can lead to a
| |     coredump or bad data. Taking the Bitmap Allocator as an example, the
| |     problem shows up in two ways:
| |     a) in BitmapAllocator::init_rm_free(offset, length), (offset + length)
| |        should be bigger than offs. But when get_min_alloc_size() becomes
| |        bigger, this cannot be guaranteed.
| |     b) if init_rm_free() happens to succeed, then during rocksdb compaction,
| |        when release() is called, it releases a small extent but may lead to
| |        larger extents being released to the Bitmap. As a result, rocksdb data
| |        is corrupted and the osd cannot boot again.
| |
| |     Signed-off-by: Mingyuan Liang <liangmingyuan@baidu.com>
* / os/bluestore: fix the problem that estimate the log size incorrectly | wanglinke | 2024-03-28 | 1 | -1/+1
|/
|     In BlueFS::_estimate_log_size_N, the total size of the dir was calculated incorrectly.
|
|     Fixes: https://tracker.ceph.com/issues/65176
|     co-author: Jrchyang Yu <yuzhiqiang_yewu@cmss.chinamobile.com>
|     Signed-off-by: Wang Linke <wanglinke_yewu@cmss.chinamobile.com>
* os/bluestore: fix bluefs perf counters about l_bluefs_log_compactions | wanglinke | 2024-02-22 | 1 | -1/+0
|     In BlueFS::_compact_log_sync_LNF_LD, l_bluefs_log_compactions is counted twice.
|
|     Fixes: https://tracker.ceph.com/issues/64533
|     co-author: Jrchyang Yu <yuzhiqiang_yewu@cmss.chinamobile.com>
|     Signed-off-by: Wang Linke <wanglinke_yewu@cmss.chinamobile.com>
* Merge pull request #55054 from pereman2/zns-remove | Adam Kupczyk | 2024-02-06 | 1 | -1/+0
|\
| |     os/bluestore: remove zoned namespace support
| |
| |     It has never been finished and now it's in the way of future improvements.
| * os/bluestore: remove zoned namespace support | Pere Diaz Bou | 2024-01-03 | 1 | -1/+0
| |     Lately we've been adding a lot of commits that could've interfered with
| |     smr support, but since no one is actively reviewing/supporting smr in
| |     bluestore, it doesn't make sense for us to maintain it.
| |
| |     Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
* | os/bluestore: add perfcount for bluestore/bluefs allocator | yite.gu | 2023-12-19 | 1 | -3/+56
|/
|     Allocator performance is the limiting factor on high-performance storage.
|     These perf counters can help us observe the reasons for changes in
|     bluestore performance.
|
|     Signed-off-by: Yite Gu <yitegu0@gamil.com>
* Merge pull request #54102 from ifed01/wip-ifed-better-vselector-calls | Yuri Weinstein | 2023-12-13 | 1 | -32/+35
|\
| |     os/bluestore: rework vselector calls.
| |
| |     Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * os/bluestore: rework vselector calls | Igor Fedotov | 2023-11-14 | 1 | -32/+35
| |     We can now provide the fnode delta to vselector, which is a bit more efficient.
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | os/bluestore: adjust and validate bluefs_shared_alloc_size | Igor Fedotov | 2023-12-04 | 1 | -7/+15
|/
|     Make sure it's in sync with bluestore_min_alloc_size
|     (meaning it's higher or equal and properly aligned).
|
|     Fixes: https://tracker.ceph.com/issues/63618
|     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
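The "higher or equal and properly aligned" condition from the commit above can be written down as a small check. This is a sketch under assumptions: the helper names `shared_alloc_in_sync` and `round_up_to` are hypothetical, and rounding up to the next multiple is only one possible adjustment; the commit message does not say how BlueFS actually adjusts the value.

```cpp
#include <cstdint>

// Hypothetical check: shared_alloc is "in sync" with min_alloc when it is
// at least as large and an exact multiple of it.
bool shared_alloc_in_sync(uint64_t shared_alloc, uint64_t min_alloc) {
  return min_alloc != 0 && shared_alloc >= min_alloc &&
         shared_alloc % min_alloc == 0;
}

// Hypothetical adjustment: round value up to the next multiple of unit.
// unit must be non-zero.
uint64_t round_up_to(uint64_t value, uint64_t unit) {
  return ((value + unit - 1) / unit) * unit;
}
```

For instance, with a 4K min alloc size, a 64K shared alloc size passes the check, while 0x1800 fails it and would round up to 0x2000.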
* Merge pull request #53597 from ifed01/wip-ifed-bluefs-perf-counters | Yuri Weinstein | 2023-11-13 | 1 | -11/+47
|\
| |     os/bluestore: add more bluefs perf counters
| |
| |     Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
| |     Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * os/bluestore: a bit more effective file_map handling in BlueFS | Igor Fedotov | 2023-10-02 | 1 | -8/+8
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: add more latency tracking perf counters into BlueFS | Igor Fedotov | 2023-10-02 | 1 | -3/+39
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | os/bluestore: fix _extend_log seq advance | Pere Diaz Bou | 2023-09-29 | 1 | -2/+3
|/
|     When extending the log, the sequence was left in a bad state: it would
|     first create a transaction for the update with the current seq number,
|     but leave the "real" transaction with the same sequence number, which
|     should be `extend_log_transaction.seq + 1`.
|
|     Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
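The sequencing rule in the commit above (the "real" transaction must carry `extend_log_transaction.seq + 1`) can be sketched with a toy log. `ToyTxn` and `ToyLog` are hypothetical types for illustration only and do not mirror the BlueFS log writer.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical record of one written transaction.
struct ToyTxn {
  uint64_t seq;
  bool is_extend;  // true for the log-extension transaction
};

// Hypothetical log that stamps each transaction with the next seq number.
struct ToyLog {
  uint64_t seq_live = 1;  // seq to stamp on the next transaction
  std::vector<ToyTxn> written;

  // The extension transaction consumes the current seq number...
  void write_extend() {
    written.push_back({seq_live, true});
    ++seq_live;  // ...so the counter must advance before the real txn.
  }

  // The "real" transaction then gets extend seq + 1.
  void write_real() {
    written.push_back({seq_live, false});
    ++seq_live;
  }
};
```

Omitting the increment in `write_extend()` would reproduce the reported bad state, with both transactions carrying the same seq number.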
* Merge pull request #50325 from ifed01/wip-ifed-reserved-by-bluefs | Laura Flores | 2023-09-06 | 1 | -2/+19
|\
| |     os/bluestore: make BlueFS an exclusive selector for volume reserved
| * os/bluestore: make BlueFS an exclusive selector for volume reserved | Igor Fedotov | 2023-07-06 | 1 | -2/+19
| |     block size.
| |
| |     Signed-off-by: Igor Fedotov <ifedotov@croit.io>
* | os/bluestore: fix bluefs log runway enospc | Pere Diaz Bou | 2023-08-11 | 1 | -78/+79
|/
|     With these changes, every call to log compaction will try to expand its
|     runway in case of insufficient log space. Async compaction will ignore
|     the `log_forbidden_to_expand` atomic since we know it shouldn't be
|     harmful. In any other case, expansion of the log will wait until
|     compaction is completed.
|
|     In order to ensure op_file_update_inc fits on disk, we increase the size
|     of logs as previously done in _maybe_extend_log. This means we also
|     bring back _maybe_extend_log with a different usage.
|
|     _maybe_extend_log increases the size of the log if the runway is less
|     than the min runway and if the current transaction is too big to fit.
|
|     Fixes: https://tracker.ceph.com/issues/58759
|     Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
* Merge pull request #50245 from baergj/bluefs-perf-stats-write-count | Yuri Weinstein | 2023-04-11 | 1 | -0/+21
|\
| |     os/bluestore: Add bluefs write op count metrics.
| |
| |     Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * os/bluestore/bluefs: Add write op count metrics. | Joshua Baergen | 2023-02-27 | 1 | -0/+21
| |     There were already several metrics counting bytes written to the
| |     various regions of bluefs but nothing counting ops. Also add a sum of
| |     bytes written to match the read and new write count metrics.
| |
| |     This provides more insight behind the cause of
| |     https://tracker.ceph.com/issues/58530.
| |
| |     Signed-off-by: Joshua Baergen <jbaergen@digitalocean.com>
* | Merge pull request #50185 from ethanwu-syno/bluefs_tracker_56210 | Yuri Weinstein | 2023-04-10 | 1 | -12/+11
|\ \
| | |     os/bluestore/bluefs: fix dir_link might add link that already exists in compact log
| | |
| | |     Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| | |     Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | os/bluestore/bluefs: fix dir_link might add link that already exists in ↵ | ethanwu | 2023-03-01 | 1 | -12/+11
| |/
| |     compact log
| |
| |     After commit eac1807cf5f19dd79eb95bcb0cde80c67acb69f8
| |     os/bluestore/bluefs: Weaken locks in open_for_write
| |
| |     There's a race window between open_for_write and log compaction
| |
| |     Process A                               Process B
| |     open_for_write                          _compact_log_async_LD_LNF_D
| |                                             log.lock
| |     node.lock
| |     ...
| |     update nodes.dir_map(add dirlink A)     node.lock(wait for process A)
| |     node.unlock
| |     ...
| |     log.lock(wait for Process B)            <get lock>
| |     ...                                     compact log(create log based on nodes.dir_map which has dirlink A)
| |     ...                                     ...
| |     ...                                     ...
| |     ...                                     node.unlock()
| |     ...                                     log.unlock
| |     <get lock>
| |     log file create event(dirlink A)
| |     log.unlock
| |
| |     After the above case, the bluefs log will have something like this
| |
| |     0x0: txn(seq 1 len 0x141ee crc 0x3e1c626f)
| |     0x0: op_init
| |     0x0: op_file_update file(ino 2524749 size 0x246b6 mtime 2023-02-08T03:07:19.950963+0800 allocated 30000 alloc_commit 30000 extents [1:0xa135e0000~30000])
| |     0x0: op_file_update file(ino 2524746 size 0x175af mtime 2023-02-08T03:07:19.771584+0800 allocated 20000 alloc_commit 20000 extents [1:0xa13530000~20000])
| |     ...
| |     0x0: op_dir_link db/2524749.sst to 2524751
| |     0x0: op_dir_link db/2524750.sst to 2524752
| |     0x0: op_dir_link db/CURRENT to 2491157
| |     ...
| |     0x0: op_jump seq 18414993 offset 0x20000
| |     0x20000: txn(seq 18414994 len 0x65 crc 0xc1f9ec5f)
| |     0x20000: op_file_update file(ino 2524752 size 0x0 mtime 2023-02-08T03:07:20.205074+0800 allocated 0 alloc_commit 0 extents [])
| |     0x20000: op_dir_link db/2524750.sst to 2524752
| |
| |     dir_link db/2524750.sst to 2524752 exists in both the compacted log
| |     (txn seq 1) and log txn seq 18414994.
| |
| |     If log compaction does not happen later, or an abnormal shutdown happens,
| |     the next bluefs mount replay will fail at the following assert
| |
| |     2023-02-10T11:05:09.826+0800 7f1f97b71280 10 bluefs _replay 0x20000: txn(seq 18414994 len 0x65 crc 0xc1f9ec5f)
| |     2023-02-10T11:05:09.826+0800 7f1f97b71280 20 bluefs _replay 0x20000: op_file_update file(ino 2524752 size 0x0 mtime 2023-02-08T03:07:20.205074+0800 allocated 0 alloc_commit 0 extents [])
| |     2023-02-10T11:05:09.826+0800 7f1f97b71280 20 bluefs _replay 0x20000: op_dir_link db/2524750.sst to 2524752
| |     2023-02-10T11:05:09.832+0800 7f1f97b71280 -1 //source/ceph/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f1f97b71280 time 2023-02-10T11:05:09.827662+0800
| |     //source/ceph/src/os/bluestore/BlueFS.cc: 1419: FAILED ceph_assert(r == q->second->file_map.end())
| |
| |     Refer to other operations that update the node and add a log entry
| |     at the same time, such as rename.
| |     Fixed this by taking the log lock and node lock at the beginning of the
| |     function (following lock ordering, so log lock first), i.e. N_LD -> LND
| |
| |     Fixes: https://tracker.ceph.com/issues/56210
| |     Signed-off-by: ethanwu <ethanwu@synology.com>
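The lock-ordering fix described in the commit above (take the log lock first, then the node lock, and hold both while the directory map and the log are updated) can be sketched as follows. `ToyFs` and all of its members are hypothetical; in particular, the real compaction acquires and releases its locks in phases, whereas this toy holds both for the whole call.

```cpp
#include <map>
#include <mutex>
#include <set>
#include <string>
#include <vector>

// Hypothetical miniature of the two structures involved in the race.
struct ToyFs {
  std::mutex log_lock;    // always taken first
  std::mutex nodes_lock;  // always taken second
  std::map<std::string, int> dir_map;  // name -> ino
  std::vector<std::string> log;        // op_dir_link entries, in order

  // Both locks are held for the whole update, so compaction can never see
  // the dir_map entry without its log entry (or the other way round).
  void open_for_write(const std::string& name, int ino) {
    std::lock_guard<std::mutex> l(log_lock);
    std::lock_guard<std::mutex> n(nodes_lock);
    dir_map[name] = ino;
    log.push_back(name);
  }

  // Rebuild the log from dir_map, using the same lock order.
  void compact_log() {
    std::lock_guard<std::mutex> l(log_lock);
    std::lock_guard<std::mutex> n(nodes_lock);
    log.clear();
    for (const auto& kv : dir_map) log.push_back(kv.first);
  }

  // Replay succeeds only if no link appears twice in the log.
  bool replay_ok() {
    std::lock_guard<std::mutex> l(log_lock);
    std::set<std::string> seen;
    for (const auto& name : log) {
      if (!seen.insert(name).second) return false;  // duplicate op_dir_link
    }
    return true;
  }
};
```

Because every path acquires `log_lock` before `nodes_lock`, the two operations cannot deadlock on each other, and a link is written to the log exactly once.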
* | os/bluestore: BlueFS: harmonize log read and writes modes | Adam Kupczyk | 2023-03-10 | 1 | -2/+5
| |     The BlueFS log has always been written in non-buffered mode.
| |     Reading it depends on the bluefs_buffered_io option.
| |     It is strongly suspected that this causes some weird problems.
| |
| |     Possibly fixes: https://tracker.ceph.com/issues/54019
| |     Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | common: Add labeled perf counters | Ali Maredia | 2023-02-23 | 1 | -1/+1
|/
|     Add the ability to dump labeled perf counters for a daemon.
|     Labeled perf counters are stored in a CephContext's PerfCountersCollection.
|
|     Labeled and unlabeled perf counters are dumped to the admin socket
|     via the `counters dump` command.
|
|     The schemas for labeled and unlabeled perf counters are dumped to the
|     admin socket via the `counters schema` command.
|
|     This commit includes docs and additional unit tests.
|
|     Signed-off-by: Ali Maredia <amaredia@redhat.com>
* Merge pull request #48854 from ifed01/wip-ifed-small-chunk-bluefs | Yuri Weinstein | 2023-01-26 | 1 | -364/+628
|\
| |     os/bluestore: enable 4K allocation unit for BlueFS
| * os/bluestore: introduce a cooldown period for failed BlueFS allocations. | Igor Fedotov | 2022-11-16 | 1 | -6/+36
| |     When using bluefs_shared_alloc_size one might get into a long-lasting
| |     state where such large chunks are no longer available and a fallback
| |     to the shared device's min alloc size occurs. The introduced cooldown is
| |     intended to prevent repeated allocation attempts with
| |     bluefs_shared_alloc_size for a while. The rationale is to eliminate the
| |     performance penalty these failing attempts might cause.
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
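The cooldown idea in the commit above can be sketched as a small state machine: after a failed attempt with the large allocation unit, skip further large attempts until a deadline passes and fall back to the small unit meanwhile. `ToyCooldown`, `pick_alloc_unit`, and the explicit `now` parameter are hypothetical and exist only for illustration; the commit message does not describe how BlueFS tracks the period.

```cpp
#include <cstdint>

// Hypothetical cooldown tracker; time is passed in explicitly (seconds).
struct ToyCooldown {
  uint64_t cooldown_secs;
  uint64_t deadline = 0;  // large-unit attempts are skipped while now < deadline

  explicit ToyCooldown(uint64_t secs) : cooldown_secs(secs) {}

  bool should_try_large(uint64_t now) const { return now >= deadline; }
  void on_large_failure(uint64_t now) { deadline = now + cooldown_secs; }
};

// Returns the allocation unit that ends up being used.
// large_available models whether a large-unit allocation would succeed.
uint64_t pick_alloc_unit(ToyCooldown& cd, uint64_t now, bool large_available,
                         uint64_t large_unit, uint64_t small_unit) {
  if (cd.should_try_large(now)) {
    if (large_available) return large_unit;
    cd.on_large_failure(now);  // start the cooldown, then fall back
  }
  return small_unit;
}
```

During the cooldown the large unit is not even attempted, which is where the saving comes from: the failing attempt itself is what the commit describes as costly.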
| * os/bluestore: get rid off BlueFS::allocate_without_fallback. | Igor Fedotov | 2022-11-16 | 1 | -74/+29
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: support main/slow device's alloc unit for BlueFS. | Igor Fedotov | 2022-11-16 | 1 | -41/+81
| |     This effectively enables having 4K allocation units for BlueFS.
| |     But it doesn't turn it on by default for the sake of performance.
| |     Using a main device that lacks enough large contiguous free extents
| |     might do the trick, though.
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: output cosmetics for BlueFS | Igor Fedotov | 2022-11-16 | 1 | -2/+8
| |     This includes finer position specification during replay
| |     and logging read size in hex.
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: new BlueFS perf counters on compaction. | Igor Fedotov | 2022-11-16 | 1 | -0/+23
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: prepend compacted BlueFS log with a starter part. | Igor Fedotov | 2022-11-16 | 1 | -150/+421
| |     The rationale is to have the initial log fnode after compaction small
| |     enough to fit into the 4K superblock. Without that, compacted metadata
| |     might require an fnode longer than 4K, which goes beyond the existing 4K
| |     superblock. BlueFS asserts in this case for now.
| |     Hence the resulting log allocation disposition looks like:
| |     - superblock(4K) keeps initial log fnode which refers:
| |       op_init, op_update_inc(log), op_jump(next seq)
| |     - updated log fnode built from superblock + above op_update_inc refers:
| |       compacted meta (a bunch of op_update and others)
| |     - *
| |     - more op_update_inc(log) to follow if log is extended
| |     - *
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: increment Bluefs::super.version at _write_super | Igor Fedotov | 2022-11-10 | 1 | -3/+2
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: introduce method to estimate BlueFS transaction size | Igor Fedotov | 2022-11-10 | 1 | -11/+17
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: simplify and cleanup BlueFS::_compact_log_async_...() | Igor Fedotov | 2022-11-10 | 1 | -48/+16
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: get rid off BlueFS::_compact_log_async_dump_metadata_NF() | Igor Fedotov | 2022-11-10 | 1 | -59/+33
| |     We can reuse _compact_log_dump_metadata_NF() instead
| |
| |     Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * os/bluestore: unify allocation functions' signature at BlueFS. | Igor Fedotov | 2022-11-10 | 1 | -59/+51
| |     Signed-off-by: Igor Fedotov <ifedotov@croit.io>
* | blk/KernelDevice: don't start discard thread if device not support_discard | haoyixing | 2022-10-26 | 1 | -13/+10
|/
|     Only create the discard thread if the device supports discard; otherwise
|     we will have some threads which do nothing.
|
|     Also extract the queue_discard/discard logic to the device, making it
|     cleaner to call discard from bluefs and bluestore.
|
|     Signed-off-by: haoyixing <haoyixing@kuaishou.com>