| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| | |
aclamk/wip-aclamk-bluefs-truncate-allocations-main
os/bluestore: Make truncate() drop unused allocations - addendum
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Review fixes. Removed overcautious assert.
Improved if .. else style.
Skipped processing extent truncation when seek() goes to end.
Fixes: https://tracker.ceph.com/issues/68385 (addendum)
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In `struct bluefs_fnode_t` there is a vector `extents` and
a vector `extents_index` that is a log2 seek cache.
Until the modifications to truncate() we never removed extents from files.
The modified truncate() did not update extents_index.
For example, a file 10 extents long, when truncated to 0, will have:
0 extents, 10 extents_index.
After writing some data to the file:
1 extents, 11 extents_index.
Now, `bluefs_fnode_t::seek` will binary search extents_index;
let's say it located the seek at item #3.
It will then jump up from extent #0 (which exists) to extent #3, which
does not exist at all.
The worst part is that the code is now broken, as #3 != extent.end().
There are 3 parts of the fix:
1) assert in `bluefs_fnode_t::seek` to protect against
jumping outside extents
2) code in BlueFS::truncate to sync up `extents_index` with `extents`
3) dampening down assert in _replay to give a way out of cases
where incorrect "offset 12345" (12345 is file size) instead of
"offset 20000" (allocations occupied) was written to log.
Fixes: https://tracker.ceph.com/issues/69481
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
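The mismatch between `extents` and `extents_index` described in this commit can be illustrated with a minimal Python sketch. This is a simplified model, not the actual `bluefs_fnode_t` code: the class `FNode`, its method names, and the choice to store extent lengths in `extents` and logical start offsets in `extents_index` are all assumptions made for illustration.

```python
from bisect import bisect_right


class FNode:
    """Hypothetical, simplified stand-in for bluefs_fnode_t."""

    def __init__(self):
        self.extents = []        # extent lengths (assumed representation)
        self.extents_index = []  # logical start offset of each extent
        self.allocated = 0

    def append_extent(self, length):
        self.extents_index.append(self.allocated)
        self.extents.append(length)
        self.allocated += length

    def truncate_buggy(self):
        # Drops the extents but leaves the seek cache untouched.
        self.extents.clear()
        self.allocated = 0

    def truncate_fixed(self):
        # Keeps the seek cache in sync with the extents.
        self.extents.clear()
        self.extents_index.clear()
        self.allocated = 0

    def seek(self, offset):
        # Binary search the cache, then guard against leaving `extents`.
        pos = bisect_right(self.extents_index, offset) - 1
        if not 0 <= pos < len(self.extents):
            raise AssertionError("seek jumped outside extents")
        return pos, offset - self.extents_index[pos]
```

With ten 4096-byte extents, a buggy truncate, and one new 65536-byte extent, the model ends up with 1 extent and 11 index entries, and `seek(12288)` lands on index item #3, which has no matching extent. The guard in `seek` turns that into an explicit failure, in the spirit of part 1 of the fix.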
|
|/
|
|
| |
Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
|
|
|
|
|
|
|
|
|
| |
Now that truncate() drops unused allocations,
Close() in BlueRocksEnv is modified to call truncate unconditionally.
Fixes: https://tracker.ceph.com/issues/68385
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
|
|
|
|
| |
Signed-off-by: Yite Gu <yitegu0@gmail.com>
|
|\
| |
| | |
os/bluestore: introduce hybrid_btree2 allocator
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Now fsck can properly detect collisions between labels and object data / bluefs files.
Additional labels have lower precedence; they never overwrite other data.
If a label collides with object data, the object is moved somewhere else.
If a label collides with a bluefs file, the collision is left unresolved.
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
|
|\ \
| |/
|/| |
os/bluestore: Warning added for slow operations and stalled read
|
| |
| |
| |
| |
| |
| |
| | |
control how long the warning should persist after the last occurrence, and the maximum number of operations that will be considered as a threshold for the warning.
Fixes: https://tracker.ceph.com/issues/62500
Signed-off-by: Md Mahamudur Rahaman Sajib <mahamudur.sajib@croit.io>
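One way to read the two controls described in this commit (a persistence time and an operation-count threshold) is sketched below in Python. The class `SlowOpTracker`, its parameters, and the exact comparison used are assumptions for illustration; the commit text does not state how BlueStore combines the two values.

```python
class SlowOpTracker:
    """Hypothetical model: warn while enough recent slow ops are remembered."""

    def __init__(self, lifetime_s, threshold):
        self.lifetime_s = lifetime_s  # how long an occurrence keeps counting
        self.threshold = threshold    # occurrences needed to raise the warning
        self.events = []              # timestamps of slow operations

    def record(self, now):
        self.events.append(now)

    def should_warn(self, now):
        # Forget occurrences older than the lifetime, then compare the count.
        self.events = [t for t in self.events if now - t <= self.lifetime_s]
        return len(self.events) >= self.threshold
```

In this model the warning clears on its own once the remembered occurrences age out.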
|
|\ \
| | |
| | | |
tools/bluestore: Add command 'trim' to ceph-bluestore-tool
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add command 'trim' to ceph-bluestore-tool.
Co-authored-by: Igor Fedotov <igor.fedotov@croit.io>
Signed-off-by: Yite Gu <yitegu0@gmail.com>
|
|\ \ \
| |_|/
|/| |
| | |
| | | |
bluefs: bluefs alloc unit should only be shrink
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Changing the alloc unit is already forbidden for bluestore; in addition,
increasing it should be forbidden in bluefs. Otherwise, it can lead to a
coredump or bad data. To explain using the Bitmap Allocator, there are two
manifestations:
a) in BitmapAllocator::init_rm_free(offset, length),
(offset + length) should be bigger than offs. But when get_min_alloc_size()
becomes bigger, this cannot be guaranteed.
b) if init_rm_free() happens to
succeed by luck, then during rocksdb compaction, when release() is called,
it releases a small extent but may lead to larger extents being released to
the Bitmap. As a result, rocksdb data is corrupted, and the osd cannot be
booted again.
Signed-off-by: Mingyuan Liang <liangmingyuan@baidu.com>
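Case b) can be illustrated with a small Python sketch of unit-granular release. It assumes an allocator that tracks whole allocation units and frees every unit an extent touches; the function `released_span` is hypothetical and is not the BitmapAllocator code.

```python
def released_span(offset, length, alloc_unit):
    """Return (start, length) of the span freed when tracking whole units.

    Assumption: every allocation unit touched by [offset, offset + length)
    is marked free, so the span is rounded outward to unit boundaries.
    """
    first = offset // alloc_unit
    last = (offset + length - 1) // alloc_unit
    return first * alloc_unit, (last - first + 1) * alloc_unit
```

Under that assumption, releasing a 0x1000-byte extent at offset 0x1000 frees exactly that extent with a 0x1000 unit, but frees the whole range 0x0~0x10000 with a 0x10000 unit, including neighbouring data that is still in use.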
|
|/
|
|
|
|
|
|
|
| |
In BlueFS::_estimate_log_size_N, the total size of
the dir was calculated incorrectly.
Fixes: https://tracker.ceph.com/issues/65176
co-author: Jrchyang Yu <yuzhiqiang_yewu@cmss.chinamobile.com>
Signed-off-by: Wang Linke <wanglinke_yewu@cmss.chinamobile.com>
|
|
|
|
|
|
|
|
|
| |
In BlueFS::_compact_log_sync_LNF_LD, l_bluefs_log_compactions
is counted twice.
Fixes: https://tracker.ceph.com/issues/64533
co-author: Jrchyang Yu <yuzhiqiang_yewu@cmss.chinamobile.com>
Signed-off-by: Wang Linke <wanglinke_yewu@cmss.chinamobile.com>
|
|\
| |
| |
| | |
os/bluestore: remove zoned namespace support
It has never been finished and now it's in the way of future improvements.
|
| |
| |
| |
| |
| |
| |
| |
| | |
Lately we've been adding a lot of commits that could've interfered with
smr support, but since no one is actively reviewing/supporting smr in
bluestore, it doesn't make sense for us to maintain it.
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
|
|/
|
|
|
|
|
|
|
| |
Allocator performance is the limiting factor
on high-performance storage. This performance counter can help
us intuitively observe the reasons for changes in bluestore
performance.
Signed-off-by: Yite Gu <yitegu0@gamil.com>
|
|\
| |
| |
| |
| |
| | |
os/bluestore: rework vselector calls.
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |
| |
| |
| |
| |
| |
| | |
We can provide the fnode delta to vselector now, which is a bit more
effective.
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|/
|
|
|
|
|
|
| |
Make sure it's in sync (meaning it's higher or equal and properly aligned)
with bluestore_min_alloc_size.
Fixes: https://tracker.ceph.com/issues/63618
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
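The "in sync" condition above (higher or equal, and properly aligned) can be written as a one-line check. This Python sketch is illustrative only: the function name `in_sync` and the parameter `candidate` are hypothetical, and "properly aligned" is assumed to mean "a whole multiple of bluestore_min_alloc_size".

```python
def in_sync(candidate, min_alloc_size):
    """True if candidate is >= min_alloc_size and a whole multiple of it."""
    return candidate >= min_alloc_size and candidate % min_alloc_size == 0
```

For example, with a 4096-byte minimum, 65536 passes, while 6144 (not a multiple) and 2048 (too small) do not.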
|
|\
| |
| |
| |
| |
| |
| | |
os/bluestore: add more bluefs perf counters
Reviewed-by: Pere Diaz Bou <pdiazbou@redhat.com>
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|/
|
|
|
|
| |
When extending the log, the sequence was left in a bad state: it would first create a transaction to update with the current seq number, but leave the "real" transaction with the same sequence number, which should be `extend_log_transaction.seq + 1`.
Signed-off-by: Pere Diaz Bou <pdiabou@redhat.com>
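The sequence problem described here can be shown with a toy Python model. Both functions are hypothetical and only mirror the wording of the message; they return the pair (seq of the extend transaction, seq of the "real" transaction).

```python
def assign_seqs_buggy(current_seq):
    # Both transactions end up with the same sequence number.
    extend_seq = current_seq
    real_seq = current_seq
    return extend_seq, real_seq


def assign_seqs_fixed(current_seq):
    # The "real" transaction follows the extend transaction: seq + 1.
    extend_seq = current_seq
    real_seq = extend_seq + 1
    return extend_seq, real_seq
```

A replay that expects strictly increasing sequence numbers would accept the second pair but not the first.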
|
|\
| |
| | |
os/bluestore: make BlueFS an exclusive selector for volume reserved
|
| |
| |
| |
| |
| |
| | |
block size.
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With these changes, every call to log compaction will try to expand its
runway in case of insufficient log space. Async compaction will ignore
the `log_forbidden_to_expand` atomic since we know it shouldn't be
harmful. In any other case, expansion of the log will wait until compaction
is completed.
In order to ensure op_file_update_inc fits on disk, we increase the size
of logs as previously done in _maybe_extend_log. This also means we bring
back _maybe_extend_log with a different usage.
_maybe_extend_log increases the size of the log if the runway is less
than the min runway and if the current transaction is too big to fit.
Fixes: https://tracker.ceph.com/issues/58759
Signed-off-by: Pere Diaz Bou <pere-altea@hotmail.com>
|
|\
| |
| |
| |
| | |
os/bluestore: Add bluefs write op count metrics.
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
There were already several metrics counting bytes written to the various
regions of bluefs but nothing counting ops.
Also add a sum of bytes written to match the read and new write count
metrics.
This provides more insight behind the cause of
https://tracker.ceph.com/issues/58530.
Signed-off-by: Joshua Baergen <jbaergen@digitalocean.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | | |
os/bluestore/bluefs: fix dir_link might add link that already exists in compact log
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
compact log
After commit eac1807cf5f19dd79eb95bcb0cde80c67acb69f8 (os/bluestore/bluefs: Weaken locks in open_for_write),
there is a race window between open_for_write and log compaction:
Process A                               Process B
open_for_write                          _compact_log_async_LD_LNF_D
                                        log.lock
node.lock                               ...
update nodes.dir_map(add dirlink A)     node.lock(wait for process A)
node.unlock                             ...
log.lock(wait for Process B)            <get lock>
...                                     compact log(create log based on nodes.dir_map which has dirlink A)
...                                     ...
...                                     ...
...                                     node.unlock()
...                                     log.unlock
<get lock>
log file create event(dirlink A)
log.unlock
After the above sequence, the bluefs log will contain something like this:
0x0: txn(seq 1 len 0x141ee crc 0x3e1c626f)
0x0: op_init
0x0: op_file_update file(ino 2524749 size 0x246b6 mtime 2023-02-08T03:07:19.950963+0800 allocated 30000 alloc_commit 30000 extents [1:0xa135e0000~30000])
0x0: op_file_update file(ino 2524746 size 0x175af mtime 2023-02-08T03:07:19.771584+0800 allocated 20000 alloc_commit 20000 extents [1:0xa13530000~20000])
...
0x0: op_dir_link db/2524749.sst to 2524751
0x0: op_dir_link db/2524750.sst to 2524752
0x0: op_dir_link db/CURRENT to 2491157
...
0x0: op_jump seq 18414993 offset 0x20000
0x20000: txn(seq 18414994 len 0x65 crc 0xc1f9ec5f)
0x20000: op_file_update file(ino 2524752 size 0x0 mtime 2023-02-08T03:07:20.205074+0800 allocated 0 alloc_commit 0 extents [])
0x20000: op_dir_link db/2524750.sst to 2524752
dir_link db/2524750.sst to 2524752 exists in both the compacted log (txn seq 1) and log txn seq 18414994.
If log compaction does not happen later, or an abnormal shutdown happens,
the next bluefs mount replay will fail at the following assert:
2023-02-10T11:05:09.826+0800 7f1f97b71280 10 bluefs _replay 0x20000: txn(seq 18414994 len 0x65 crc 0xc1f9ec5f)
2023-02-10T11:05:09.826+0800 7f1f97b71280 20 bluefs _replay 0x20000: op_file_update file(ino 2524752 size 0x0 mtime 2023-02-08T03:07:20.205074+0800 allocated 0 alloc_commit 0 extents [])
2023-02-10T11:05:09.826+0800 7f1f97b71280 20 bluefs _replay 0x20000: op_dir_link db/2524750.sst to 2524752
2023-02-10T11:05:09.832+0800 7f1f97b71280 -1 //source/ceph/src/os/bluestore/BlueFS.cc: In function 'int BlueFS::_replay(bool, bool)' thread 7f1f97b71280 time 2023-02-10T11:05:09.827662+0800
//source/ceph/src/os/bluestore/BlueFS.cc: 1419: FAILED ceph_assert(r == q->second->file_map.end())
As in other operations that update the node and add a log entry at the
same time, such as rename, this is fixed by taking the log lock and the node lock
at the beginning of the function (following lock ordering, so log lock first),
i.e. N_LD -> LND
Fixes: https://tracker.ceph.com/issues/56210
Signed-off-by: ethanwu <ethanwu@synology.com>
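The replay failure in this commit can be reproduced in miniature. The Python sketch below is a toy model of replay, not the BlueFS code: the function `replay`, the tuple layout of the ops, and the use of a plain dict for the directory's file map are assumptions. It only mirrors the rule implied by the failed assert, namely that an op_dir_link for a name already present in the directory is rejected.

```python
def replay(ops):
    """Replay (op, name, ino) tuples into a dict modelling one dir's file_map."""
    file_map = {}
    for op, name, ino in ops:
        if op == "op_dir_link":
            if name in file_map:
                # Models: FAILED ceph_assert(r == q->second->file_map.end())
                raise AssertionError(f"dir_link {name} already exists")
            file_map[name] = ino
        elif op == "op_dir_unlink":
            del file_map[name]
    return file_map
```

Feeding it the compacted log entry `op_dir_link db/2524750.sst to 2524752` followed by the same link from txn seq 18414994 raises the error, which is the situation the LND locking change is meant to prevent from being written in the first place.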
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
BlueFS log has always been written in non-buffered mode.
Reading of it depends on bluefs_buffered_io option.
It is strongly suspected that this causes some weird problems.
Possibly fixes: https://tracker.ceph.com/issues/54019
Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add the ability to dump labeled perf counters
for a daemon. Labeled perf counters are stored
in a CephContext's PerfCountersCollection.
Labeled and unlabeled perf counters are dumped
to the admin socket via `counters dump` command.
The schema for labeled and unlabeled perf
counters is dumped to the admin socket via
the `counters schema` command.
This commit includes docs and additional unit tests.
Signed-off-by: Ali Maredia <amaredia@redhat.com>
|
|\
| |
| |
| |
| | |
os/bluestore: enable 4K allocation unit for BlueFS
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When using bluefs_shared_alloc_size, one might get into a long-lasting state where
chunks that large are no longer available and a fallback to the shared
device min alloc size occurs. The introduced cooldown is intended to
prevent repetitive allocation attempts with bluefs_shared_alloc_size for
a while. The rationale is to eliminate the performance penalty these failing
attempts might cause.
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
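The cooldown idea can be sketched in Python as follows. This is a hedged illustration, not the BlueFS implementation: the class `CooldownAllocator`, the `try_alloc(want, unit)` callback, and the injected `clock` are all hypothetical names introduced here.

```python
class CooldownAllocator:
    """Skip large-unit attempts for a while after one of them fails."""

    def __init__(self, try_alloc, large_unit, small_unit, cooldown_s, clock):
        self.try_alloc = try_alloc    # callable(want, unit) -> bool
        self.large_unit = large_unit  # stands in for bluefs_shared_alloc_size
        self.small_unit = small_unit  # stands in for the device min alloc size
        self.cooldown_s = cooldown_s
        self.clock = clock            # callable() -> seconds
        self.retry_large_at = 0.0

    def allocate(self, want):
        now = self.clock()
        if now >= self.retry_large_at:
            if self.try_alloc(want, self.large_unit):
                return self.large_unit
            # Large chunks are unavailable: start the cooldown.
            self.retry_large_at = now + self.cooldown_s
        if self.try_alloc(want, self.small_unit):
            return self.small_unit
        return None
```

While the cooldown is active, each allocation costs one attempt instead of a failing large-unit attempt followed by the fallback.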
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This effectively enables having 4K allocation units for BlueFS.
But it doesn't turn it on by default for the sake of performance.
Using a main device that lacks enough free large contiguous extents
might do the trick though.
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| |
| |
| |
| | |
This includes finer position specification during replay
and logging read size in hex.
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The rationale is to have the initial log fnode after compaction small
enough to fit into the 4K superblock. Without that, compacted metadata might
require an fnode longer than 4K, which goes beyond the existing 4K
superblock. BlueFS asserts in this case for now.
Hence the resulting log allocation disposition is like:
- superblock(4K) keeps initial log fnode which refers:
op_init, op_update_inc(log), op_jump(next seq)
- updated log fnode built from superblock + above op_update_inc refers:
compacted meta (a bunch of op_update and others)
- *
- more op_update_inc(log) to follow if log is extended
- *
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| |
| |
| | |
We can reuse _compact_log_dump_metadata_NF() instead
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |
| |
| |
| | |
Signed-off-by: Igor Fedotov <ifedotov@croit.io>
|
|/
|
|
|
|
|
|
|
| |
Only create the discard thread if the device supports discard,
otherwise we will have some threads which do nothing.
Also extract the queue_discard/discard logic to the device, making it cleaner when
calling discard from bluefs and bluestore.
Signed-off-by: haoyixing <haoyixing@kuaishou.com>
|