path: root/src/os
Commit message [Author, Age, Files, Lines]
* Merge pull request #61314 from aclamk/wip-aclamk-bluefs-truncate-fix [Adam Kupczyk, 2025-01-14, 3 files, -2/+7]
|\
| |   os/bluestore: Fix BlueFS::truncate()
| * os/bluestore: Fix BlueFS::truncate() [Adam Kupczyk, 2025-01-13, 3 files, -2/+7]
| |   In `struct bluefs_fnode_t` there is a vector `extents` and a vector `extents_index` that is a log2 seek cache. Until the modifications to truncate() we never removed extents from files. The modified truncate() did not update extents_index.
| |
| |   For example, a file 10 extents long, when truncated to 0, will have: 0 extents, 10 extents_index. After writing some data to the file: 1 extents, 11 extents_index.
| |
| |   Now `bluefs_fnode_t::seek` will binary search extents_index; let's say it located the seek at item #3. It will then jump from extent #0 (which exists) to extent #3, which does not exist at all. The worst part is that the code is now broken, as #3 != extent.end().
| |
| |   There are 3 parts of the fix:
| |   1) assert in `bluefs_fnode_t::seek` to protect against jumping outside extents
| |   2) code in BlueFS::truncate to sync up `extents_index` with `extents`
| |   3) dampening down the assert in _replay to give a way out of cases where an incorrect "offset 12345" (12345 is the file size) instead of "offset 20000" (allocations occupied) was written to the log.
| |
| |   Fixes: https://tracker.ceph.com/issues/69481
| |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
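
The failure mode and parts 1) and 2) of the fix can be illustrated with a small stand-alone sketch. It is not the Ceph code: `ExtentSketch` and `FnodeSketch` are hypothetical stand-ins for the real types, and it assumes that each `extents_index` entry holds the starting logical offset of the extent at the same position.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-ins; not the real bluefs_fnode_t layout.
struct ExtentSketch {
  uint64_t length;
};

struct FnodeSketch {
  std::vector<ExtentSketch> extents;
  // Assumption: extents_index[i] is the logical offset where extents[i] starts.
  std::vector<uint64_t> extents_index;

  void append_extent(uint64_t length) {
    uint64_t start =
        extents.empty() ? 0 : extents_index.back() + extents.back().length;
    extents_index.push_back(start);
    extents.push_back({length});
  }

  // Drop extents that lie entirely beyond new_size and, crucially,
  // shrink extents_index by the same amount (part 2 of the fix).
  void truncate(uint64_t new_size) {
    std::size_t keep = 0;
    uint64_t covered = 0;
    while (keep < extents.size() && covered < new_size) {
      covered += extents[keep].length;
      ++keep;
    }
    extents.resize(keep);
    extents_index.resize(keep);
  }

  // Return the position of the extent containing offset,
  // or extents.size() when offset is past the last extent.
  std::size_t seek(uint64_t offset) const {
    auto it = std::upper_bound(extents_index.begin(), extents_index.end(),
                               offset);
    std::size_t count = static_cast<std::size_t>(it - extents_index.begin());
    if (count == 0) {
      return extents.size();  // empty file
    }
    std::size_t idx = count - 1;
    // Guard against an index that points outside extents (part 1 of the fix).
    assert(idx < extents.size());
    if (offset >= extents_index[idx] + extents[idx].length) {
      return extents.size();
    }
    return idx;
  }
};
```

With the `extents_index.resize(keep)` line removed, truncating a 10-extent file to 0 and appending one extent would leave stale index entries behind, and `seek` would trip the assertion instead of silently stepping past the end of `extents`.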
* | Merge pull request #58924 from imtzw/tzw_ikey_lat [Yuri Weinstein, 2025-01-13, 1 file, -0/+6]
|\ \
| | |   os/bluestore: record omapiter init latency
| | |
| | |   Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * | bluestore: record omapiter init latency [imtzw, 2024-07-31, 1 file, -0/+6]
| | |   If an object has many `internal keys` at the beginning of its omap, the underlying seek may be very slow to reach the first `user key` when initializing an omapiter. This can stall the OSD in build_push_op, which seeks to the recovering object's first omap key again and again.
| | |
| | |   Signed-off-by: imtzw <tongzhiwei_yewu@cmss.chinamobile.com>
* | | Merge pull request #60278 from rzarzynski/wip-os-fastomapiter [Yuri Weinstein, 2025-01-13, 9 files, -98/+295]
|\ \ \
| |_|/
|/| |
| | |   os, osd: bring the lightweight OMAP iteration
| | |
| | |   Reviewed-by: Casey Bodley <cbodley@redhat.com>
| | |   Reviewed-by: Matan Breizman <Matan.Brz@gmail.com>
| | |   Reviewed-by: Mark Kogan <mkogan@redhat.com>
| | |   Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| | |   Reviewed-by: Samuel Just <sjust@redhat.com>
| * | os, test: make omap_iterate obligatory for ObjectStores [Radoslaw Zarzynski, 2024-12-17, 1 file, -3/+1]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os/kstore: bring support for omap_iterate [Radoslaw Zarzynski, 2024-12-17, 2 files, -0/+72]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os/memstore: bring support for omap_iterate [Radoslaw Zarzynski, 2024-12-17, 2 files, -0/+49]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | crimson, os: put AlienStore::omap_get_values() on top of OS::omap_iterate() [Radoslaw Zarzynski, 2024-12-17, 5 files, -96/+0]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os/bluestore: reduce dependencies of omap_iterate()'s loop on Onode [Radoslaw Zarzynski, 2024-12-17, 2 files, -16/+10]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | kv: avoid memcpy around key() in OMAP iterator of KeyValueDB [Radoslaw Zarzynski, 2024-12-17, 2 files, -4/+18]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os/bluestore: bring latency logging to omap_iterate() [Radoslaw Zarzynski, 2024-12-17, 1 file, -1/+18]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os/bluestore: implement the lightweight OMAP iteration [Radoslaw Zarzynski, 2024-12-17, 2 files, -0/+77]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | kv: avoid memcpy in OMAP iterator of KeyValueDB [Radoslaw Zarzynski, 2024-12-17, 7 files, -0/+27]
| | |   ```
| | |   - 63.07% _ZN12PrimaryLogPG19prepare_transactionEPNS_9OpContextE
| | |   - 63.06% _ZN12PrimaryLogPG10do_osd_opsEPNS_9OpContextERSt6vectorI5OSDOpSaIS3_EE
| | |   - 20.19% _ZN9BlueStore16OmapIteratorImpl4nextEv
| | |   - 12.21% _ZN14CFIteratorImpl4nextEv
| | |   + 10.56% _ZN7rocksdb6DBIter4NextEv
| | |   1.02% _ZN7rocksdb18ArenaWrappedDBIter4NextEv
| | |   + 3.11% clock_gettime@@GLIBC_2.17
| | |   + 2.44% _ZN9BlueStore11log_latencyEPKciRKNSt6chrono8durationImSt5ratioILl1ELl1000000000EEEEdS1_i
| | |   0.78% pthread_rwlock_rdlock@plt
| | |   0.69% pthread_rwlock_unlock@plt
| | |   - 14.28% _ZN9BlueStore16OmapIteratorImpl5valueEv
| | |   - 11.60% _ZN14CFIteratorImpl5valueEv
| | |   - 11.41% _ZL13to_bufferlistN7rocksdb5SliceE
| | |   - 10.50% _ZN4ceph6buffer7v15_2_03ptrC1EPKcj
| | |   - _ZN4ceph6buffer7v15_2_04copyEPKcj
| | |   - 10.01% _ZN4ceph6buffer7v15_2_014create_alignedEjj
| | |   - _ZN4ceph6buffer7v15_2_025create_aligned_in_mempoolEjji
| | |   5.27% _ZN7mempool6pool_t12adjust_countEll
| | |   + 3.72% tc_posix_memalign
| | |   0.54% _ZN4ceph6buffer7v15_2_04list6appendEONS1_3ptrE
| | |   1.25% pthread_rwlock_rdlock@plt
| | |   0.90% pthread_rwlock_unlock@plt
| | |   ```
| | |
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| * | os, osd: introduce a lightweight OMAP iteration [Radoslaw Zarzynski, 2024-12-17, 1 file, -0/+45]
| | |   Signed-off-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
* | | Merge pull request #45384 from ifed01/wip-ifed-fragmentation-command [Igor Fedotov, 2024-12-16, 1 file, -1/+1]
|\ \ \
| | | |   tool/ceph-bluestore-tool: fix wrong keyword for 'free-fragmentation' …
| | | |
| | | |   Reviewed-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | tool/ceph-bluestore-tool: fix wrong keyword for 'free-fragmentation' command. [Igor Fedotov, 2022-03-14, 1 file, -1/+1]
| | | |   Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | crimson: add missing includes [Max Kellermann, 2024-12-10, 1 file, -0/+7]
| | | |   Signed-off-by: Max Kellermann <max.kellermann@ionos.com>
* | | | Merge pull request #59481 from ifed01/wip-ifed-more-info-in-slow-op-log [SrinivasaBharathKanta, 2024-11-04, 2 files, -4/+48]
|\ \ \ \
| | | | |   os/bluestore: log txc details in slow op notification on committed_kv
| * | | | os/bluestore: log max throttle cost and txc count on slow op. [Igor Fedotov, 2024-09-30, 2 files, -4/+41]
| | | | |   Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | os/bluestore: log additional txc info for slow op warning on [Igor Fedotov, 2024-09-24, 1 file, -1/+8]
| | | | |   kv_committed.
| | | | |
| | | | |   This might be helpful to troubleshoot issues with slow ops caused by bulky client transactions.
| | | | |
| | | | |   Related-to: https://tracker.ceph.com/issues/67339
| | | | |   Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | | | Merge pull request #59838 from cbodley/wip-68083 [Yuri Weinstein, 2024-10-30, 1 file, -201/+0]
|\ \ \ \ \
| | | | | |   os: remove unused btrfs_ioctl.h and tests
| | | | | |
| | | | | |   Reviewed-by: Patrick Donnelly <pdonnell@redhat.com>
| | | | | |   Reviewed-by: Radoslaw Zarzynski <rzarzyns@redhat.com>
| | | | | |   Reviewed-by: Anthony D'Atri <anthony.datri@gmail.com>
| * | | | | os: remove unused btrfs_ioctl.h and tests [Casey Bodley, 2024-09-17, 1 file, -201/+0]
| |/ / / /
| | | | |   remove unused header whose GPL license was potentially problematic
| | | | |
| | | | |   Fixes: https://tracker.ceph.com/issues/68083
| | | | |   Signed-off-by: Casey Bodley <cbodley@redhat.com>
* | | | | Merge pull request #60258 from aclamk/wip-aclamk-cbt-improve-show-label [Adam Kupczyk, 2024-10-23, 1 file, -5/+10]
|\ \ \ \ \
| | | | | |   os/bluestore/ceph-bluestore-tool: Modify show-label for many devs
| * | | | | os/bluestore/ceph-bluestore-tool: Modify show-label for many devs [Adam Kupczyk, 2024-10-11, 1 file, -5/+10]
| | |_|/ /
| |/| | |
| | | | |   It was possible to give multiple devices to cbt:
| | | | |   > ceph-bluestore-tool show-label --dev /dev/sda --dev /dev/sdb
| | | | |   But if any of the devices could not provide a valid label, nothing was printed.
| | | | |
| | | | |   Now, always print results. Unreadable labels are output as empty dictionaries.
| | | | |   Exit code:
| | | | |   - 0 if any label was properly read
| | | | |   - 1 if all labels failed
| | | | |
| | | | |   Fixes: https://tracker.ceph.com/issues/68505
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
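
The output and exit-code rule described in this message can be sketched as below. This is an illustration only, not code from ceph-bluestore-tool: `Label`, `DeviceRead`, `ShowLabelOutcome` and `summarize_labels` are hypothetical names, and returning 1 for an empty device list is an assumption the message does not cover.

```cpp
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical model: a label is a flat key/value dictionary.
using Label = std::map<std::string, std::string>;
// A device path paired with its label, or std::nullopt if reading failed.
using DeviceRead = std::pair<std::string, std::optional<Label>>;

struct ShowLabelOutcome {
  std::map<std::string, Label> printed;  // one entry per device, always
  int exit_code;                         // 0 if any label was read, else 1
};

ShowLabelOutcome summarize_labels(const std::vector<DeviceRead>& reads) {
  ShowLabelOutcome out{{}, 1};  // assume failure until one label is read
  for (const auto& [dev, label] : reads) {
    if (label) {
      out.printed[dev] = *label;
      out.exit_code = 0;
    } else {
      out.printed[dev] = Label{};  // unreadable label: empty dictionary
    }
  }
  return out;
}
```

The point of the design is that one bad device no longer suppresses the output for the good ones, while a caller can still detect total failure from the exit code.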
* | | | | Merge pull request #60323 from aclamk/wip-aclamk-fix-68528 [Adam Kupczyk, 2024-10-18, 1 file, -6/+3]
|\ \ \ \ \
| | | | | |   os/bluestore: Fix repair of multilabel when collides with BlueFS
| * | | | | os/bluestore: Fix repair of multilabel when collides with BlueFS [Adam Kupczyk, 2024-10-16, 1 file, -6/+3]
| | | | | |   The problem was that BDEV_FIRST_LABEL_POSITION was removed from bdev_label_valid_locations set. Now, if label at BDEV_FIRST_LABEL_POSITION is valid, it is in the set.
| | | | | |
| | | | | |   Fixes: https://tracker.ceph.com/issues/68528
| | | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | | | | | Merge pull request #59782 from aclamk/wip-aclamk-fix-67596-allocmap [Adam Kupczyk, 2024-10-18, 1 file, -0/+8]
|\ \ \ \ \ \
| |/ / / / /
|/| | | | |
| | | | | |   os/bluestore: Fix ceph-bluestore-tool allocmap command
| * | | | | os/bluestore: Fix ceph-bluestore-tool allocmap command [Adam Kupczyk, 2024-09-25, 1 file, -0/+8]
| | |/ / /
| |/| | |
| | | | |   BlueStore::read_allocation_from_drive_for_bluestore_tool was not informed that multiple bdev labels can exist and reserve space. Comparison of real alloc vs recovered alloc was failing.
| | | | |
| | | | |   Fixes: https://tracker.ceph.com/issues/67596
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | | | | os/bluestore: Make truncate() drop unused allocations [Adam Kupczyk, 2024-10-08, 2 files, -23/+56]
| |/ / /
|/| | |
| | | |   Now truncate() drops unused allocations. Modified Close() in BlueRocksEnv to unconditionally call truncate.
| | | |
| | | |   Fixes: https://tracker.ceph.com/issues/68385
| | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | | | Merge pull request #58952 from YiteGu/add-perfcounter-for-blk-discard [Igor Fedotov, 2024-09-25, 3 files, -3/+7]
|\ \ \ \
| | | | |   blk/kerneldevice: add perfcounter for block async discard
| | | | |
| | | | |   Reviewed-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | os/bluestore: passing device type name parameter to kernel device [Yite Gu, 2024-08-08, 3 files, -3/+7]
| | | | |   Signed-off-by: Yite Gu <yitegu0@gmail.com>
* | | | | Merge pull request #59850 from aclamk/wip-aclamk-fix-67911-bdev-multi-label [Adam Kupczyk, 2024-09-25, 1 file, -4/+4]
|\ \ \ \ \
| | | | | |   os/bluestore: Fix BlueFS allocating bdev label reserved location.
| * | | | | os/bluestore: Move reservation of bdev label to proper place. [Adam Kupczyk, 2024-09-18, 1 file, -4/+4]
| | |/ / /
| |/| | |
| | | | |   Reservation (alloc->init_rm_free) was after reopening DB in r/w mode. This was a problem - as soon as DB is in r/w it can flush sst or compact, which will make allocations.
| | | | |
| | | | |   Fixes: https://tracker.ceph.com/issues/67911
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
* | | | | Merge pull request #59762 from aclamk/wip-aclamk-cbt-combined [Adam Kupczyk, 2024-09-25, 3 files, -5/+76]
|\ \ \ \ \
| |/ / / /
|/| | | |
| | | | |   ceph-bluestore-tool: Fixes for multilple bdev label
| * | | | tools/ceph-bluestore-tool: remove param zap_size [Adam Kupczyk, 2024-09-13, 3 files, -21/+11]
| | | | |   Make zapping precisely target block device labels.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | tools/ceph-bluestore-tool: Allow show-label even if OSD is running [Adam Kupczyk, 2024-09-12, 1 file, -0/+4]
| | | | |   ceph-volume needs to query the devices for `ceph-volume raw list`.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | tool/bluestore-tool: add zap_device command support [Igor Fedotov, 2024-09-06, 3 files, -1/+75]
| | | | |   Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | tools/ceph-bluestore-tool: fix "--yes-i-really-really-mean-it" option [Igor Fedotov, 2024-09-06, 1 file, -4/+7]
| | | | |   Fixes: https://tracker.ceph.com/issues/67926
| | | | |   Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | | os/bluestore: perfect comments in hybrid_allocator [wanglinke, 2024-09-10, 1 file, -1/+1]
|/ / / /
| | | |   co-author: Jrchyang Yu <yuzhiqiang_yewu@cmss.chinamobile.com>
| | | |   Signed-off-by: Wang Linke <wanglinke_yewu@cmss.chinamobile.com>
* | | | Merge pull request #54504 from aclamk/wip-aclamk-bs-refactor-write-path [Adam Kupczyk, 2024-08-13, 7 files, -18/+2073]
|\ \ \ \
| | | | |   os/bluestore: Recompression, part 2. New write path.
| * | | | os/bluestore: Write_v2 changes [Adam Kupczyk, 2024-08-07, 2 files, -6/+1]
| | | | |   4) remove Writer::shared_changed and use txc::shared_blobs directly
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Write_v2 changes [Adam Kupczyk, 2024-08-07, 2 files, -10/+11]
| | | | |   1) moved stats and blobs update to Writer::do_write
| | | | |   2) preallocate space in Writer::_split_data
| | | | |   3) fixed Writer::_write_expand_l that could check one extent too many
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Add conf.bluestore_write_v2_random [Adam Kupczyk, 2024-08-07, 1 file, -1/+6]
| | | | |   Added conf.bluestore_write_v2_random. This is useful only for testing. If set, it overrides value of bluestore_write_v2 with a random true/false selection. It is useful for v1 / v2 compatibility testing.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
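
The override described here amounts to a small decision function. The sketch below only mirrors the wording of the message; `effective_write_v2` is a hypothetical helper, and the 50/50 split and the use of `std::mt19937` are assumptions rather than details taken from the commit.

```cpp
#include <random>

// Hypothetical helper: decide which write path to use.
// write_v2        - configured value of bluestore_write_v2
// write_v2_random - configured value of bluestore_write_v2_random
// rng             - caller-supplied generator, so tests can seed it
bool effective_write_v2(bool write_v2, bool write_v2_random,
                        std::mt19937& rng) {
  if (write_v2_random) {
    // Testing mode: ignore write_v2 and pick true/false at random
    // (assumed to be an even split).
    return std::bernoulli_distribution(0.5)(rng);
  }
  return write_v2;
}
```

Passing the generator in keeps the sketch deterministic under a fixed seed, which is convenient when mixing v1 and v2 writes in a reproducible compatibility test.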
| * | | | os/bluestore: Add compression fallback [Adam Kupczyk, 2024-08-07, 1 file, -2/+6]
| | | | |   For write_v2 create fallback to write_v1 if compression is selected. This is temporary until compression dedicated to benefit from v2 is merged.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Writer, fix find_mutable_blob [Adam Kupczyk, 2024-08-07, 1 file, -3/+12]
| | | | |   1) Algorithm assumed that blob->blob_start() is aligned to csum size. It is true for blobs created by write_v2, but write_v1 can generate blob like: begin = 0x9000, size = 0x6000, csum = 0x2000.
| | | | |   2) Blobs with unused were selected even if those need to be expanded. This is illegal since we cannot expand unused.
| | | | |
| | | | |   Fixed blob selection algorithm.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Writer, improved calculation of need_size [Adam Kupczyk, 2024-08-07, 1 file, -1/+8]
| | | | |   More diligent calculation algorithm of need_size. Takes into account front and back alignment.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Writer, fix for clang [Adam Kupczyk, 2024-08-07, 1 file, -1/+1]
| | | | |   Clang fails at _construct_at().
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
| * | | | os/bluestore: Add Writer::_crop_allocs_to_io [Adam Kupczyk, 2024-08-07, 2 files, -23/+40]
| | | | |   Usually the data we put to disk is AU aligned. In weird cases like AU=16K we put less data than we allocated. _crop_allocs_to_io trims allocated extents into disk block extents to reflect real IO.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>
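
One way to picture the cropping is the sketch below. It is not the body of `Writer::_crop_allocs_to_io`: `AllocExtent` and `crop_allocs_to_io_sketch` are hypothetical names, and the sketch assumes that only the tail of the allocation is unused and that `io_length` is already a multiple of the disk block size.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for an allocated disk extent.
struct AllocExtent {
  uint64_t offset;
  uint64_t length;
};

// Keep only as much of the allocated extents as the IO really covers.
// Assumption: the unused part is at the tail of the allocation.
std::vector<AllocExtent> crop_allocs_to_io_sketch(
    const std::vector<AllocExtent>& allocs, uint64_t io_length) {
  std::vector<AllocExtent> out;
  uint64_t remaining = io_length;
  for (const auto& e : allocs) {
    if (remaining == 0) {
      break;  // everything after this point was allocated but not written
    }
    uint64_t take = std::min(e.length, remaining);
    out.push_back({e.offset, take});
    remaining -= take;
  }
  return out;
}
```

For example, with a 16K allocation unit, a 4K write into a single extent `{0x10000, 0x4000}` would be described for IO purposes as `{0x10000, 0x1000}`.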
| * | | | os/bluestore: Fix after rebase [Adam Kupczyk, 2024-08-07, 1 file, -20/+11]
| | | | |   BufferSpace is now with Onode, not Blob. Modify code to adapt to this change.
| | | | |
| | | | |   Signed-off-by: Adam Kupczyk <akupczyk@ibm.com>