path: root/src/os (follow)
Commit message | Author | Age | Files | Lines
* Merge pull request #44311 from ifed01/wip-ifed-cleanup-onode-pinYuri Weinstein2021-12-241-19/+3
|\ | | | | | | | | | | os/bluestore: get rid of fake onode nref increment for pinned entry Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * os/bluestore: get rid of fake onode nref increment for pinned entryIgor Fedotov2021-12-201-19/+3
| | | | | | | | | | | | | | Looks like this isn't necessary any more after fixing https://tracker.ceph.com/issues/53002 Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | Merge pull request #42896 from ifed01/wip-ifed-bluefs-improve-logYuri Weinstein2021-12-231-0/+2
|\ \ | | | | | | | | | | | | | | | os/bluestore: dump alloc unit size on bluefs allocation failure. Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | os/bluestore: dump alloc unit size on bluefs allocation failure.Igor Fedotov2021-08-231-0/+2
| | | | | | | | | | | | Signed-off-by: Igor Fedotov <ifedotov@suse.com>
* | | Merge pull request #43598 from ideepika/wip-opentelemetryJosh Durgin2021-12-221-2/+2
|\ \ \ | | | | | | | | | | | | | | | | | | | | migrate from using opentracing-cpp to opentelemetry-cpp static as distributed tracing API Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com> Reviewed-by: Josh Durgin <jdurgin@redhat.com>
| * | | src/*/CMakeLists: update jaeger-base > jaeger_baseDeepika Upadhyay2021-11-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | jaeger-base encapsulated dependency for opentelemetry tracing libraries, when linked will provide support for tracing for the ceph target. Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
* | | | Merge pull request #44334 from aclamk/wip-aclamk-better-bluestore-perfErnesto Puerta2021-12-221-3/+3
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | os/bluestore: Better readability of perf output Reviewed-by: Ernesto Puerta <epuertat@redhat.com> Reviewed-by: Igor Fedotov <ifedotov@suse.com> Reviewed-by: Laura Flores <lflores@redhat.com> Reviewed-by: Yuri Weinstein <yweins@redhat.com>
| * | | | os/bluestore: Better readability of perf outputAdam Kupczyk2021-12-161-3/+3
| | |_|/ | |/| | | | | | | | | | | | | | | | | | Get rid of bluestore_ prefix for some stats. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | Merge pull request #44317 from aclamk/aclamk-fix-bluefs-importNeha Ojha2021-12-213-3/+15
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | Fix ceph-bluestore-tool bluefs-import command Reviewed-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | os/bluestore/ceph-bluestore-tool: Fix bluefs-import commandAdam Kupczyk2021-12-163-3/+15
| | | | | | Modify the bluefs-import command so that it properly initializes allocators. Without initialized allocators, importing a file to bluefs overwrote some random data, including the first block on the device.
| | | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | | os/bluestore/bluefs: Add tracking of bluefs log in noop replay modeAdam Kupczyk2021-12-201-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Keep updating bluefs log when printing content of bluefs replay log. Without this modification we only have initial content of log. Log can be printed by 'ceph-bluestore-tool bluefs-log-dump'. Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | | os/bluestore/bluefs: Sync BlueFS log with its allocation deltaAdam Kupczyk2021-12-201-0/+1
| | | | | | BlueFS log is the only file that we can append to. When we append to a file, we must take previously committed allocations into account; otherwise the update will be miscalculated.
| | | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | | os/bluefs: allow incremental file metadata updates in bluefs logIgor Fedotov2021-12-203-8/+152
| |/ / / |/| | | | | | | | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | Merge pull request #44216 from locallocal/masterYuri Weinstein2021-12-151-7/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | os/bluestore: don't need separate variable to mark hits when lookup oid. Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * | | | os/bluestore: don't need separate variable to mark hits when lookup oid.locallocal2021-12-061-7/+2
| | | | | | | | | | | | | | | | | | | | Signed-off-by: locallocal <locallocal@163.com>
* | | | | Merge pull request #43770 from ifed01/wip-ifed-fix-53002Yuri Weinstein2021-12-152-5/+6
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | os/bluestore: avoid premature onode release. Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | | | | os/bluestore: avoid premature onode release.Igor Fedotov2021-12-142-5/+6
| | | | | | | This was observed when onode's removal is followed by reading and the latter causes object release before the removal is finalized. The root cause is an improper 'pinned' state assessment in Onode::get.
| | | | | | | More detailed overview is:
| | | | | | | At some point Onode::get() might face the case when nref == 2 and pinned = true, which means a parallel incomplete put is running on the onode: the ref count is decremented but the pinned state is still unmodified (and even the lock hasn't been acquired yet). This might finally result in two puts racing over the same onode with nref == 2, which finally results in a premature onode release:
| | | | | | | // nref =3, pinned = 1
| | | | | | | // Thread 1        Thread 2
| | | | | | | // o->put()        o->get()
| | | | | | | // --nref(n = 2, pinned=1)
| | | | | | | // nref++ (n=3, pinned = 1)
| | | | | | | // return
| | | | | | | // ...
| | | | | | | // o->put()
| | | | | | | // --nref(n = 2)
| | | | | | | // pinned = 0,
| | | | | | | // --nref(n = 1)
| | | | | | | // ocs->_unpin_and_rm(o) -> o->put()
| | | | | | | // ...
| | | | | | | // --nref(n = 0)
| | | | | | | // release o
| | | | | | | // o->c->get_onode_cache()
| | | | | | | // FAULT!
| | | | | | | //
| | | | | | | The suggested fix is to introduce an additional atomic counter tracking running put() functions, and to permit onode release only when both the regular nref and put_nref are equal to zero.
| | | | | | | Fixes: https://tracker.ceph.com/issues/53002
| | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
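The release condition described in the commit message above can be illustrated with a small C++ sketch. This is not the BlueStore code: `RefCounted`, its `released` flag, and the simplified `get()`/`put()` bodies are hypothetical, and pinning, cache locking, and memory-ordering details are left out. It only shows the idea that an object may be released once both the regular reference count and the count of running `put()` calls are zero, and it does not claim to be a complete thread-safe implementation.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical stand-in for an onode; only the two counters matter here.
struct RefCounted {
  std::atomic<int> nref{0};      // regular reference count
  std::atomic<int> put_nref{0};  // number of put() calls currently running
  bool released = false;         // stands in for actually freeing the object

  void get() { ++nref; }

  // Returns true if this call released the object.
  bool put() {
    ++put_nref;  // announce that a put() is in progress
    --nref;
    // ... unpin / cache bookkeeping that still touches *this would go here ...
    int running_puts = --put_nref;
    // Release only when no references remain and no other put() is running.
    if (nref == 0 && running_puts == 0) {
      released = true;
      return true;
    }
    return false;
  }
};
```

In this sketch a `put()` that drops the last reference still declines to release the object while another `put()` is between its decrement and its final check, which is the window the diagram above describes.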
* | | | | | Merge pull request #43857 from aclamk/wip-aclamk-omap-clone-assertYuri Weinstein2021-12-141-0/+3
|\ \ \ \ \ \ | |/ / / / / |/| | | | | | | | | | | | | | | | | os/bluestore: Protect _clone against sudden omap format changes Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * | | | | os/bluestore: Protect _clone against sudden omap format changesAdam Kupczyk2021-11-091-0/+3
| | | | | | | Added an assert to verify that omap prefixes between cloned objects are exactly the same. If they differ, rewrite_omap_key() may overwrite the user key portion of the data, or move part of the prefix into the user key.
| | | | | | | This is a follow-up to https://github.com/ceph/ceph/pull/43687
| | | | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | | | Merge pull request #44098 from ifed01/wip-ifed-dump-alloc-unitIgor Fedotov2021-12-084-2/+35
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | os/bluestore: dump bluestore/bluefs alloc unit sizes with perf dump Reviewed-by: Laura Flores <lflores@redhat.com>
| * | | | | | os/bluestore: dump bluefs alloc unit sizes with perf counters dumpIgor Fedotov2021-11-242-2/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | | | os/bluestore: dump bluestore's min_alloc_size with perf counters dumpIgor Fedotov2021-11-242-0/+7
| | |_|_|/ / | |/| | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | | | Merge pull request #43840 from ifed01/wip-ifed-verbose-open-colIgor Fedotov2021-12-081-1/+4
|\ \ \ \ \ \ | |_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | osd,bluestore: gracefully handle a failure during meta collection load Reviewed-by: jdurgin@redhat.com Reviewed-by: nojha@redhat.com
| * | | | | os/bluestore: report amount of loaded collections.Igor Fedotov2021-12-021-1/+4
| | |/ / / | |/| | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | | | BlueStore: Fix a bug when FSCK is invoked in mount()/umount()/mkfs() with DEEP optionGabriel BenHanokh2021-12-042-67/+77
| | | | | | Fixes: https://tracker.ceph.com/issues/53185
| | | | | | NCB mishandles fsck DEEP in the mount()/umount()/mkfs() case, causing it to remove the allocation file without destaging a new copy (which costs a full rebuild on startup).
| | | | | | There are also a few conflicting calls to open_db()/close_db() passing an inconsistent read-only flag.
| | | | | | We fix both issues by storing the open-db type (read-only/read-write) and using it for close-db (which no longer takes a read-only flag).
| | | | | | We also move the allocation-file destage to close-db so that it is refreshed after being removed by fsck and the like.
| | | | | | Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
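A minimal C++ sketch of the "remember how the DB was opened" idea from the commit message above. The class `DbHandle`, its `Mode` enum, and the return value of `close()` are hypothetical names chosen for illustration, not BlueStore's API. The rule that only a read-write open triggers the allocation-file destage is an assumption of this sketch, not something the commit message states.

```cpp
#include <cassert>

// Hypothetical wrapper; names are illustrative, not BlueStore's API.
class DbHandle {
 public:
  enum class Mode { Closed, ReadOnly, ReadWrite };

  // Record the mode once, at open time.
  void open(bool read_only) {
    mode_ = read_only ? Mode::ReadOnly : Mode::ReadWrite;
  }

  // No read-only argument: the mode recorded by open() decides.
  // Returns true when the caller should destage the allocation file
  // (assumed here to apply only to read-write opens).
  bool close() {
    bool destage = (mode_ == Mode::ReadWrite);
    mode_ = Mode::Closed;
    return destage;
  }

  Mode mode() const { return mode_; }

 private:
  Mode mode_ = Mode::Closed;
};
```

Because `close()` takes no flag, a caller cannot pass a read-only value that disagrees with the one used at open time.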
* | | | | Merge pull request #43691 from curtbruns/use_optimal_for_min_allocYuri Weinstein2021-12-032-1/+21
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | os/bluestore: Set min_alloc_size to optimal io size Reviewed-by: Sage Weil <sage@redhat.com> Reviewed-by: Igor Fedotov <ifedotov@suse.com> Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | | | | os/bluestore: Set min_alloc_size to optimal io sizeCurt Bruns2021-11-052-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Block devices may report an "optimal_io_size" that is different than the typical 4KiB. To optimize BlueStore for this io size, the allocator needs to set its min_alloc_size to this optimal_io_size. This PR adds the discovery of the optimal_io_size for a block device and an option to use the optimal_io_size as the min_alloc_size for the bluestore allocator. Older devices may report an optimal_io_size of 0 and if that is the case, the default config min_alloc_size is used. Signed-off-by: Curt Bruns <curt.e.bruns@gmail.com>
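The selection rule in the commit message above can be sketched as a small C++ function. The function name and parameters are hypothetical; only the behaviour follows the message: use the device's optimal_io_size when the option is enabled, and fall back to the configured default when the device reports 0.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the actual BlueStore code.
// optimal_io_size:      value reported by the block device (0 when unknown).
// configured_min_alloc: default min_alloc_size taken from configuration.
// use_optimal:          the opt-in switch described in the commit message.
inline std::uint64_t pick_min_alloc_size(std::uint64_t optimal_io_size,
                                         std::uint64_t configured_min_alloc,
                                         bool use_optimal) {
  if (use_optimal && optimal_io_size != 0) {
    return optimal_io_size;
  }
  // Option disabled, or an older device that reports 0.
  return configured_min_alloc;
}
```

For example, with a configured default of 4096, a device reporting 16384 yields 16384 when the option is on, and a device reporting 0 yields 4096.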
* | | | | | Merge pull request #43336 from ifed01/wip-fix-bluefs-volumes-opsNeha Ojha2021-12-021-0/+3
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | qa/osd-bluefs-volume-ops: fix bluefs volumes ops test case Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | | | | | os/bluestore: cleanup onode/collection cache after NCB recovery.Igor Fedotov2021-11-021-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | One needs to properly shutdown Onode cache if NCB stuff has performed full recovery. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
* | | | | | | Merge pull request #44089 from benhanokh/ncb_fsck_fixbenhanokh2021-12-021-1/+1
|\ \ \ \ \ \ \ | |_|_|_|_|/ / |/| | | | | | os/bluestore: bug-fix for NCB-FSCK
| * | | | | | Bug-Fix for PR-42762Gabriel BenHanokh2021-11-241-1/+1
| | |_|_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | Use the function local allocator instead of the global shared-allocator Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
* | | | | | Merge pull request #43774 from aclamk/fix-bluefs-truncateNeha Ojha2021-11-241-0/+1
|\ \ \ \ \ \ | |/ / / / / |/| | | | | | | | | | | | | | | | | Fix data corruption in bluefs truncate() Reviewed-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | | os/bluestore/bluefs: Fix data corruption in truncate()Adam Kupczyk2021-11-031-0/+1
| |/ / / /
| | | | | It is possible to create a condition in which BlueFS contains a corrupted file. It can happen when the BlueFS replay log is on device A and we just wrote to device B and truncated the file.
| | | | | Scenario:
| | | | | 1) write to file h1 on SLOW device
| | | | | 2) flush h1 (initiate transfer, but no fdatasync yet)
| | | | | 3) truncate h1
| | | | | 4) write to file h2 on DB
| | | | | 5) fsync h2 (forces replay log to be written, after fdatasync to DB)
| | | | | 6) poweroff
| | | | | Fixes: https://tracker.ceph.com/issues/53129
| | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
* | | | | Merge pull request #41557 from ifed01/wip-ifed-better-daemonperfIgor Fedotov2021-11-194-215/+453
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | os/bluestore: improve usability for bluestore/bluefs perf counters Signed-off-by: Igor Fedotov <igor.fedotov@croit.io> Reviewed-by: Laura Flores lflores@redhat.com
| * | | | | os/bluestore: adjust usefullness tag for bluestore perf countersIgor Fedotov2021-11-161-12/+42
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | | os/bluestore: raise usefullness tag for some bluefs perf countersIgor Fedotov2021-11-161-14/+34
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | | os/bluestore: unify bluefs read perf counter short namesIgor Fedotov2021-11-101-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
| * | | | | os/bluestore: cleanup around deferred_write perf counters.Igor Fedotov2021-11-102-18/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Better naming and description. Signed-off-by: Igor Fedotov <ifedotov@suse.com>
| * | | | | os/bluestore: dump bluefs read_disk[_slow] stats with daemonperf cmdIgor Fedotov2021-11-101-16/+31
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <ifedotov@suse.com>
| * | | | | os/bluestore: more bluefs performance countersIgor Fedotov2021-11-102-30/+98
| | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Igor Fedotov <ifedotov@suse.com>
| * | | | | os/bluestore: improve daemonperf output for BlueFS.Igor Fedotov2021-11-101-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Print slow writing and reading volumes Signed-off-by: Igor Fedotov <ifedotov@suse.com>
| * | | | | os/bluestore: group perf counters, improve daemonperf outputIgor Fedotov2021-11-102-155/+252
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | Improves both general usability and daemonperf output Signed-off-by: Igor Fedotov <ifedotov@suse.com>
* | | | | os/bluestore: Fix omap upgrade to per-pg schemeAdam Kupczyk2021-11-171-1/+1
| | | | | | This is a fix for a regression introduced by the fix to omap upgrade: https://github.com/ceph/ceph/pull/43687
| | | | | | The problem was that we always skipped the first omap entry. This worked fine for objects that have an omap header key; for objects without a header key, we skipped the first actual omap key.
| | | | | | Fixes: https://tracker.ceph.com/issues/53260
| | | | | | Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
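The regression described above comes down to an unconditional skip of the first entry. A hedged C++ sketch of the corrected behaviour follows. The `OmapEntry` record and `user_keys()` helper are hypothetical and do not reflect BlueStore's key encoding; the sketch also assumes the header, when present, is the first entry.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical record: is_header marks the optional omap header entry.
struct OmapEntry {
  bool is_header;
  std::string key;
};

// Collect the user keys to convert. The header, when present, is assumed
// to be the first entry; it is skipped only if it is actually there.
inline std::vector<std::string> user_keys(const std::vector<OmapEntry>& entries) {
  std::vector<std::string> keys;
  std::size_t start = (!entries.empty() && entries.front().is_header) ? 1 : 0;
  for (std::size_t i = start; i < entries.size(); ++i) {
    keys.push_back(entries[i].key);
  }
  return keys;
}
```

With an unconditional `start = 1`, an object without a header would lose its first real key, which is the behaviour the commit message reports.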
* | | | | Merge pull request #43818 from ifed01/wip-ifed-fix-vol-selectYuri Weinstein2021-11-163-5/+30
|\ \ \ \ \ | |/ / / / |/| | | | | | | | | | | | | | os/bluestore: do not select absent device in volume selector Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | | | os/bluestore: do not select absent device in volume selectorIgor Fedotov2021-11-053-5/+30
| |/ / /
| | | | Fixes: https://tracker.ceph.com/issues/53139
| | | | Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | Merge pull request #43621 from ifed01/wip-ifed-fix-53011Yuri Weinstein2021-11-081-1/+1
|\ \ \ \ | |_|/ / |/| | | | | | | | | | | | | | | os/bluestore: use proper prefix when removing undecodable Share Blob. Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
| * | | os/bluestore: use proper prefix when removing undecodable Share Blob.Igor Fedotov2021-10-211-1/+1
| | | | | | | | | | | | | | | | | | | | Fixes: https://tracker.ceph.com/issues/53011 Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
* | | | Merge PR #42762 into masterSage Weil2021-11-0217-677/+1292
|\ \ \ \
| |_|/ /
|/| | |
| | | | * refs/pull/42762/head:
| | | |     ceph_test_objectstore: skip BlueStoreUnshareBlobTest with SMR
| | | |     os/bluestore: debug ExtentMap::update()
| | | |     os/bluestore: _txc_create inside of alloc_and_submit_lock
| | | |     os/bluestore: fix cleaner race with collection removal
| | | |     os/bluestore: add missing ' ' to LruOnodeCacheShare _[un]pin
| | | |     os/bluestore: use simpler map<> to track (onode, zone) -> offset
| | | |     os/bluestore: avoid casting zoned implementations again
| | | |     os/bluestore/ZonedFreelistManager: remove sanity checks
| | | |     os/bluestore/ZonedAllocator: fix allocate() search
| | | |     os/bluestore: drain transactions on cleaner zone finish
| | | |     os/bluestore/ZonedFreelistManager: simplify freelist merge update vs zone reset
| | | |     os/bluetore: configurable sleep period for cleaner
| | | |     blk/zoned: make discard a no-op
| | | |     os/bluestore/ZonedAllocator: count sequential only as 'free'
| | | |     os/bluestore: expect smr fields IFF device is smr
| | | |     ceph_test_objectstore: Test for fixing write pointer
| | | |     ceph_test_objectstore: complain if SMR support not compiled in
| | | |     test/objectstore/run_smr_bluestore_test.sh
| | | |     os/bluestore/ZonedAllocator: handle alloc/release spanning zones
| | | |     os/bluestore: simple cleaner
| | | |     os/bluestore: be smarter about picking a zone to clean
| | | |     os/bluestore: avoid writes to cleaning zone
| | | |     os/bluestore/HybridAllocator: whitespace in debug output
| | | |     os/bluestore: give conventional region of SMR to bluefs
| | | |     os/bluestore: separate alloc pointer from shared_alloc.a
| | | |     test/objectstore/run_smr_bluestore_test.sh
| | | |     ceph_test_objectstore: skip tests that don't work on SMR
| | | |     os/bluestore: disable cleaner thread until it is implemented
| | | |     os/bluestore: fsck verify zone refs
| | | |     os/bluestore: include object in zone ref keys
| | | |     os/bluestore: refactor object key helpers a bit
| | | |     ceph_test_objectstore: skip failing tests on SMR
| | | |     os/bluestore: report mismatch write pointer during fsck
| | | |     os/bluestore: simplify zone to clean selection
| | | |     ceph_test_objectstore: add trivial fsck test
| | | |     os/bluestore: fsck smr allocations (verify num_dead_bytes, alloc past write pointer)
| | | |     os/bluestore: duplicate zone refs when cloning
| | | |     os/bluestore: correct zoned freelist when device write pointers are ahead
| | | |     os/bluestore/ZonedFreelistManager: whitespace
| | | |     os/bluestore: fix startup vs device write pointers
| | | |     blk/zoned: add get_zones() to fetch write pointers
| | | |     os/bluestore: use 64 bit values for zone_state_t
| | | |     os/bluestore: reimplement zone backrefs
| | | |     os/bluestore: fix smr allocator init
| | | |     os/bluestore: do not use null freelist with SMR
| | | |     blk/zones: implement HMSMRDevice has KernelDevice child
| | | |     os/bluestore: fix/simplify zoned_cleaner thread start error handling
| | | |     os/bluestore: properly reset zoned allocator on startup
| | | |     os/bluestore: force prefer_deferred_size=0 for smr
| | | |     os/bluestore: drop SMR 64K min_alloc_size restriction
| | | |     os/bluestore/ZonedAllocator: less verbose
| | | |     os/bluestore/ZonedAllocator: simplify debug output prefix
| | | |     os/bluestore/ZonedAllocator: be consistent with hex debug output
| | | |     os/bluestore/ZonedAllocator: whitespace
| | | |     blk/zoned: remove dead VDO code
| | | |     blk/zoned: add reset_all_zones()
| | | |     blk/zoned: print error during init
| | | |     os/bluestore: adjust allocator+freelist interfaces for smr params
| | | |     os/bluestore: select 'zoned' freelistmanager during mkfs, not mount
| | | | Reviewed-by: Igor Fedotov <ifedotov@suse.com>
| * | | os/bluestore: debug ExtentMap::update()Sage Weil2021-10-291-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | I hit a case where the shard key size didn't match, and it looked as though this code somehow didn't get executed. Signed-off-by: Sage Weil <sage@newdream.net>
| * | | os/bluestore: _txc_create inside of alloc_and_submit_lockSage Weil2021-10-291-4/+5
| | | | | Create the transaction inside of the SMR lock. Otherwise, we may get a deadlock between the cleaner C and a normal write op W:
| | | | |
| | | | |    W                         C
| | | | |    _txc_create seq 1
| | | | |                              lock alloc_and_submit
| | | | |                              _txc_create seq 2
| | | | |                              ...
| | | | |                              unlock alloc_and_submit
| | | | |    lock alloc_and_submit
| | | | |    ...
| | | | |    block on flush            _txc_finish_io, but blocked by seq 1
| | | | |    <deadlock>
| | | | |
| | | | | The root issue here is the txc's are misordered with respect to the alloc_and_submit lock. Fix by moving the _txc_create inside the lock!
| | | | | Signed-off-by: Sage Weil <sage@newdream.net>
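The ordering fix above can be sketched in C++ as follows. The `Submitter` type and `create_and_submit()` are hypothetical and far simpler than BlueStore's transaction contexts; the point is only that the sequence number is assigned under the same lock that covers allocation and submission, so sequence order and lock order cannot disagree.

```cpp
#include <cassert>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical, simplified model of "create the transaction inside the lock".
struct Submitter {
  std::mutex alloc_and_submit_lock;
  std::uint64_t next_seq = 1;
  std::vector<std::uint64_t> submitted;  // order in which work was submitted

  std::uint64_t create_and_submit() {
    std::lock_guard<std::mutex> guard(alloc_and_submit_lock);
    // Sequence number is assigned while holding the lock ...
    std::uint64_t seq = next_seq++;
    // ... and allocation + submission happen before it is dropped,
    // so a later seq can never be submitted ahead of an earlier one.
    submitted.push_back(seq);
    return seq;
  }
};
```

If the sequence number were taken before acquiring the lock, a second caller could take the lock first and submit seq 2 ahead of seq 1, which is the misordering shown in the diagram.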