| Commit message | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| | |
os/bluestore: get rid of fake onode nref increment for pinned entry
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |
| |
| |
| |
| |
| |
| | |
Looks like this isn't necessary any more after fixing
https://tracker.ceph.com/issues/53002
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|\ \
| | |
| | |
| | |
| | |
| | | |
os/bluestore: dump alloc unit size on bluefs allocation failure.
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | | |
migrate from using opentracing-cpp to opentelemetry-cpp static as distributed tracing API
Reviewed-by: Yuval Lifshitz <ylifshit@redhat.com>
Reviewed-by: Josh Durgin <jdurgin@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
jaeger-base encapsulates the dependency on the opentelemetry tracing libraries;
when linked, it provides tracing support for the ceph target.
Signed-off-by: Deepika Upadhyay <dupadhya@redhat.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
os/bluestore: Better readability of perf output
Reviewed-by: Ernesto Puerta <epuertat@redhat.com>
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Reviewed-by: Laura Flores <lflores@redhat.com>
Reviewed-by: Yuri Weinstein <yweins@redhat.com>
|
| | |_|/
| |/| |
| | | |
| | | |
| | | |
| | | | |
Get rid of bluestore_ prefix for some stats.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | | |
Fix ceph-bluestore-tool bluefs-import command
Reviewed-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Modify the bluefs-import command so that it properly initializes allocators.
Without initialized allocators, importing a file to BlueFS overwrote random data,
including the first block on the device.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Keep updating bluefs log when printing content of bluefs replay log.
Without this modification we only have initial content of log.
Log can be printed by 'ceph-bluestore-tool bluefs-log-dump'.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
BlueFS log is the only file that we can append to.
When we append to a file, we must take previously committed allocations into account;
otherwise the update will be miscalculated.
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |/ / /
|/| | |
| | | |
| | | | |
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | | |
os/bluestore: don't need separate variable to mark hits when lookup oid.
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: locallocal <locallocal@163.com>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
os/bluestore: avoid premature onode release.
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | | | | |
This was observed when an onode's removal is followed by a read,
and the latter causes the object to be released before the removal is finalized.
The root cause is an improper 'pinned' state assessment in Onode::get().
In more detail:
At some point Onode::get() might face the case when nref == 2 and pinned = true,
which means a parallel, incomplete put() is running on the onode: the ref count is
decremented but the pinned state is still unmodified (and the lock hasn't even been
acquired yet).
This might result in two puts racing over the same onode with nref == 2,
which leads to a premature onode release:
// nref =3, pinned = 1
// Thread 1 Thread 2
// o->put() o->get()
// --nref(n = 2, pinned=1)
// nref++ (n=3, pinned = 1)
// return
// ...
// o->put()
// --nref(n = 2)
// pinned = 0,
// --nref(n = 1)
// ocs->_unpin_and_rm(o) -> o->put()
// ...
// --nref(n = 0)
// release o
// o->c->get_onode_cache()
// FAULT!
//
The suggested fix is to introduce an additional atomic counter that tracks
running put() calls, and to permit onode release only when both the regular
nref and put_nref are equal to zero.
Fixes: https://tracker.ceph.com/issues/53002
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
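
The counter-based fix described above can be sketched as follows. This is a minimal illustration, not the BlueStore code: the `RefCounted` type and the `released` flag are hypothetical, and the flag stands in for the actual object destruction so the behaviour can be observed.

```cpp
#include <atomic>

// Hypothetical stand-in for an onode; not the BlueStore class.
struct RefCounted {
  std::atomic<int> nref{0};      // regular references
  std::atomic<int> put_nref{0};  // put() calls currently in flight
  std::atomic<bool>* released;   // set where a real object would be freed

  explicit RefCounted(std::atomic<bool>* flag) : released(flag) {}

  void get() { ++nref; }

  void put() {
    ++put_nref;  // announce the running put() before touching nref
    --nref;
    // ... unpin and cache bookkeeping would happen here ...
    int pn = --put_nref;
    // Release only when no references remain AND no other put() is running.
    if (nref.load() == 0 && pn == 0) {
      released->store(true);
    }
  }
};
```

With two concurrent put() calls, only the one that brings put_nref down to zero can observe the release condition, so an in-flight put() never sees the object disappear underneath it.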
|
|\ \ \ \ \ \
| |/ / / / /
|/| | | | |
| | | | | |
| | | | | | |
os/bluestore: Protect _clone against sudden omap format changes
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | | | |
Added an assert to verify that omap prefixes between cloned objects are exactly the same.
If they differed, rewrite_omap_key() could overwrite the user key portion of the data
or move part of the prefix into the user key.
This is a follow up from
https://github.com/ceph/ceph/pull/43687
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
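
A rough sketch of why the prefixes must match. The `rewrite_key` helper below is hypothetical and only loosely modelled on the behaviour described above; it assumes the new prefix simply overwrites the first bytes of the key, and it checks prefix length only.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not the actual rewrite_omap_key() signature.
// The new prefix replaces the first new_prefix.size() bytes of the key.
std::string rewrite_key(const std::string& old_prefix,
                        const std::string& new_prefix,
                        const std::string& key) {
  // Guard: with prefixes of different length, the user part of the key
  // would be truncated or polluted with leftover prefix bytes.
  assert(old_prefix.size() == new_prefix.size());
  return new_prefix + key.substr(new_prefix.size());
}
```

Without the guard, rewriting the key "AAuser" from prefix "AA" to prefix "BBBB" would yield "BBBBer", silently dropping part of the user key.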
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
os/bluestore: dump bluestore/bluefs alloc unit sizes with perf dump
Reviewed-by: Laura Flores <lflores@redhat.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | |_|_|/ /
| |/| | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|\ \ \ \ \ \
| |_|_|/ / /
|/| | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
osd,bluestore: gracefully handle a failure during meta collection load
Reviewed-by: jdurgin@redhat.com
Reviewed-by: nojha@redhat.com
|
| | |/ / /
| |/| | |
| | | | |
| | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | | | | |
DEEP option
Fixes: https://tracker.ceph.com/issues/53185
NCB mishandles fsck DEEP in the mount()/umount()/mkfs() case, causing it to remove the allocation-file without destaging a new copy (which will cost us a full rebuild on startup).
There are also a few conflicting calls to open_db()/close_db() passing an inconsistent read-only flag.
We fix both issues by storing the open-db type (read-only/read-write) and using it for close-db (which won't pass a read-only flag anymore).
We also move the allocation-file destage to close-db so it will be refreshed after being removed by fsck and such.
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
os/bluestore: Set min_alloc_size to optimal io size
Reviewed-by: Sage Weil <sage@redhat.com>
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | | | | |
Block devices may report an "optimal_io_size" that is different than the
typical 4KiB. To optimize BlueStore for this io size, the allocator
needs to set its min_alloc_size to this optimal_io_size. This PR adds
the discovery of the optimal_io_size for a block device and an option
to use the optimal_io_size as the min_alloc_size for the bluestore allocator.
Older devices may report an optimal_io_size of 0 and if that is the
case, the default config min_alloc_size is used.
Signed-off-by: Curt Bruns <curt.e.bruns@gmail.com>
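
The selection rule described above can be sketched as a small function. The function and parameter names are illustrative assumptions, not the actual Ceph option names or code.

```cpp
#include <cstdint>

// Hypothetical helper; names are illustrative, not Ceph config options.
uint64_t pick_min_alloc_size(uint64_t optimal_io_size,
                             uint64_t configured_min_alloc_size,
                             bool use_optimal_io_size) {
  // Older devices may report 0; fall back to the configured default then.
  if (use_optimal_io_size && optimal_io_size != 0) {
    return optimal_io_size;
  }
  return configured_min_alloc_size;
}
```

For example, a device reporting 16 KiB with the option enabled would get a 16 KiB min_alloc_size, while a device reporting 0 keeps the configured value.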
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
qa/osd-bluefs-volume-ops: fix bluefs volumes ops test case
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
One needs to properly shut down the Onode cache if NCB has performed
a full recovery.
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
|\ \ \ \ \ \ \
| |_|_|_|_|/ /
|/| | | | | | |
os/bluestore: bug-fix for NCB-FSCK
|
| | |_|_|/ /
| |/| | | |
| | | | | |
| | | | | |
| | | | | | |
Use the function local allocator instead of the global shared-allocator
Signed-off-by: Gabriel Benhanokh <gbenhano@redhat.com>
|
|\ \ \ \ \ \
| |/ / / / /
|/| | | | |
| | | | | |
| | | | | | |
Fix data corruption in bluefs truncate()
Reviewed-by: Igor Fedotov <igor.fedotov@croit.io>
|
| |/ / / /
| | | | | |
It is possible to create a condition in which BlueFS contains a corrupted file.
It can happen when the BlueFS replay log is on device A and we have just written to device B and truncated the file.
Scenario:
1) write to file h1 on SLOW device
2) flush h1 (initiate transfer, but no fdatasync yet)
3) truncate h1
4) write to file h2 on DB
5) fsync h2 (forces replay log to be written, after fdatasync to DB)
6) poweroff
Fixes: https://tracker.ceph.com/issues/53129
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
os/bluestore: improve usability for bluestore/bluefs perf counters
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
Reviewed-by: Laura Flores <lflores@redhat.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Better naming and description.
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Print slow writing and reading volumes
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
| | |_|/ /
| |/| | |
| | | | |
| | | | |
| | | | |
| | | | | |
Improves both general usability and daemonperf output
Signed-off-by: Igor Fedotov <ifedotov@suse.com>
|
| | | | | |
This is a fix for a regression introduced by the omap upgrade fix: https://github.com/ceph/ceph/pull/43687
The problem was that we always skipped the first omap entry.
This worked fine for objects that have an omap header key.
For objects without a header key, we skipped the first actual omap key.
Fixes: https://tracker.ceph.com/issues/53260
Signed-off-by: Adam Kupczyk <akupczyk@redhat.com>
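
The difference between "always skip the first entry" and "skip the header only if present" can be sketched like this. The `OmapEntry` type and its `is_header` flag are hypothetical simplifications, not the BlueStore key layout.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical simplified omap entry.
struct OmapEntry {
  std::string key;
  bool is_header;  // true only for the optional omap header entry
};

std::vector<std::string> collect_user_keys(const std::vector<OmapEntry>& entries) {
  // Skip the first entry only when it really is the header.
  std::size_t start = (!entries.empty() && entries.front().is_header) ? 1 : 0;
  std::vector<std::string> keys;
  for (std::size_t i = start; i < entries.size(); ++i) {
    keys.push_back(entries[i].key);
  }
  return keys;
}
```

An unconditional `start = 1` would reproduce the regression: objects without a header would lose their first user key.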
|
|\ \ \ \ \
| |/ / / /
|/| | | |
| | | | |
| | | | | |
os/bluestore: do not select absent device in volume selector
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| |/ / /
| | | |
| | | |
| | | |
| | | | |
Fixes: https://tracker.ceph.com/issues/53139
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|\ \ \ \
| |_|/ /
|/| | |
| | | |
| | | |
| | | | |
os/bluestore: use proper prefix when removing undecodable Share Blob.
Reviewed-by: Adam Kupczyk <akupczyk@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | | |
Fixes: https://tracker.ceph.com/issues/53011
Signed-off-by: Igor Fedotov <igor.fedotov@croit.io>
|
|\ \ \ \
| |_|/ /
|/| | |
| | | | |
* refs/pull/42762/head:
ceph_test_objectstore: skip BlueStoreUnshareBlobTest with SMR
os/bluestore: debug ExtentMap::update()
os/bluestore: _txc_create inside of alloc_and_submit_lock
os/bluestore: fix cleaner race with collection removal
os/bluestore: add missing ' ' to LruOnodeCacheShare _[un]pin
os/bluestore: use simpler map<> to track (onode, zone) -> offset
os/bluestore: avoid casting zoned implementations again
os/bluestore/ZonedFreelistManager: remove sanity checks
os/bluestore/ZonedAllocator: fix allocate() search
os/bluestore: drain transactions on cleaner zone finish
os/bluestore/ZonedFreelistManager: simplify freelist merge update vs zone reset
os/bluetore: configurable sleep period for cleaner
blk/zoned: make discard a no-op
os/bluestore/ZonedAllocator: count sequential only as 'free'
os/bluestore: expect smr fields IFF device is smr
ceph_test_objectstore: Test for fixing write pointer
ceph_test_objectstore: complain if SMR support not compiled in
test/objectstore/run_smr_bluestore_test.sh
os/bluestore/ZonedAllocator: handle alloc/release spanning zones
os/bluestore: simple cleaner
os/bluestore: be smarter about picking a zone to clean
os/bluestore: avoid writes to cleaning zone
os/bluestore/HybridAllocator: whitespace in debug output
os/bluestore: give conventional region of SMR to bluefs
os/bluestore: separate alloc pointer from shared_alloc.a
test/objectstore/run_smr_bluestore_test.sh
ceph_test_objectstore: skip tests that don't work on SMR
os/bluestore: disable cleaner thread until it is implemented
os/bluestore: fsck verify zone refs
os/bluestore: include object in zone ref keys
os/bluestore: refactor object key helpers a bit
ceph_test_objectstore: skip failing tests on SMR
os/bluestore: report mismatch write pointer during fsck
os/bluestore: simplify zone to clean selection
ceph_test_objectstore: add trivial fsck test
os/bluestore: fsck smr allocations (verify num_dead_bytes, alloc past write pointer)
os/bluestore: duplicate zone refs when cloning
os/bluestore: correct zoned freelist when device write pointers are ahead
os/bluestore/ZonedFreelistManager: whitespace
os/bluestore: fix startup vs device write pointers
blk/zoned: add get_zones() to fetch write pointers
os/bluestore: use 64 bit values for zone_state_t
os/bluestore: reimplement zone backrefs
os/bluestore: fix smr allocator init
os/bluestore: do not use null freelist with SMR
blk/zones: implement HMSMRDevice has KernelDevice child
os/bluestore: fix/simplify zoned_cleaner thread start error handling
os/bluestore: properly reset zoned allocator on startup
os/bluestore: force prefer_deferred_size=0 for smr
os/bluestore: drop SMR 64K min_alloc_size restriction
os/bluestore/ZonedAllocator: less verbose
os/bluestore/ZonedAllocator: simplify debug output prefix
os/bluestore/ZonedAllocator: be consistent with hex debug output
os/bluestore/ZonedAllocator: whitespace
blk/zoned: remove dead VDO code
blk/zoned: add reset_all_zones()
blk/zoned: print error during init
os/bluestore: adjust allocator+freelist interfaces for smr params
os/bluestore: select 'zoned' freelistmanager during mkfs, not mount
Reviewed-by: Igor Fedotov <ifedotov@suse.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
I hit a case where the shard key size didn't match, and it looked as though
this code somehow didn't get executed.
Signed-off-by: Sage Weil <sage@newdream.net>
|
| | | | |
Create the transaction inside of the SMR lock. Otherwise, we may get a
deadlock between the cleaner C and a normal write op W:
W C
_txc_create seq 1
lock alloc_and_submit
_txc_create seq 2
...
unlock alloc_and_submit
lock alloc_and_submit
...
block on flush
_txc_finish_io, but blocked by seq 1
<deadlock>
The root issue here is the txc's are misordered with respect to the
alloc_and_submit lock.
Fix by moving the _txc_create inside the lock!
Signed-off-by: Sage Weil <sage@newdream.net>
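
The ordering idea behind the fix can be sketched as follows: if the sequence number is assigned while the lock is held, sequence order always matches lock acquisition order. The `Submitter` type and its members are hypothetical and only stand in for the transaction-context machinery.

```cpp
#include <cstdint>
#include <mutex>

// Hypothetical model of "create the transaction inside the lock".
struct Submitter {
  std::mutex alloc_and_submit_lock;
  uint64_t next_seq = 1;
  uint64_t last_submitted = 0;

  // Returns the sequence number used for this submission.
  uint64_t submit() {
    std::lock_guard<std::mutex> l(alloc_and_submit_lock);
    uint64_t seq = next_seq++;  // stands in for _txc_create
    // ... allocate space and queue the I/O here ...
    last_submitted = seq;
    return seq;
  }
};
```

Because no sequence number is handed out before the lock is taken, a later lock holder can never end up waiting on an earlier sequence number that has not yet entered the critical section.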
|