author     Samuel Just <sjust@redhat.com>    2020-03-13 01:12:01 +0100
committer  Samuel Just <sjust@redhat.com>    2020-05-20 06:59:04 +0200
commit     be4093a61e66e81fb32aafc54e39a655d3259d2a (patch)
tree       e89e2a52402c05efe32996101b38aa3de9dea1ee /doc/dev
parent     doc: break deduplication.rst int several files (diff)
doc: add more information for manifest tiering
Signed-off-by: Samuel Just <sjust@redhat.com>
Signed-off-by: Myoungwon Oh <myoungwon.oh@samsung.com>
Diffstat
 -rw-r--r--  doc/dev/deduplication.rst          |  78
 -rw-r--r--  doc/dev/osd_internals/manifest.rst | 118
 -rw-r--r--  doc/dev/osd_internals/refcount.rst |  45
 3 files changed, 213 insertions, 28 deletions
diff --git a/doc/dev/deduplication.rst b/doc/dev/deduplication.rst
index 2dec9f03bc3..1a322d59910 100644
--- a/doc/dev/deduplication.rst
+++ b/doc/dev/deduplication.rst
@@ -89,34 +89,64 @@ scheme between replication and erasure coding depending on its usage and
 each pool can be placed in a different storage location depending on
 the required performance.
 
-Manifest Object: Metadata objects are stored in the
-base pool, which contains metadata for data deduplication.
-
-::
-
-  struct object_manifest_t {
-    enum {
-      TYPE_NONE = 0,
-      TYPE_REDIRECT = 1,
-      TYPE_CHUNKED = 2,
-    };
-    uint8_t type;  // redirect, chunked, ...
-    hobject_t redirect_target;
-    std::map<uint64_t, chunk_info_t> chunk_map;
-  }
+For usage details, please see :doc:`doc/dev/osd_internals/manifest.rst`
 
+==============
+Usage Patterns
+==============
+
+The different ceph interface layers present potentially different opportunities
+and costs for deduplication and tiering in general.
+
+RadosGW
+-------
+
+S3 big data workloads seem like a good opportunity for deduplication. These
+objects tend to be write once, read mostly objects which don't see partial
+overwrites. As such, it makes sense to fingerprint and dedup up front.
+
+Unlike cephfs and rbd, radosgw has a system for storing
+explicit metadata in the head object of a logical s3 object for
+locating the remaining pieces. As such, radosgw could use the
+refcounting machinery (osd_internals/refcount.rst) directly without
+needing direct support from rados for manifests.
+
+RBD/Cephfs
+----------
+
+RBD and CephFS both use deterministic naming schemes to partition
+block devices/file data over rados objects. As such, the redirection
+metadata would need to be included as part of rados, presumably
+transparently.
+
+Moreover, unlike radosgw, rbd/cephfs rados objects can see overwrites.
+For those objects, we don't really want to perform dedup, and we don't
+want to pay a write latency penalty in the hot path to do so anyway.
+As such, performing tiering and dedup on cold objects in the background
+is likely to be preferred.
+
+One important wrinkle, however, is that both rbd and cephfs workloads
+often feature usage of snapshots. This means that the rados manifest
+machinery needs robust support for snapshots.
+
+RADOS Machinery
+===============
 
-A chunk Object: Chunk objects are stored in the chunk pool. Chunk object contains chunk data
-and its reference count information.
+For more information on rados redirect/chunk/dedup support, see osd_internals/manifest.rst.
+For more information on rados refcount support, see osd_internals/refcount.rst.
 
+Status and Future Work
+======================
 
-Although chunk objects and manifest objects have a different purpose
-from existing objects, they can be handled the same way as
-original objects. Therefore, to support existing features such as replication,
-no additional operations for dedup are needed.
+At the moment, there exists some preliminary support for manifest
+objects within the osd as well as a dedup tool.
 
+RadosGW data warehouse workloads probably represent the largest
+opportunity for this feature, so the first priority is probably to add
+direct support for fingerprinting and redirects into the refcount pool
+to radosgw.
 
-Regarding how to use, please see :doc:`doc/dev/osd_internals/manifest.rst`
+Aside from radosgw, completing work on manifest object support in the
+osd, particularly as it relates to snapshots, would be the next step for
+rbd and cephfs workloads.
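The "fingerprint and dedup up front" approach proposed above for radosgw is essentially content-addressing: each chunk is named by a digest of its bytes, so identical chunks collide onto the same object in the chunk pool. The helper below is a minimal illustration of that naming step, not Ceph's implementation; the function name is invented here, and OpenSSL's SHA1 is assumed for the digest. ::

  #include <openssl/sha.h>

  #include <cstdio>
  #include <string>

  // Name a chunk by the SHA-1 of its payload; identical payloads map to
  // the same oid in the chunk pool, which is what makes dedup work.
  std::string fingerprint_oid(const unsigned char *data, size_t len)
  {
    unsigned char digest[SHA_DIGEST_LENGTH];
    SHA1(data, len, digest);  // corresponds to fingerprint_algorithm = sha1
    char hex[2 * SHA_DIGEST_LENGTH + 1];
    for (int i = 0; i < SHA_DIGEST_LENGTH; ++i)
      std::snprintf(hex + 2 * i, sizeof(hex) - 2 * i, "%02x", digest[i]);
    return std::string(hex, 2 * SHA_DIGEST_LENGTH);
  }

A radosgw-style consumer would write the chunk to that oid in the chunk pool, take a reference on it (see the refcount interface below), and record the oid in the head object's metadata.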
diff --git a/doc/dev/osd_internals/manifest.rst b/doc/dev/osd_internals/manifest.rst
index 806baa12889..bafc12a1802 100644
--- a/doc/dev/osd_internals/manifest.rst
+++ b/doc/dev/osd_internals/manifest.rst
@@ -3,10 +3,94 @@
 Manifest
 ========
 
+============
+Introduction
+============
 
-RAODS Interface
+As described in ../deduplication.rst, adding transparent redirect
+machinery to RADOS would enable a more capable tiering solution
+than RADOS currently has with "cache/tiering".
+
+See ../deduplication.rst
+
+At a high level, each object has a piece of metadata embedded in
+the object_info_t which can map subsets of the object data payload
+to (refcounted) objects in other pools.
+
+This document exists to detail:
+
+1. Manifest data structures
+2. Rados operations for manipulating manifests.
+3. How those operations interact with other features like snapshots.
+
+Data Structures
 ===============
 
+Each object contains an object_manifest_t embedded within the
+object_info_t (see osd_types.h):
+
+::
+
+  struct object_manifest_t {
+    enum {
+      TYPE_NONE = 0,
+      TYPE_REDIRECT = 1,
+      TYPE_CHUNKED = 2,
+    };
+    uint8_t type;  // redirect, chunked, ...
+    hobject_t redirect_target;
+    std::map<uint64_t, chunk_info_t> chunk_map;
+  };
+
+TODO: check the following
+
+The type enum reflects three possible states an object can be in:
+
+1. TYPE_NONE: normal rados object
+2. TYPE_REDIRECT: object payload is backed by a single object
+   specified by redirect_target
+3. TYPE_CHUNKED: object payload is distributed among objects with
+   size and offset specified by the chunk_map. chunk_map maps
+   the offset of the chunk to a chunk_info_t shown below, further
+   specifying the length, target oid, and flags.
+
+::
+
+  struct chunk_info_t {
+    typedef enum {
+      FLAG_DIRTY = 1,
+      FLAG_MISSING = 2,
+      FLAG_HAS_REFERENCE = 4,
+      FLAG_HAS_FINGERPRINT = 8,
+    } cflag_t;
+    uint32_t offset;
+    uint32_t length;
+    hobject_t oid;
+    cflag_t flags;   // FLAG_*
+  };
+
+TODO: apparently we specify the offset twice, with different widths
+
+Request Handling
+================
+
+Similarly to cache/tiering, the initial touchpoint is
+maybe_handle_manifest_detail.
+
+For manifest operations listed below, we return NOOP and continue on to
+dedicated handling within do_osd_ops.
+
+For redirect objects which haven't been promoted (apparently oi.size >
+0 indicates that it's present?), we proxy reads and writes.
+
+For reads on TYPE_CHUNKED, if can_proxy_chunked_read (basically, all
+of the ops are reads of extents in the object_manifest_t chunk_map),
+we proxy requests to those objects.
+
+RADOS Interface
+===============
+
 To set up deduplication pools, you must have two pools. One will act as the
 base pool and the other will act as the chunk pool. The base pool need to be
 configured with fingerprint_algorithm option as follows.
@@ -29,13 +113,15 @@ configured with fingerprint_algorithm option as follows.
 
 Operations:
 
-
  * set-redirect
 
    set a redirection between a base_object in the base_pool and a target_object
    in the target_pool.
    A redirected object will forward all operations from the client to the
    target_object. ::
+
+     void set_redirect(const std::string& tgt_obj, const IoCtx& tgt_ioctx,
+                       uint64_t tgt_version, int flag = 0);
 
      rados -p base_pool set-redirect <base_object> --target-pool <target_pool>
      <target_object>
 
@@ -44,6 +130,9 @@ Operations:
 
    set the chunk-offset in a source_object to make a link between it and a
    target_object.
 
    ::
+
+     void set_chunk(uint64_t src_offset, uint64_t src_length, const IoCtx& tgt_ioctx,
+                    std::string tgt_oid, uint64_t tgt_offset, int flag = 0);
 
      rados -p base_pool set-chunk <source_object> <offset> <length> --target-pool
      <caspool> <target_object> <taget-offset>
 
@@ -52,27 +141,32 @@ Operations:
 
    promote the object (including chunks).
 
    ::
+
+     void tier_promote();
 
      rados -p base_pool tier-promote <obj-name>
 
  * unset-manifest
 
    unset the manifest info in the object that has manifest.
 
    ::
+
+     void unset_manifest();
 
      rados -p base_pool unset-manifest <obj-name>
 
  * tier-flush
 
    flush the object which has chunks to the chunk pool.
 
    ::
 
-     rados -p base_pool tier-flush <obj-name>
+     void tier_flush();
+
+     rados -p base_pool tier-flush <obj-name>
 
 Dedup tool
 ==========
 
 Dedup tool has two features: finding an optimal chunk offset for dedup chunking
-and fixing the reference count.
+and fixing the reference count (see ./refcount.rst).
 
 * find an optimal chunk offset
 
@@ -151,3 +245,19 @@ and fixing the reference count.
 
    ceph-dedup-tool --op chunk_scrub --chunk_pool $CHUNK_POOL
 
+======================
+Status and Future Work
+======================
+
+At the moment, the above interfaces exist in rados, but have unclear
+interactions with snapshots.
+
+Snapshots
+---------
+
+Here are some design questions we'll need to tackle:
+
+1. set-redirect
+
+   * What happens if set on a clone?
+   *
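Taken together, the signatures quoted in the hunks above suggest a client-side flow along the following lines. This is a hedged sketch: it assumes set_redirect and set_chunk are exposed on librados::ObjectWriteOperation as shown, and that a tgt_version of 0 suits a fresh redirect; the helper names are invented for illustration. ::

  #include <rados/librados.hpp>

  #include <string>

  // Forward all ops on base_obj to tgt_obj in another pool (set-redirect).
  int redirect_whole_object(librados::IoCtx& base_ioctx, const std::string& base_obj,
                            librados::IoCtx& tgt_ioctx, const std::string& tgt_obj)
  {
    librados::ObjectWriteOperation op;
    op.set_redirect(tgt_obj, tgt_ioctx, 0 /* tgt_version */);
    return base_ioctx.operate(base_obj, &op);
  }

  // Map bytes [0, 512 KiB) of src_obj onto offset 0 of a chunk object
  // (set-chunk); the extent lands in the object_manifest_t chunk_map.
  int chunk_one_extent(librados::IoCtx& base_ioctx, const std::string& src_obj,
                       librados::IoCtx& chunk_ioctx, const std::string& chunk_oid)
  {
    librados::ObjectWriteOperation op;
    op.set_chunk(0, 512 * 1024, chunk_ioctx, chunk_oid, 0);
    return base_ioctx.operate(src_obj, &op);
  }

The rados CLI invocations shown above (set-redirect, set-chunk, tier-promote, unset-manifest, tier-flush) are the command-line equivalents of these operations.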
diff --git a/doc/dev/osd_internals/refcount.rst b/doc/dev/osd_internals/refcount.rst
new file mode 100644
index 00000000000..4d75ae01949
--- /dev/null
+++ b/doc/dev/osd_internals/refcount.rst
@@ -0,0 +1,45 @@
+========
+Refcount
+========
+
+
+Introduction
+============
+
+Deduplication, as described in ../deduplication.rst, needs a way to
+maintain a target pool of deduplicated chunks with atomic
+refcounting. To that end, there exists an osd object class
+refcount responsible for using the object class machinery to
+maintain refcounts on deduped chunks and ultimately remove them
+as the refcount hits 0.
+
+Class Interface
+===============
+
+See cls/refcount/cls_refcount_client*
+
+* cls_refcount_get
+
+  Atomically increments the refcount with the specified tag ::
+
+    void cls_refcount_get(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_put
+
+  Atomically decrements the refcount specified by the passed tag ::
+
+    void cls_refcount_put(librados::ObjectWriteOperation& op, const string& tag, bool implicit_ref = false);
+
+* cls_refcount_set
+
+  Atomically sets the set of refcounts to the passed list of tags ::
+
+    void cls_refcount_set(librados::ObjectWriteOperation& op, list<string>& refs);
+
+* cls_refcount_read
+
+  Dumps the current set of ref tags for the object ::
+
+    int cls_refcount_read(librados::IoCtx& io_ctx, string& oid, list<string> *refs, bool implicit_ref = false);
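To put the class interface above in context, a client queues the cls_refcount calls in an ObjectWriteOperation and applies it to the chunk object, so the increment or decrement executes atomically on the OSD. The helpers below are a sketch; only the cls_refcount_* signatures come from the interface above, while the function names and flow are this example's own. ::

  #include <rados/librados.hpp>

  #include "cls/refcount/cls_refcount_client.h"

  // Take a reference on a chunk under `tag`. The chunk stays alive as
  // long as at least one tag holds a reference.
  int chunk_ref_get(librados::IoCtx& chunk_ioctx, const std::string& chunk_oid,
                    const std::string& tag)
  {
    librados::ObjectWriteOperation op;
    cls_refcount_get(op, tag);
    return chunk_ioctx.operate(chunk_oid, &op);
  }

  // Drop the reference held under `tag`; when the last tag is put, the
  // refcount class removes the chunk object.
  int chunk_ref_put(librados::IoCtx& chunk_ioctx, const std::string& chunk_oid,
                    const std::string& tag)
  {
    librados::ObjectWriteOperation op;
    cls_refcount_put(op, tag);
    return chunk_ioctx.operate(chunk_oid, &op);
  }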