Diffstat (limited to 'doc/dev/crimson/pipeline.rst')
-rw-r--r--  doc/dev/crimson/pipeline.rst  124
1 file changed, 31 insertions(+), 93 deletions(-)
diff --git a/doc/dev/crimson/pipeline.rst b/doc/dev/crimson/pipeline.rst
index e9115c6d7c3..6e47b79d80b 100644
--- a/doc/dev/crimson/pipeline.rst
+++ b/doc/dev/crimson/pipeline.rst
@@ -2,96 +2,34 @@ The ``ClientRequest`` pipeline
 ==============================
 
-In crimson, exactly like in the classical OSD, a client request has data and
-ordering dependencies which must be satisfied before processing (actually
-a particular phase of) can begin. As one of the goals behind crimson is to
-preserve the compatibility with the existing OSD incarnation, the same semantic
-must be assured. An obvious example of such data dependency is the fact that
-an OSD needs to have a version of OSDMap that matches the one used by the client
-(``Message::get_min_epoch()``).
-
-If a dependency is not satisfied, the processing stops. It is crucial to note
-the same must happen to all other requests that are sequenced-after (due to
-their ordering requirements).
-
-There are a few cases when the blocking of a client request can happen.
-
-``ClientRequest::ConnectionPipeline::await_map``
-  wait for particular OSDMap version is available at the OSD level
-``ClientRequest::ConnectionPipeline::get_pg``
-  wait a particular PG becomes available on OSD
-``ClientRequest::PGPipeline::await_map``
-  wait on a PG being advanced to particular epoch
-``ClientRequest::PGPipeline::wait_for_active``
-  wait for a PG to become *active* (i.e. have ``is_active()`` asserted)
-``ClientRequest::PGPipeline::recover_missing``
-  wait on an object to be recovered (i.e. leaving the ``missing`` set)
-``ClientRequest::PGPipeline::get_obc``
-  wait on an object to be available for locking. The ``obc`` will be locked
-  before this operation is allowed to continue
-``ClientRequest::PGPipeline::process``
-  wait if any other ``MOSDOp`` message is handled against this PG
-
-At any moment, a ``ClientRequest`` being served should be in one and only one
-of the phases described above. Similarly, an object denoting particular phase
-can host not more than a single ``ClientRequest`` the same time. At low-level
-this is achieved with a combination of a barrier and an exclusive lock.
-They implement the semantic of a semaphore with a single slot for these exclusive
-phases.
-
-As the execution advances, request enters next phase and leaves the current one
-freeing it for another ``ClientRequest`` instance. All these phases form a pipeline
-which assures the order is preserved.
-
-These pipeline phases are divided into two ordering domains: ``ConnectionPipeline``
-and ``PGPipeline``. The former ensures order across a client connection while
-the latter does that across a PG. That is, requests originating from the same
-connection are executed in the same order as they were sent by the client.
-The same applies to the PG domain: when requests from multiple connections reach
-a PG, they are executed in the same order as they entered a first blocking phase
-of the ``PGPipeline``.
-
-Comparison with the classical OSD
-----------------------------------
-As the audience of this document are Ceph Developers, it seems reasonable to
-match the phases of crimson's ``ClientRequest`` pipeline with the blocking
-stages in the classical OSD. The names in the right column are names of
-containers (lists and maps) used to implement these stages. They are also
-already documented in the ``PG.h`` header.
-
-+----------------------------------------+--------------------------------------+
-| crimson                                | ceph-osd waiting list                |
-+========================================+======================================+
-|``ConnectionPipeline::await_map``       | ``OSDShardPGSlot::waiting`` and      |
-|``ConnectionPipeline::get_pg``          | ``OSDShardPGSlot::waiting_peering``  |
-+----------------------------------------+--------------------------------------+
-|``PGPipeline::await_map``               | ``PG::waiting_for_map``              |
-+----------------------------------------+--------------------------------------+
-|``PGPipeline::wait_for_active``         | ``PG::waiting_for_peered``           |
-|                                        +--------------------------------------+
-|                                        | ``PG::waiting_for_flush``            |
-|                                        +--------------------------------------+
-|                                        | ``PG::waiting_for_active``           |
-+----------------------------------------+--------------------------------------+
-|To be done (``PG_STATE_LAGGY``)         | ``PG::waiting_for_readable``         |
-+----------------------------------------+--------------------------------------+
-|To be done                              | ``PG::waiting_for_scrub``            |
-+----------------------------------------+--------------------------------------+
-|``PGPipeline::recover_missing``         | ``PG::waiting_for_unreadable_object``|
-|                                        +--------------------------------------+
-|                                        | ``PG::waiting_for_degraded_object``  |
-+----------------------------------------+--------------------------------------+
-|To be done (proxying)                   | ``PG::waiting_for_blocked_object``   |
-+----------------------------------------+--------------------------------------+
-|``PGPipeline::get_obc``                 | *obc rwlocks*                        |
-+----------------------------------------+--------------------------------------+
-|``PGPipeline::process``                 | ``PG::lock`` (roughly)               |
-+----------------------------------------+--------------------------------------+
-
-
-As the last word it might be worth to emphasize that the ordering implementations
-in both classical OSD and in crimson are stricter than a theoretical minimum one
-required by the RADOS protocol. For instance, we could parallelize read operations
-targeting the same object at the price of extra complexity but we don't -- the
-simplicity has won.
+RADOS requires writes on each object to be ordered. If a client
+submits a sequence of concurrent writes (doesn't wait for the prior to
+complete before submitting the next), that client may rely on the
+writes being completed in the order in which they are submitted.
+
+As a result, the client->osd communication and queueing mechanisms on
+both sides must take care to ensure that writes on a (connection,
+object) pair remain ordered for the entire process.
+
+crimson-osd enforces this ordering via Pipelines and Stages
+(crimson/osd/osd_operation.h). Upon arrival at the OSD, messages
+enter the ConnectionPipeline::AwaitActive stage and proceed
+through a sequence of pipeline stages:
+
+* ConnectionPipeline: per-connection stages representing the message handling
+  path prior to being handed off to the target PG
+* PerShardPipeline: intermediate Pipeline representing the hand off from the
+  receiving shard to the shard with the target PG.
+* CommonPGPipeline: represents processing on the target PG prior to obtaining
+  the ObjectContext for the target of the operation.
+* CommonOBCPipeline: represents the actual processing of the IO on the target
+  object
+
+Because CommonOBCPipeline is per-object rather than per-connection or
+per-pg, multiple requests on different objects may be in the same
+CommonOBCPipeline stage concurrently. This allows us to serve
+multiple reads in the same PG concurrently. We can also process
+writes on multiple objects concurrently up to the point at which the
+write is actually submitted.
+
+See crimson/osd/osd_operations/client_request.(h|cc) for details.
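To make the ordering scheme described in the new text more concrete, here is a
minimal, self-contained C++ sketch. It is not crimson code (crimson's stages are
seastar-future based and live in crimson/osd/osd_operation.h); every name below
(ExclusiveStage, ToyObjectPipeline, handle_write) is invented for illustration.
The sketch shows exclusive stages that admit operations in arrival order, an
operation acquiring the next stage before releasing the current one, and a
per-object pipeline map that lets requests on different objects proceed
concurrently::

   // Toy model of ordered, exclusive pipeline stages; not the crimson API.
   #include <condition_variable>
   #include <cstdint>
   #include <functional>
   #include <iostream>
   #include <map>
   #include <mutex>
   #include <string>
   #include <thread>
   #include <vector>

   // At most one operation holds the stage at a time; waiters are admitted
   // strictly in the order in which they called enter().
   class ExclusiveStage {
     std::mutex m;
     std::condition_variable cv;
     uint64_t next_ticket = 0;   // order of arrival at this stage
     uint64_t now_serving = 0;   // ticket currently allowed to hold the stage
   public:
     void enter() {
       std::unique_lock lock{m};
       const uint64_t ticket = next_ticket++;
       cv.wait(lock, [&] { return ticket == now_serving; });
     }
     void exit() {
       std::lock_guard lock{m};
       ++now_serving;
       cv.notify_all();
     }
   };

   // A pipeline is an ordered sequence of exclusive stages. An operation
   // acquires the next stage *before* releasing the current one, so two
   // operations that entered the pipeline in some order can never overtake
   // each other. Stage names are invented, not crimson's.
   struct ToyObjectPipeline {
     ExclusiveStage get_obc;
     ExclusiveStage process;
   };

   void handle_write(ToyObjectPipeline& pipe, const std::string& obj, int seq) {
     pipe.get_obc.enter();   // e.g. look up / lock the object context
     pipe.process.enter();   // acquire the next stage first ...
     pipe.get_obc.exit();    // ... then release the previous one
     std::string msg =
       "write " + std::to_string(seq) + " submitted on " + obj + "\n";
     std::cout << msg;
     pipe.process.exit();
   }

   int main() {
     // One pipeline per object: writes to different objects proceed
     // independently, while writes to the same object stay ordered by their
     // arrival at the pipeline's first stage.
     std::map<std::string, ToyObjectPipeline> per_object;
     ToyObjectPipeline& a = per_object["obj-a"];
     ToyObjectPipeline& b = per_object["obj-b"];

     std::vector<std::thread> clients;
     clients.emplace_back(handle_write, std::ref(a), "obj-a", 1);
     clients.emplace_back(handle_write, std::ref(a), "obj-a", 2);
     clients.emplace_back(handle_write, std::ref(b), "obj-b", 1);
     for (auto& t : clients) {
       t.join();
     }
     return 0;
   }

In crimson itself the same enter-next-stage-then-release-current discipline
applies, but stage entry is future-based rather than thread-blocking, and the
per-object stages correspond to CommonOBCPipeline as described above.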