author | Anthony D'Atri <anthony.datri@gmail.com> | 2020-10-20 09:51:44 +0200 |
---|---|---|
committer | Anthony D'Atri <anthony.datri@gmail.com> | 2020-10-21 19:56:05 +0200 |
commit | ade052e08a1ee5d747a1fac4bf3763224b787e41 | |
tree | 572269dda03a4542f7e126b8cfcf42f82625b603 | |
parent | Merge pull request #37644 from mgfritch/orch-device-lsm_data | |
doc/rados/operations: clarity, detail, modernization
Signed-off-by: Anthony D'Atri <anthony.datri@gmail.com>
-rw-r--r-- | doc/rados/operations/add-or-rm-mons.rst | 49 |
-rw-r--r-- | doc/rados/operations/change-mon-elections.rst | 10 |
-rw-r--r-- | doc/rados/operations/control.rst | 22 |
-rw-r--r-- | doc/rados/operations/crush-map-edits.rst | 25 |
-rw-r--r-- | doc/rados/operations/crush-map.rst | 337 |
-rw-r--r-- | doc/rados/operations/stretch-mode.rst | 2 |
6 files changed, 242 insertions, 203 deletions
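The monitor-count guidance in the add-or-rm-mons.rst hunk (two monitors tolerate no failures, three and four tolerate one, five tolerate two) follows from a strict-majority quorum. Below is a minimal Python sketch of that arithmetic, assuming quorum is simply "more than half of the configured monitors". The function names are illustrative and are not part of Ceph.

```python
def quorum_size(monitors: int) -> int:
    """Smallest strict majority of `monitors` (assumed quorum rule)."""
    return monitors // 2 + 1


def tolerated_failures(monitors: int) -> int:
    """Monitors that can be down while a strict majority remains."""
    return monitors - quorum_size(monitors)


if __name__ == "__main__":
    # One row per cluster size: monitors, quorum needed, failures tolerated.
    for n in range(1, 6):
        print(n, quorum_size(n), tolerated_failures(n))
```

Comparing the rows for three and four monitors shows why an even count buys no extra failure tolerance: the quorum requirement rises by one while the tolerated failures stay the same.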
diff --git a/doc/rados/operations/add-or-rm-mons.rst b/doc/rados/operations/add-or-rm-mons.rst index ba03839aefd..b0edb199bd8 100644 --- a/doc/rados/operations/add-or-rm-mons.rst +++ b/doc/rados/operations/add-or-rm-mons.rst @@ -13,32 +13,41 @@ or `Monitor Bootstrap`_. Adding Monitors =============== -Ceph monitors are light-weight processes that maintain a master copy of the -cluster map. You can run a cluster with 1 monitor. We recommend at least 3 -monitors for a production cluster. Ceph monitors use a variation of the -`Paxos`_ protocol to establish consensus about maps and other critical +Ceph monitors are lightweight processes that are the single source of truth +for the cluster map. You can run a cluster with 1 monitor but we recommend at least 3 +for a production cluster. Ceph monitors use a variation of the +`Paxos`_ algorithm to establish consensus about maps and other critical information across the cluster. Due to the nature of Paxos, Ceph requires -a majority of monitors running to establish a quorum (thus establishing +a majority of monitors to be active to establish a quorum (thus establishing consensus). -It is advisable to run an odd-number of monitors but not mandatory. An -odd-number of monitors has a higher resiliency to failures than an -even-number of monitors. For instance, on a 2 monitor deployment, no -failures can be tolerated in order to maintain a quorum; with 3 monitors, -one failure can be tolerated; in a 4 monitor deployment, one failure can -be tolerated; with 5 monitors, two failures can be tolerated. This is -why an odd-number is advisable. Summarizing, Ceph needs a majority of -monitors to be running (and able to communicate with each other), but that +It is advisable to run an odd number of monitors. An +odd number of monitors is more resilient than an +even number. 
For instance, with a two monitor deployment, no +failures can be tolerated and still maintain a quorum; with three monitors, +one failure can be tolerated; in a four monitor deployment, one failure can +be tolerated; with five monitors, two failures can be tolerated. This avoids +the dreaded *split brain* phenomenon, and is why an odd number is best. +In short, Ceph needs a majority of +monitors to be active (and able to communicate with each other), but that majority can be achieved using a single monitor, or 2 out of 2 monitors, 2 out of 3, 3 out of 4, etc. -For an initial deployment of a multi-node Ceph cluster, it is advisable to -deploy three monitors, increasing the number two at a time if a valid need -for more than three exists. +For small or non-critical deployments of multi-node Ceph clusters, it is +advisable to deploy three monitors, and to increase the number of monitors +to five for larger clusters or to survive a double failure. There is rarely +justification for seven or more. -Since monitors are light-weight, it is possible to run them on the same -host as an OSD; however, we recommend running them on separate hosts, -because fsync issues with the kernel may impair performance. +Since monitors are lightweight, it is possible to run them on the same +host as OSDs; however, we recommend running them on separate hosts, +because `fsync` issues with the kernel may impair performance. +Dedicated monitor nodes also minimize disruption since monitor and OSD +daemons are not inactive at the same time when a node crashes or is +taken down for maintenance. + +Dedicated +monitor nodes also make for cleaner maintenance by avoiding both OSDs and +a mon going down if a node is rebooted, taken down, or crashes. .. note:: A *majority* of monitors in your cluster must be able to reach each other in order to establish a quorum. 
@@ -131,7 +140,7 @@ Removing Monitors ================= When you remove monitors from a cluster, consider that Ceph monitors use -PAXOS to establish consensus about the master cluster map. You must have +Paxos to establish consensus about the master cluster map. You must have a sufficient number of monitors to establish a quorum for consensus about the cluster map. diff --git a/doc/rados/operations/change-mon-elections.rst b/doc/rados/operations/change-mon-elections.rst index ee4cf1dc93e..558e7d85b78 100644 --- a/doc/rados/operations/change-mon-elections.rst +++ b/doc/rados/operations/change-mon-elections.rst @@ -4,9 +4,8 @@ Configure Monitor Election Strategies ===================================== -By default, the monitors will use the classic option it has always used. We -recommend you stay in this mode unless you require features in the other -modes. +By default, the monitors will use the ``classic`` mode. We +recommend that you stay in this mode unless you have a very specific reason. If you want to switch modes BEFORE constructing the cluster, change the ``mon election default strategy`` option. This option is an integer value: @@ -48,8 +47,9 @@ The connectivity Mode ===================== This mode evaluates connection scores provided by each monitor for its peers and elects the monitor with the highest score. This mode is designed -to handle netsplits, which may happen if your cluster is stretched across -multiple data centers or otherwise susceptible. +to handle network partitioning or *net-splits*, which may happen if your cluster +is stretched across multiple data centers or otherwise has a non-uniform +or unbalanced network topology. This mode also supports disallowing monitors from being the leader using the same commands as above in disallow. 
diff --git a/doc/rados/operations/control.rst b/doc/rados/operations/control.rst index 7ec372282c2..126f72bc66e 100644 --- a/doc/rados/operations/control.rst +++ b/doc/rados/operations/control.rst @@ -8,7 +8,7 @@ Monitor Commands ================ -Monitor commands are issued using the ceph utility:: +Monitor commands are issued using the ``ceph`` utility:: ceph [-m monhost] {command} @@ -20,12 +20,12 @@ The command is usually (though not always) of the form:: System Commands =============== -Execute the following to display the current status of the cluster. :: +Execute the following to display the current cluster status. :: ceph -s ceph status -Execute the following to display a running summary of the status of the cluster, +Execute the following to display a running summary of cluster status and major events. :: ceph -w @@ -59,11 +59,15 @@ To list the cluster's keys and their capabilities, execute the following:: Placement Group Subsystem ========================= -To display the statistics for all placement groups, execute the following:: +To display the statistics for all placement groups (PGs), execute the following:: ceph pg dump [--format {format}] The valid formats are ``plain`` (default), ``json`` ``json-pretty``, ``xml``, and ``xml-pretty``. +When implementing monitoring and other tools, it is best to use ``json`` format. +JSON parsing is more deterministic than the human-oriented ``plain``, and the layout is much +less variable from release to release. The ``jq`` utility can be invaluable when extracting +data from JSON output. To display the statistics for all placement groups stuck in a specified state, execute the following:: @@ -115,7 +119,7 @@ The foregoing is functionally equivalent to :: Dump the OSD map. Valid formats for ``-f`` are ``plain``, ``json``, ``json-pretty``, ``xml``, and ``xml-pretty``. If no ``--format`` option is given, the OSD map is -dumped as plain text. :: +dumped as plain text. 
As above, JSON format is best for tools, scripting, and other automation. :: ceph osd dump [--format {format}] @@ -149,7 +153,7 @@ Set the weight of the item given by ``{name}`` to ``{weight}``. :: ceph osd crush reweight {name} {weight} -Mark an OSD as lost. This may result in permanent data loss. Use with caution. :: +Mark an OSD as ``lost``. This may result in permanent data loss. Use with caution. :: ceph osd lost {id} [--yes-i-really-mean-it] @@ -162,7 +166,7 @@ Remove the given OSD(s). :: ceph osd rm [{id}...] -Query the current max_osd parameter in the OSD map. :: +Query the current ``max_osd`` parameter in the OSD map. :: ceph osd getmaxosd @@ -170,8 +174,8 @@ Import the given crush map. :: ceph osd setcrushmap -i file -Set the ``max_osd`` parameter in the OSD map. This is necessary when -expanding the storage cluster. :: +Set the ``max_osd`` parameter in the OSD map. This defaults to 10000 now so +most admins will never need to adjust this. :: ceph osd setmaxosd diff --git a/doc/rados/operations/crush-map-edits.rst b/doc/rados/operations/crush-map-edits.rst index 452140a77b6..aea48ae7d5b 100644 --- a/doc/rados/operations/crush-map-edits.rst +++ b/doc/rados/operations/crush-map-edits.rst @@ -1,14 +1,14 @@ Manually editing a CRUSH Map ============================ -.. note:: Manually editing the CRUSH map is considered an advanced +.. note:: Manually editing the CRUSH map is an advanced administrator operation. All CRUSH changes that are necessary for the overwhelming majority of installations are possible via the standard ceph CLI and do not require manual CRUSH map edits. If you have identified a use case where - manual edits *are* necessary, consider contacting the Ceph - developers so that future versions of Ceph can make this - unnecessary. + manual edits *are* necessary with recent Ceph releases, consider + contacting the Ceph developers so that future versions of Ceph + can obviate your corner case. 
To edit an existing CRUSH map: @@ -77,13 +77,12 @@ Sections There are six main sections to a CRUSH Map. -#. **tunables:** The preamble at the top of the map described any *tunables* - for CRUSH behavior that vary from the historical/legacy CRUSH behavior. These - correct for old bugs, optimizations, or other changes in behavior that have +#. **tunables:** The preamble at the top of the map describes any *tunables* + that differ from the historical / legacy CRUSH behavior. These + correct for old bugs, optimizations, or other changes that have been made over the years to improve CRUSH's behavior. -#. **devices:** Devices are individual ``ceph-osd`` daemons that can - store data. +#. **devices:** Devices are individual OSDs that store data. #. **types**: Bucket ``types`` define the types of buckets used in your CRUSH hierarchy. Buckets consist of a hierarchical aggregation @@ -108,10 +107,10 @@ There are six main sections to a CRUSH Map. CRUSH Map Devices ----------------- -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. +Devices are individual OSDs that store data. Usually one is defined here for each +OSD daemon in your +cluster. Devices are identified by an ``id`` (a non-negative integer) and +a ``name``, normally ``osd.N`` where ``N`` is the device id. .. _crush-map-device-class: diff --git a/doc/rados/operations/crush-map.rst b/doc/rados/operations/crush-map.rst index 8e00cdaa9e0..2558a42fee3 100644 --- a/doc/rados/operations/crush-map.rst +++ b/doc/rados/operations/crush-map.rst @@ -3,47 +3,49 @@ ============ The :abbr:`CRUSH (Controlled Replication Under Scalable Hashing)` algorithm -determines how to store and retrieve data by computing data storage locations. +determines how to store and retrieve data by computing storage locations. 
CRUSH empowers Ceph clients to communicate with OSDs directly rather than through a centralized server or broker. With an algorithmically determined method of storing and retrieving data, Ceph avoids a single point of failure, a performance bottleneck, and a physical limit to its scalability. -CRUSH requires a map of your cluster, and uses the CRUSH map to pseudo-randomly -store and retrieve data in OSDs with a uniform distribution of data across the -cluster. For a detailed discussion of CRUSH, see +CRUSH uses a map of your cluster (the CRUSH map) to pseudo-randomly +map data to OSDs, distributing it across the cluster according to configured +replication policy and failure domain. For a detailed discussion of CRUSH, see `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_ -CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a list of -'buckets' for aggregating the devices into physical locations, and a list of -rules that tell CRUSH how it should replicate data in a Ceph cluster's pools. By +CRUSH maps contain a list of :abbr:`OSDs (Object Storage Devices)`, a hierarchy +of 'buckets' for aggregating devices and buckets, and +rules that govern how CRUSH replicates data within the cluster's pools. By reflecting the underlying physical organization of the installation, CRUSH can -model—and thereby address—potential sources of correlated device failures. -Typical sources include physical proximity, a shared power source, and a shared -network. By encoding this information into the cluster map, CRUSH placement -policies can separate object replicas across different failure domains while -still maintaining the desired distribution. For example, to address the +model (and thereby address) the potential for correlated device failures. +Typical factors include chassis, racks, physical proximity, a shared power +source, and shared networking. 
By encoding this information into the cluster +map, CRUSH placement +policies distribute object replicas across failure domains while +maintaining the desired distribution. For example, to address the possibility of concurrent failures, it may be desirable to ensure that data replicas are on devices using different shelves, racks, power supplies, controllers, and/or physical locations. -When you deploy OSDs they are automatically placed within the CRUSH map under a -``host`` node named with the hostname for the host they are running on. This, -combined with the default CRUSH failure domain, ensures that replicas or erasure -code shards are separated across hosts and a single host failure will not -affect availability. For larger clusters, however, administrators should carefully consider their choice of failure domain. Separating replicas across racks, -for example, is common for mid- to large-sized clusters. +When you deploy OSDs they are automatically added to the CRUSH map under a +``host`` bucket named for the node on which they run. This, +combined with the configured CRUSH failure domain, ensures that replicas or +erasure code shards are distributed across hosts and that a single host or other +failure will not affect availability. For larger clusters, administrators must +carefully consider their choice of failure domain. Separating replicas across racks, +for example, is typical for mid- to large-sized clusters. CRUSH Location ============== -The location of an OSD in terms of the CRUSH map's hierarchy is -referred to as a ``crush location``. This location specifier takes the -form of a list of key and value pairs describing a position. For +The location of an OSD within the CRUSH map's hierarchy is +referred to as a ``CRUSH location``. This location specifier takes the +form of a list of key and value pairs. 
For example, if an OSD is in a particular row, rack, chassis and host, and -is part of the 'default' CRUSH tree (this is the case for the vast -majority of clusters), its crush location could be described as:: +is part of the 'default' CRUSH root (which is the case for most +clusters), its CRUSH location could be described as:: root=default row=a rack=a2 chassis=a2a host=a2a1 @@ -51,51 +53,53 @@ Note: #. Note that the order of the keys does not matter. #. The key name (left of ``=``) must be a valid CRUSH ``type``. By default - these include root, datacenter, room, row, pod, pdu, rack, chassis and host, - but those types can be customized to be anything appropriate by modifying - the CRUSH map. + these include ``root``, ``datacenter``, ``room``, ``row``, ``pod``, ``pdu``, + ``rack``, ``chassis`` and ``host``. + These defined types suffice for almost all clusters, but can be customized + by modifying the CRUSH map. #. Not all keys need to be specified. For example, by default, Ceph - automatically sets a ``ceph-osd`` daemon's location to be + automatically sets an ``OSD``'s location to be ``root=default host=HOSTNAME`` (based on the output from ``hostname -s``). -The crush location for an OSD is normally expressed via the ``crush location`` -config option being set in the ``ceph.conf`` file. Each time the OSD starts, +The CRUSH location for an OSD can be defined by adding the ``crush location`` +option in ``ceph.conf``. Each time the OSD starts, it verifies it is in the correct location in the CRUSH map and, if it is not, it moves itself. To disable this automatic CRUSH map management, add the following to your configuration file in the ``[osd]`` section:: osd crush update on start = false +Note that in most cases you will not need to manually configure this. + Custom location hooks --------------------- A customized location hook can be used to generate a more complete -crush location on startup. The crush location is based on, in order +CRUSH location on startup. 
The CRUSH location is based on, in order of preference: -#. A ``crush location`` option in ceph.conf. +#. A ``crush location`` option in ``ceph.conf`` #. A default of ``root=default host=HOSTNAME`` where the hostname is - generated with the ``hostname -s`` command. + derived from the ``hostname -s`` command -This is not useful by itself, as the OSD itself has the exact same -behavior. However, a script can be written to provide additional -location fields (for example, the rack or datacenter), and then the +A script can be written to provide additional +location fields (for example, ``rack`` or ``datacenter``) and the hook enabled via the config option:: - crush location hook = /path/to/customized-ceph-crush-location + crush location hook = /path/to/customized-ceph-crush-location This hook is passed several arguments (below) and should output a single line -to stdout with the CRUSH location description.:: +to ``stdout`` with the CRUSH location description.:: --cluster CLUSTER --id ID --type TYPE -where the cluster name is typically 'ceph', the id is the daemon +where the cluster name is typically ``ceph``, the ``id`` is the daemon identifier (e.g., the OSD number or daemon identifier), and the daemon -type is ``osd``, ``mds``, or similar. +type is ``osd``, ``mds``, etc. -For example, a simple hook that additionally specified a rack location -based on a hypothetical file ``/etc/rack`` might be:: +For example, a simple hook that additionally specifies a rack location +based on a value in the file ``/etc/rack`` might be:: #!/bin/sh echo "host=$(hostname -s) rack=$(cat /etc/rack) root=default" @@ -104,10 +108,10 @@ based on a hypothetical file ``/etc/rack`` might be:: CRUSH structure =============== -The CRUSH map consists of, loosely speaking, a hierarchy describing -the physical topology of the cluster, and a set of rules defining -policy about how we place data on those devices. 
The hierarchy has -devices (``ceph-osd`` daemons) at the leaves, and internal nodes +The CRUSH map consists of a hierarchy that describes +the physical topology of the cluster and a set of rules defining +data placement policy. The hierarchy has +devices (OSDs) at the leaves, and internal nodes corresponding to other physical features or groupings: hosts, racks, rows, datacenters, and so on. The rules describe how replicas are placed in terms of that hierarchy (e.g., 'three replicas in different @@ -116,36 +120,35 @@ racks'). Devices ------- -Devices are individual ``ceph-osd`` daemons that can store data. You -will normally have one defined here for each OSD daemon in your -cluster. Devices are identified by an id (a non-negative integer) and -a name, normally ``osd.N`` where ``N`` is the device id. +Devices are individual OSDs that store data, usually one for each storage drive. +Devices are identified by an ``id`` +(a non-negative integer) and a ``name``, normally ``osd.N`` where ``N`` is the device id. -Devices may also have a *device class* associated with them (e.g., -``hdd`` or ``ssd``), allowing them to be conveniently targeted by a -crush rule. +Since the Luminous release, devices may also have a *device class* assigned (e.g., +``hdd`` or ``ssd`` or ``nvme``), allowing them to be conveniently targeted by +CRUSH rules. This is especially useful when mixing device types within hosts. Types and Buckets ----------------- A bucket is the CRUSH term for internal nodes in the hierarchy: hosts, racks, rows, etc. The CRUSH map defines a series of *types* that are -used to describe these nodes. By default, these types include: - -- osd (or device) -- host -- chassis -- rack -- row -- pdu -- pod -- room -- datacenter -- zone -- region -- root - -Most clusters make use of only a handful of these types, and others +used to describe these nodes. 
Default types include: + +- ``osd`` (or ``device``) +- ``host`` +- ``chassis`` +- ``rack`` +- ``row`` +- ``pdu`` +- ``pod`` +- ``room`` +- ``datacenter`` +- ``zone`` +- ``region`` +- ``root`` + +Most clusters use only a handful of these types, and others can be defined as needed. The hierarchy is built with devices (normally type ``osd``) at the @@ -171,35 +174,33 @@ leaves, interior nodes with non-device types, and a root node of type +-----------+ +-----------+ +-----------+ +-----------+ Each node (device or bucket) in the hierarchy has a *weight* -associated with it, indicating the relative proportion of the total +that indicates the relative proportion of the total data that device or hierarchy subtree should store. Weights are set at the leaves, indicating the size of the device, and automatically -sum up the tree from there, such that the weight of the default node +sum up the tree, such that the weight of the ``root`` node will be the total of all devices contained beneath it. Normally weights are in units of terabytes (TB). -You can get a simple view the CRUSH hierarchy for your cluster, -including the weights, with:: +You can get a simple view the of CRUSH hierarchy for your cluster, +including weights, with:: - ceph osd crush tree + ceph osd tree Rules ----- -Rules define policy about how data is distributed across the devices -in the hierarchy. - -CRUSH rules define placement and replication strategies or +CRUSH Rules define policy about how data is distributed across the devices +in the hierarchy. They define placement and replication strategies or distribution policies that allow you to specify exactly how CRUSH -places object replicas. For example, you might create a rule selecting -a pair of targets for 2-way mirroring, another rule for selecting -three targets in two different data centers for 3-way mirroring, and -yet another rule for erasure coding over six storage devices. For a +places data replicas. 
For example, you might create a rule selecting +a pair of targets for two-way mirroring, another rule for selecting +three targets in two different data centers for three-way mirroring, and +yet another rule for erasure coding (EC) across six storage devices. For a detailed discussion of CRUSH rules, refer to `CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data`_, and more specifically to **Section 3.2**. -In almost all cases, CRUSH rules can be created via the CLI by +CRUSH rules can be created via the CLI by specifying the *pool type* they will be used for (replicated or erasure coded), the *failure domain*, and optionally a *device class*. In rare cases rules must be written by hand by manually editing the @@ -216,8 +217,8 @@ You can view the contents of the rules with:: Device classes -------------- -Each device can optionally have a *class* associated with it. By -default, OSDs automatically set their class on startup to either +Each device can optionally have a *class* assigned. By +default, OSDs automatically set their class at startup to `hdd`, `ssd`, or `nvme` based on the type of device they are backed by. @@ -243,8 +244,8 @@ A pool can then be changed to use the new rule with:: Device classes are implemented by creating a "shadow" CRUSH hierarchy for each device class in use that contains only devices of that class. -Rules can then distribute data over the shadow hierarchy. One nice -thing about this approach is that it is fully backward compatible with +CRUSH rules can then distribute data over the shadow hierarchy. +This approach is fully backward compatible with old Ceph clients. You can view the CRUSH hierarchy with shadow items with:: @@ -263,10 +264,10 @@ A *weight set* is an alternative set of weights to use when calculating data placement. The normal weights associated with each device in the CRUSH map are set based on the device size and indicate how much data we *should* be storing where. 
However, because CRUSH is -based on a pseudorandom placement process, there is always some -variation from this ideal distribution, the same way that rolling a -dice sixty times will not result in rolling exactly 10 ones and 10 -sixes. Weight sets allow the cluster to do a numerical optimization +a "probabilistic" pseudorandom placement process, there is always some +variation from this ideal distribution, in the same way that rolling a +die sixty times will not result in rolling exactly 10 ones and 10 +sixes. Weight sets allow the cluster to perform numerical optimization based on the specifics of your cluster (hierarchy, pools, etc.) to achieve a balanced distribution. @@ -294,7 +295,7 @@ When weight sets are in use, the weights associated with each node in the hierarchy is visible as a separate column (labeled either ``(compat)`` or the pool name) from the command:: - ceph osd crush tree + ceph osd tree When both *compat* and *per-pool* weight sets are in use, data placement for a particular pool will use its own per-pool weight set @@ -302,8 +303,8 @@ if present. If not, it will use the compat weight set if present. If neither are present, it will use the normal CRUSH weights. Although weight sets can be set up and manipulated by hand, it is -recommended that the *balancer* module be enabled to do so -automatically. +recommended that the ``ceph-mgr`` *balancer* module be enabled to do so +automatically when running Luminous or later releases. Modifying the CRUSH map @@ -368,7 +369,7 @@ Adjust OSD weight with the correct weight when they are created. This command is rarely needed. -To adjust an OSD's crush weight in the CRUSH map of a running cluster, execute +To adjust an OSD's CRUSH weight in the CRUSH map of a running cluster, execute the following:: ceph osd crush reweight {name} {weight} @@ -417,12 +418,15 @@ Where: Add a Bucket ------------ -.. note: Buckets are normally implicitly created when an OSD is added +.. 
note: Buckets are implicitly created when an OSD is added that specifies a ``{bucket-type}={bucket-name}`` as part of its - location and a bucket with that name does not already exist. This + location, if a bucket with that name does not already exist. This command is typically used when manually adjusting the structure of the - hierarchy after OSDs have been created (for example, to move a - series of hosts underneath a new rack-level bucket). + hierarchy after OSDs have been created. One use is to move a + series of hosts underneath a new rack-level bucket; another is to + add new ``host`` buckets (OSD nodes) to a dummy ``root`` so that they don't + receive data until you're ready, at which time you would move them to the + ``default`` or other root as described below. To add a bucket in the CRUSH map of a running cluster, execute the ``ceph osd crush add-bucket`` command:: @@ -478,7 +482,7 @@ Where: Remove a Bucket --------------- -To remove a bucket from the CRUSH map hierarchy, execute the following:: +To remove a bucket from the CRUSH hierarchy, execute the following:: ceph osd crush remove {bucket-name} @@ -565,12 +569,12 @@ Creating a rule for a replicated pool For a replicated pool, the primary decision when creating the CRUSH rule is what the failure domain is going to be. For example, if a failure domain of ``host`` is selected, then CRUSH will ensure that -each replica of the data is stored on a different host. If ``rack`` +each replica of the data is stored on a unique host. If ``rack`` is selected, then each replica will be stored in a different rack. -What failure domain you choose primarily depends on the size of your -cluster and how your hierarchy is structured. +What failure domain you choose primarily depends on the size and +topology of your cluster. -Normally, the entire cluster hierarchy is nested beneath a root node +In most cases the entire cluster hierarchy is nested beneath a root node named ``default``. 
If you have customized your hierarchy, you may want to create a rule nested at some other node in the hierarchy. It doesn't matter what type is associated with that node (it doesn't have @@ -611,7 +615,7 @@ Where: ``class`` -:Description: The device class data should be placed on. +:Description: The device class on which data should be placed. :Type: String :Required: No :Example: ``ssd`` @@ -619,8 +623,8 @@ Where: Creating a rule for an erasure coded pool ----------------------------------------- -For an erasure-coded pool, the same basic decisions need to be made as -with a replicated pool: what is the failure domain, what node in the +For an erasure-coded (EC) pool, the same basic decisions need to be made: +what is the failure domain, which node in the hierarchy will data be placed under (usually ``default``), and will placement be restricted to a specific device class. Erasure code pools are created a bit differently, however, because they need to be @@ -648,9 +652,9 @@ CRUSH rule that is created. The erasure code profile properties of interest are: - * **crush-root**: the name of the CRUSH node to place data under [default: ``default``]. - * **crush-failure-domain**: the CRUSH type to separate erasure-coded shards across [default: ``host``]. - * **crush-device-class**: the device class to place data on [default: none, meaning all devices are used]. + * **crush-root**: the name of the CRUSH node under which to place data [default: ``default``]. + * **crush-failure-domain**: the CRUSH bucket type across which to distribute erasure-coded shards [default: ``host``]. + * **crush-device-class**: the device class on which to place data [default: none, meaning all devices are used]. * **k** and **m** (and, for the ``lrc`` plugin, **l**): these determine the number of erasure code shards, affecting the resulting CRUSH rule. 
Once a profile is defined, you can create a CRUSH rule with:: @@ -685,7 +689,7 @@ In order to use newer tunables, both clients and servers must support the new version of CRUSH. For this reason, we have created ``profiles`` that are named after the Ceph version in which they were introduced. For example, the ``firefly`` tunables are first supported -in the firefly release, and will not work with older (e.g., dumpling) +by the Firefly release, and will not work with older (e.g., Dumpling) clients. Once a given set of tunables are changed from the legacy default behavior, the ``ceph-mon`` and ``ceph-osd`` will prevent older clients who do not support the new CRUSH features from connecting to @@ -694,23 +698,23 @@ the cluster. argonaut (legacy) ----------------- -The legacy CRUSH behavior used by argonaut and older releases works -fine for most clusters, provided there are not too many OSDs that have +The legacy CRUSH behavior used by Argonaut and older releases works +fine for most clusters, provided there are not many OSDs that have been marked out. bobtail (CRUSH_TUNABLES2) ------------------------- -The bobtail tunable profile fixes a few key misbehaviors: +The ``bobtail`` tunable profile fixes a few key misbehaviors: * For hierarchies with a small number of devices in the leaf buckets, some PGs map to fewer than the desired number of replicas. This commonly happens for hierarchies with "host" nodes with a small number (1-3) of OSDs nested beneath each one. - * For large clusters, some small percentages of PGs map to less than + * For large clusters, some small percentages of PGs map to fewer than the desired number of OSDs. This is more prevalent when there are - several layers of the hierarchy (e.g., row, rack, host, osd). + mutiple hierarchy layers in use (e.g., ``row``, ``rack``, ``host``, ``osd``). * When some OSDs are marked out, the data tends to get redistributed to nearby OSDs instead of across the entire hierarchy. 
@@ -734,49 +738,49 @@ The new tunables are:

 Migration impact:

- * Moving from argonaut to bobtail tunables triggers a moderate amount
+ * Moving from ``argonaut`` to ``bobtail`` tunables triggers a moderate amount
    of data movement. Use caution on a cluster that is already
    populated with data.

 firefly (CRUSH_TUNABLES3)
 -------------------------

-The firefly tunable profile fixes a problem
-with the ``chooseleaf`` CRUSH rule behavior that tends to result in PG
+The ``firefly`` tunable profile fixes a problem
+with ``chooseleaf`` CRUSH rule behavior that tends to result in PG
 mappings with too few results when too many OSDs have been marked out.

 The new tunable is:

  * ``chooseleaf_vary_r``: Whether a recursive chooseleaf attempt will
-   start with a non-zero value of r, based on how many attempts the
-   parent has already made. Legacy default is 0, but with this value
+   start with a non-zero value of ``r``, based on how many attempts the
+   parent has already made. Legacy default is ``0``, but with this value
    CRUSH is sometimes unable to find a mapping. The optimal value (in
-   terms of computational cost and correctness) is 1.
+   terms of computational cost and correctness) is ``1``.

 Migration impact:

- * For existing clusters that have lots of existing data, changing
-   from 0 to 1 will cause a lot of data to move; a value of 4 or 5
-   will allow CRUSH to find a valid mapping but will make less data
-   move.
+ * For existing clusters that house lots of data, changing
+   from ``0`` to ``1`` will cause a lot of data to move; a value of ``4`` or ``5``
+   will allow CRUSH to still find a valid mapping but will cause less data
+   to move.

 straw_calc_version tunable (introduced with Firefly too)
 --------------------------------------------------------

 There were some problems with the internal weights calculated and
-stored in the CRUSH map for ``straw`` buckets. Specifically, when
-there were items with a CRUSH weight of 0 or both a mix of weights and
-some duplicated weights CRUSH would distribute data incorrectly (i.e.,
+stored in the CRUSH map for ``straw`` algorithm buckets. Specifically, when
+there were items with a CRUSH weight of ``0``, or both a mix of different and
+unique weights, CRUSH would distribute data incorrectly (i.e.,
 not in proportion to the weights).

 The new tunable is:

- * ``straw_calc_version``: A value of 0 preserves the old, broken
-   internal weight calculation; a value of 1 fixes the behavior.
+ * ``straw_calc_version``: A value of ``0`` preserves the old, broken
+   internal weight calculation; a value of ``1`` fixes the behavior.

 Migration impact:

- * Moving to straw_calc_version 1 and then adjusting a straw bucket
+ * Moving to straw_calc_version ``1`` and then adjusting a straw bucket
    (by adding, removing, or reweighting an item, or by using the
    reweight-all command) can trigger a small to moderate amount of data
    movement *if* the cluster has hit one of the problematic
@@ -788,12 +792,12 @@ concerning the required kernel version in the client side.
 hammer (CRUSH_V4)
 -----------------

-The hammer tunable profile does not affect the
+The ``hammer`` tunable profile does not affect the
 mapping of existing CRUSH maps simply by changing the profile. However:

- * There is a new bucket type (``straw2``) supported. The new
-   ``straw2`` bucket type fixes several limitations in the original
-   ``straw`` bucket. Specifically, the old ``straw`` buckets would
+ * There is a new bucket algorithm (``straw2``) supported. The new
+   ``straw2`` bucket algorithm fixes several limitations in the original
+   ``straw``. Specifically, the old ``straw`` buckets would
    change some mappings that should have changed when a weight was
    adjusted, while ``straw2`` achieves the original goal of only
    changing mappings to or from the bucket item whose weight has
@@ -812,16 +816,17 @@ Migration impact:
 jewel (CRUSH_TUNABLES5)
 -----------------------

-The jewel tunable profile improves the
+The ``jewel`` tunable profile improves the
 overall behavior of CRUSH such that significantly fewer mappings
-change when an OSD is marked out of the cluster.
+change when an OSD is marked out of the cluster. This results in
+significantly less data movement.

 The new tunable is:

  * ``chooseleaf_stable``: Whether a recursive chooseleaf attempt will
    use a better value for an inner loop that greatly reduces the number
-   of mapping changes when an OSD is marked out. The legacy value is 0,
-   while the new value of 1 uses the new approach.
+   of mapping changes when an OSD is marked out. The legacy value is ``0``,
+   while the new value of ``1`` uses the new approach.

 Migration impact:

@@ -881,7 +886,7 @@ To make this warning go away, you have two options:

    If things go poorly (e.g., too much load) and not very much
    progress has been made, or there is a client compatibility problem
-   (old kernel cephfs or rbd clients, or pre-bobtail librados
+   (old kernel CephFS or RBD clients, or pre-Bobtail ``librados``
    clients), you can switch back with::

      ceph osd crush tunables legacy
@@ -920,8 +925,8 @@ A few important points
 Tuning CRUSH
 ------------

-The simplest way to adjust the crush tunables is by changing to a known
-profile. Those are:
+The simplest way to adjust CRUSH tunables is by applying them in matched
+sets known as *profiles*. As of the Octopus release these are:

  * ``legacy``: the legacy behavior from argonaut and earlier.
  * ``argonaut``: the legacy values supported by the original argonaut release
@@ -932,16 +937,19 @@ profile. Those are:
  * ``optimal``: the best (ie optimal) values of the current version of Ceph
  * ``default``: the default values of a new cluster installed from
    scratch. These values, which depend on the current version of Ceph,
-   are hard coded and are generally a mix of optimal and legacy values.
+   are hardcoded and are generally a mix of optimal and legacy values.
    These values generally match the ``optimal`` profile of the previous
-   LTS release, or the most recent release for which we generally except
-   more users to have up to date clients for.
+   LTS release, or the most recent release for which we generally expect
+   most users to have up-to-date clients for.

-You can select a profile on a running cluster with the command::
+You can apply a profile to a running cluster with the command::

      ceph osd crush tunables {PROFILE}

-Note that this may result in some data movement.
+Note that this may result in data movement, potentially quite a bit. Study
+release notes and documentation carefully before changing the profile on a
+running cluster, and consider throttling recovery/backfill parameters to
+limit the impact of a bolus of backfill.

 .. _CRUSH - Controlled, Scalable, Decentralized Placement of Replicated Data: https://ceph.com/wp-content/uploads/2016/08/weil-crush-sc06.pdf
@@ -950,9 +958,10 @@ Note that this may result in some data movement.
 Primary Affinity
 ================

-When a Ceph Client reads or writes data, it always contacts the primary OSD in
-the acting set. For set ``[2, 3, 4]``, ``osd.2`` is the primary. Sometimes an
-OSD is not well suited to act as a primary compared to other OSDs (e.g., it has
+When a Ceph Client reads or writes data, it first contacts the primary OSD in
+each affected PG's acting set. In the acting set ``[2, 3, 4]``, ``osd.2`` is
+listed first and thus is the primary (lead). Sometimes an
+OSD is less well suited to act as the lead than are other OSDs (e.g., it has
 a slow disk or a slow controller).
 To prevent performance bottlenecks
 (especially on read operations) while maximizing utilization of your hardware,
 you can set a Ceph OSD's primary affinity so that CRUSH is less likely to use
@@ -961,10 +970,28 @@ the OSD as a primary in an acting set. ::
     ceph osd primary-affinity <osd-id> <weight>

 Primary affinity is ``1`` by default (*i.e.,* an OSD may act as a primary). You
-may set the OSD primary range from ``0-1``, where ``0`` means that the OSD may
-**NOT** be used as a primary and ``1`` means that an OSD may be used as a
-primary. When the weight is ``< 1``, it is less likely that CRUSH will select
-the Ceph OSD Daemon to act as a primary.
-
+may set the OSD primary range as a real number in the range ``[0-1]``, where ``0``
+indicates that the OSD may **NOT** be used as a primary and ``1`` means that an
+OSD may be used as a primary. When the weight is ``< 1``, it is less likely that
+CRUSH will select the Ceph OSD Daemon to act as a primary. The process for
+selecting the lead OSD is more nuanced than a simple probability based on
+relative affinity values, but measurable results can be achieved even with
+first-order approximations of desirable values.
+
+There are occasional clusters
+that balance cost and performance by mixing SSDs and HDDs in the same pool.
+Careful application of CRUSH rules can direct each PG's acting set to contain
+exactly one SSD OSD with the balance HDDs. By using primary affinity one can
+direct most or all read operations to the SSD in the acting set. This is
+a tricky setup to maintain and it is discouraged, but it's a useful example.
+
+Another, more common scenario for primary affinity is when a cluster contains
+a mix of drive sizes, for example older racks with 1.9 TB SATA SSDs and newer racks with
+3.84 TB SATA SSDs. On average the latter will be assigned double the number of
+PGs and thus will serve double the number of write and read operations, thus
+they'll be busier than the former. A rough application of primary affinity in
+proportion to OSD size won't be 100% optimal, but it can readily achieve a 15%
+improvement in overall read throughput by utilizing SATA interface bandwidth
+and CPU cycles more evenly.
diff --git a/doc/rados/operations/stretch-mode.rst b/doc/rados/operations/stretch-mode.rst
index 3b1bc823103..147ebd440d1 100644
--- a/doc/rados/operations/stretch-mode.rst
+++ b/doc/rados/operations/stretch-mode.rst
@@ -9,7 +9,7 @@ Stretch Clusters
 ================
 Ceph generally expects all parts of its network and overall cluster to be
 equally reliable, with failures randomly distributed across the CRUSH map.
-So you may lose a switch that knocks out a big segment of OSDs, but we expect
+So you may lose a switch that knocks out a number of OSDs, but we expect
 the remaining OSDs and monitors to route around that.

 This is usually a good choice, but may not work well in some