.. _stretch_mode:

================
Stretch Clusters
================

Stretch Clusters
================
A stretch cluster is a cluster that has servers in geographically separated
data centers, distributed over a WAN. Stretch clusters have LAN-like high-speed,
low-latency connections, but a limited number of links between sites. Stretch
clusters have a higher
likelihood of (possibly asymmetric) network splits, and a higher likelihood of
temporary or complete loss of an entire data center (which can represent
one-third to one-half of the total cluster).
Ceph is designed with the expectation that all parts of its network and cluster
will be reliable and that failures will be distributed randomly across the
CRUSH map. Even if a switch goes down and causes the loss of many OSDs, Ceph is
designed so that the remaining OSDs and monitors will route around such a loss.
Sometimes this cannot be relied upon. If you have a "stretched-cluster"
deployment in which much of your cluster is behind a single network component,
you might need to use **stretch mode** to ensure data integrity.
Here we consider two standard configurations: a configuration with two
data centers (or, in clouds, two availability zones), and a configuration with
three data centers (or, in clouds, three availability zones).
In the two-site configuration, Ceph expects each of the sites to hold a copy of
the data, and Ceph also expects there to be a third site that has a tiebreaker
monitor. This tiebreaker monitor picks a winner if the network connection fails
and both data centers remain alive.
The tiebreaker monitor can be a VM. It can also have high latency relative to
the two main sites.
The standard Ceph configuration is able to survive MANY network failures or
data-center failures without ever compromising data availability. If enough
Ceph servers are brought back following a failure, the cluster *will* recover.
If you lose a data center but are still able to form a quorum of monitors and
still have all the data available, Ceph will maintain availability. (This
assumes that the cluster has enough copies to satisfy the pools' ``min_size``
configuration option, or (failing that) that the cluster has CRUSH rules in
place that will cause the cluster to re-replicate the data until the
``min_size`` configuration option has been met.)
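As a quick check of the replication constraints described above, you can
inspect a pool's ``size`` and ``min_size`` directly. The pool name ``mypool``
below is only an example:

.. prompt:: bash $

   ceph osd pool get mypool size
   ceph osd pool get mypool min_size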
Stretch Cluster Issues
======================
Ceph does not permit the compromise of data integrity and data consistency
under any circumstances. When service is restored after a network failure or a
loss of Ceph nodes, Ceph will restore itself to a state of normal functioning
without operator intervention.
Ceph does not permit the compromise of data integrity or data consistency, but
there are situations in which *data availability* is compromised. These
situations can occur even though there are enough servers available to satisfy
Ceph's consistency and sizing constraints. In some situations, you might
discover that your cluster does not satisfy those constraints.
The first category of these failures involves inconsistent networks: if there
is a netsplit (a disconnection that splits the network into two pieces), Ceph
might be unable to mark OSDs ``down`` and remove them from the acting PG sets,
even though the PG's primary OSD is unable to replicate data to them (a
situation that, under normal non-netsplit circumstances, would result in the
affected OSDs being marked ``down`` and removed from the PG). If this happens,
Ceph cannot satisfy its durability guarantees and consequently IO will not be
permitted.
The second category of failures that we will discuss involves the situation in
which the constraints are not sufficient to guarantee the replication of data
across data centers, though it might seem that the data is correctly replicated
across data centers. For example, in a scenario in which there are two data
centers named Data Center A and Data Center B, and the CRUSH rule targets three
replicas and places a replica in each data center with a ``min_size`` of ``2``,
the PG might go active with two replicas in Data Center A and zero replicas in
Data Center B. In a situation of this kind, the loss of Data Center A means
that the data is lost and Ceph will not be able to operate on it. This
situation is surprisingly difficult to avoid using only standard CRUSH rules.
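One way to see whether a PG's replicas actually span both data centers is to
map the PG to its OSDs and then look up each OSD's CRUSH location. This is only
a sketch; the PG ID and OSD ID below are placeholders:

.. prompt:: bash $

   ceph pg map 1.0     # shows the up and acting OSD sets for PG 1.0
   ceph osd find 3     # reports the CRUSH location (including datacenter) of osd.3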
Individual Stretch Pools
========================
Individual stretch pools allow specific pools to be distributed across two or
more data centers. This is achieved by running the ``ceph osd pool stretch
set`` command on each desired pool, as opposed to applying a cluster-wide
configuration with stretch mode. See :ref:`setting_values_for_a_stretch_pool`.

Use stretch mode when you have exactly two data centers and require a uniform
configuration across the entire cluster. Conversely, opt for individual stretch
pools when you need a particular pool to be replicated across more than two
data centers, which provides more granular control and supports a larger
cluster.
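The following is only a sketch of per-pool stretch configuration. The argument
list for ``ceph osd pool stretch set`` (peering bucket count, target, barrier
bucket type, CRUSH rule, ``size``, and ``min_size``) may differ between
releases, and the pool name, rule name, and values below are illustrative; see
:ref:`setting_values_for_a_stretch_pool` for the authoritative syntax:

.. prompt:: bash $

   ceph osd pool stretch set mypool 2 3 datacenter stretch_rule 6 3
   ceph osd pool stretch show mypool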
Limitations
-----------
Individual stretch pools do not support I/O operations during a netsplit
between two or more zones. While the cluster remains accessible for basic Ceph
commands, I/O remains unavailable until the netsplit is resolved. This is
different from stretch mode, where the tiebreaker monitor can isolate one zone
of the cluster and I/O continues in degraded mode during the netsplit. See
:ref:`stretch_mode1`.
Ceph is designed to tolerate multiple host failures. However, if more than 25%
of the OSDs in the cluster go down, Ceph may stop marking OSDs ``out``, which
prevents rebalancing and can leave some PGs ``inactive``. This behavior is
controlled by the ``mon_osd_min_in_ratio`` parameter. By default,
``mon_osd_min_in_ratio`` is set to ``0.75``, meaning that at least 75% of the
OSDs in the cluster must remain ``in`` before any additional OSDs can be marked
``out``. This setting prevents too many OSDs from being marked ``out``, because
doing so might lead to significant data movement. That data movement can cause
high client I/O impact and long recovery times when the OSDs are returned to
service. If Ceph stops marking OSDs ``out``, some PGs may fail to rebalance to
the surviving OSDs, potentially leaving those PGs ``inactive``.
See https://tracker.ceph.com/issues/68338 for more information.
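If you understand the trade-offs described above, ``mon_osd_min_in_ratio`` can
be inspected and adjusted at runtime. The value ``0.5`` below is only an
example, not a recommendation:

.. prompt:: bash $

   ceph config get mon mon_osd_min_in_ratio
   ceph config set mon mon_osd_min_in_ratio 0.5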
.. _stretch_mode1:
Stretch Mode
============
Stretch mode is designed to handle netsplit scenarios between two data zones as well
as the loss of one data zone. It handles the netsplit scenario by choosing the surviving zone
that has the better connection to the ``tiebreaker monitor``. It handles the loss of one zone by
reducing the ``size`` to ``2`` and ``min_size`` to ``1``, allowing the cluster to continue operating
with the remaining zone. When the lost zone comes back, the cluster will recover the lost data
and return to normal operation.
Connectivity Monitor Election Strategy
---------------------------------------
When using stretch mode, the monitor election strategy must be set to ``connectivity``.
This strategy tracks network connectivity between the monitors and is
used to determine which zone should be favored when the cluster is in a netsplit scenario.
See `Changing Monitor Elections`_
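For example, the election strategy can be set and then verified from the
monitor map (the exact formatting of the ``ceph mon dump`` output may vary by
release):

.. prompt:: bash $

   ceph mon set election_strategy connectivity
   ceph mon dump | grep election_strategy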
Stretch Peering Rule
--------------------
One critical behavior of stretch mode is its ability to prevent a PG from going active if the acting set
contains only replicas from a single zone. This safeguard is crucial for mitigating the risk of data
loss during site failures because if a PG were allowed to go active with replicas only in a single site,
writes could be acknowledged despite a lack of redundancy. In the event of a site failure, all data in the
affected PG would be lost.
Entering Stretch Mode
---------------------
To enable stretch mode, you must set the location of each monitor, matching
your CRUSH map. This procedure shows how to do this.
#. Place ``mon.a`` in your first data center:
.. prompt:: bash $
ceph mon set_location a datacenter=site1
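# Illustrative only: monitor names and data center names depend on your cluster.
# Repeat the command so that every non-tiebreaker monitor has a location, e.g.:
ceph mon set_location b datacenter=site1
ceph mon set_location c datacenter=site2
ceph mon set_location d datacenter=site2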
#. Generate a CRUSH rule that places two copies in each data center.
This requires editing the CRUSH map directly:
.. prompt:: bash $
ceph osd getcrushmap > crush.map.bin
crushtool -d crush.map.bin -o crush.map.txt
#. Edit the ``crush.map.txt`` file to add a new rule. Here there is only one
other rule, so we use ``id 1``, but you might need to use a different rule ID.
We have two data-center buckets named ``site1`` and ``site2``:
::
rule stretch_rule {
id 1
type replicated
step take site1
step chooseleaf firstn 2 type host
step emit
step take site2
step chooseleaf firstn 2 type host
step emit
}
.. warning:: If a CRUSH rule defined for a stretch mode cluster contains
   multiple "take" steps, then ``MAX AVAIL`` for the pools associated with that
   rule will report only the space available in a single data center, not the
   space actually available to those pools across both data centers.
For example, consider a cluster with two CRUSH rules, ``stretch_rule`` and
``stretch_replicated_rule``::
rule stretch_rule {
id 1
type replicated
step take DC1
step chooseleaf firstn 2 type host
step emit
step take DC2
step chooseleaf firstn 2 type host
step emit
}
rule stretch_replicated_rule {
id 2
type replicated
step take default
step choose firstn 0 type datacenter
step chooseleaf firstn 2 type host
step emit
}
In the above example, ``stretch_rule`` will report an incorrect value for
``MAX AVAIL``. ``stretch_replicated_rule`` will report the correct value.
This is because ``stretch_rule`` is defined in such a way that
``PGMap::get_rule_avail`` considers only the available size of a single
data center, and not (as would be correct) the total available size from
both datacenters.
Here is a workaround. Instead of defining the stretch rule as defined in
the ``stretch_rule`` function above, define it as follows::
rule stretch_rule {
id 2
type replicated
step take default
step choose firstn 0 type datacenter
step chooseleaf firstn 2 type host
step emit
}
See https://tracker.ceph.com/issues/56650 for more detail on this workaround.
*The above procedure was developed in May and June of 2024 by Prashant Dhange.*
#. Inject the CRUSH map to make the rule available to the cluster:
.. prompt:: bash $
crushtool -c crush.map.txt -o crush2.map.bin
ceph osd setcrushmap -i crush2.map.bin
#. Run the monitors in connectivity mode. See `Changing Monitor Elections`_.
.. prompt:: bash $
ceph mon set election_strategy connectivity
#. Command the cluster to enter stretch mode. In this example, ``mon.e`` is the
tiebreaker monitor and we are splitting across data centers. The tiebreaker
monitor must be assigned a data center that is neither ``site1`` nor
``site2``. This data center **should not** be defined in your CRUSH map; here
we place ``mon.e`` in a virtual data center called ``site3``:
.. prompt:: bash $
ceph mon set_location e datacenter=site3
ceph mon enable_stretch_mode e stretch_rule datacenter
When stretch mode is enabled, PGs will become active only when they peer
across data centers (or across whichever CRUSH bucket type was specified),
assuming both are alive. Pools will increase in size from the default ``3`` to
``4``, and two copies will be expected in each site. OSDs will be allowed to
connect to monitors only if they are in the same data center as the monitors.
New monitors will not be allowed to join the cluster if they do not specify a
location.
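After enabling stretch mode, you can verify the effect on a pool and on the
monitor map. The pool name below is an example, and the monmap field names
(such as ``tiebreaker_mon``) may vary by release:

.. prompt:: bash $

   ceph osd pool get mypool size       # expected: 4
   ceph osd pool get mypool min_size   # expected: 2
   ceph mon dump | grep -E 'stretch|tiebreaker'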
If all OSDs and monitors in one of the data centers become inaccessible at once,
the surviving data center enters a "degraded stretch mode". A warning will be
issued, the ``min_size`` will be reduced to ``1``, and the cluster will be
allowed to go active with the data in the single remaining site. The pool size
does not change, so warnings will be generated that report that the pools are
too small -- but a special stretch mode flag will prevent the OSDs from
creating extra copies in the remaining data center. This means that the data
center will keep only two copies, just as before.
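While a data center is down, you can watch the cluster state and the
stretch-mode health warnings with the usual status commands (the exact warning
text varies by release):

.. prompt:: bash $

   ceph -s
   ceph health detail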
When the missing data center comes back, the cluster will enter a "recovery
stretch mode". This changes the warning and allows peering, but requires OSDs
only from the data center that was ``up`` throughout the duration of the
downtime. When all PGs are in a known state, and are neither degraded nor
incomplete, the cluster transitions back to regular stretch mode, ends the
warning, restores ``min_size`` to its original value (``2``), requires both
sites to peer, and no longer requires the site that was up throughout the
duration of the downtime when peering (which makes failover to the other site
possible, if needed).
.. _Changing Monitor elections: ../change-mon-elections
Exiting Stretch Mode
--------------------
To exit stretch mode, run the following command:
.. prompt:: bash $
ceph mon disable_stretch_mode [{crush_rule}] --yes-i-really-mean-it
.. describe:: {crush_rule}
The CRUSH rule that the user wants all pools to move back to. If this
is not specified, the pools will move back to the default CRUSH rule.
:Type: String
:Required: No.
This command moves the cluster back to normal mode, and the cluster will no
longer be in stretch mode. All pools will have their ``size`` and ``min_size``
restored to the values they had before stretch mode was enabled. At this point
the user is responsible for scaling the cluster down to the desired number of
OSDs if they choose to operate with fewer OSDs.
Please note that the command will not execute when the cluster is in
``recovery stretch mode``. The command will only execute when the cluster
is in ``degraded stretch mode`` or ``healthy stretch mode``.
Limitations of Stretch Mode
===========================
When using stretch mode, OSDs must be located at exactly two sites.
Two monitors should be run in each data center, plus a tiebreaker in a third
(or in the cloud) for a total of five monitors. While in stretch mode, OSDs
will connect only to monitors within the data center in which they are located.
OSDs *DO NOT* connect to the tiebreaker monitor.
Erasure-coded pools cannot be used with stretch mode. Attempts to use
erasure-coded pools with stretch mode will fail, and erasure-coded pools cannot
be created while in stretch mode.
To use stretch mode, you will need to create a CRUSH rule that provides two
replicas in each data center. Ensure that there are four total replicas: two in
each data center. If pools exist in the cluster that do not have the default
``size`` or ``min_size``, Ceph will not enter stretch mode. An example of such
a CRUSH rule is given above.
Because stretch mode runs with ``min_size`` set to ``1`` when degraded, we
recommend enabling stretch mode only when using OSDs on SSDs (including NVMe
OSDs). Hybrid HDD+SSD or HDD-only OSDs are not recommended due to the long time
they take to recover after connectivity between data centers has been restored;
faster recovery reduces the potential for data loss.
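To check which device classes your OSDs report, list the defined classes and
inspect the ``CLASS`` column of the OSD tree:

.. prompt:: bash $

   ceph osd crush class ls
   ceph osd tree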
.. warning:: Device class is currently not supported in stretch mode.
For example, the following rule containing ``device class`` will not work::
rule stretch_replicated_rule {
id 2
type replicated class hdd
step take default
step choose firstn 0 type datacenter
step chooseleaf firstn 2 type host
step emit
}
In the future, stretch mode could support erasure-coded pools,
enable deployments across multiple data centers,
and accommodate various device classes.
Other commands
==============
Replacing a failed tiebreaker monitor
-------------------------------------
Turn on a new monitor and run the following command:
.. prompt:: bash $
ceph mon set_new_tiebreaker mon.<new_mon_name>
This command raises an error if the new monitor is in the same location as the
existing non-tiebreaker monitors. **This command WILL NOT remove the previous
tiebreaker monitor.** Remove the previous tiebreaker monitor yourself, for
example as shown below.
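For example, if the old tiebreaker was ``mon.e`` (the name here is only
illustrative), it can be removed once the new tiebreaker is in place:

.. prompt:: bash $

   ceph mon remove e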
Using "--set-crush-location" and not "ceph mon set_location"
------------------------------------------------------------
If you write your own tooling for deploying Ceph, use the
``--set-crush-location`` option when booting monitors instead of running ``ceph
mon set_location``. This option accepts only a single ``bucket=loc`` pair (for
example, ``ceph-mon --set-crush-location 'datacenter=a'``), and that pair must
match the bucket type that was specified when running ``enable_stretch_mode``.
Forcing recovery stretch mode
-----------------------------
When in degraded stretch mode, the cluster will go into "recovery" mode
automatically when the disconnected data center comes back. If that does not
happen or you want to enable recovery mode early, run the following command:
.. prompt:: bash $
ceph osd force_recovery_stretch_mode --yes-i-really-mean-it
Forcing normal stretch mode
---------------------------
When in recovery mode, the cluster should go back into normal stretch mode when
the PGs are healthy. If this fails to happen or if you want to force the
cross-data-center peering early and are willing to risk data downtime (or have
verified separately that all the PGs can peer, even if they aren't fully
recovered), run the following command:
.. prompt:: bash $
ceph osd force_healthy_stretch_mode --yes-i-really-mean-it
This command can be used to remove the ``HEALTH_WARN`` state that recovery mode
generates.