summaryrefslogtreecommitdiffstats
path: root/doc/cephfs/standby.rst
blob: a64c3e41c8abe42d08d6b38cf44a7fa9b43d5fa3 (plain)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
.. _mds-standby:

Terminology
-----------

A Ceph cluster may have zero or more CephFS *file systems*.  Each CephFS has
a human readable name (set at creation time with ``fs new``) and an integer
ID.  The ID is called the file system cluster ID, or *FSCID*.

Each CephFS file system has a number of *ranks*, numbered beginning with zero.
By default there is one rank per file system.  A rank may be thought of as a
metadata shard.  Management of ranks is described in :doc:`/cephfs/multimds` .

Each CephFS ``ceph-mds`` daemon starts without a rank.  It may be assigned one
by the cluster's monitors. A daemon may only hold one rank at a time, and only
give up a rank when the ``ceph-mds`` process stops.

If a rank is not associated with any daemon, that rank is considered ``failed``.
Once a rank is assigned to a daemon, the rank is considered ``up``.

Each ``ceph-mds`` daemon has a *name* that is assigned statically by the
administrator when the daemon is first configured.  Each daemon's *name* is
typically that of the hostname where the process runs.

A ``ceph-mds`` daemon may be assigned to a specific file system by
setting its ``mds_join_fs`` configuration option to the file system's
``name``.

When a ``ceph-mds`` daemon starts, it is also assigned an integer ``GID``,
which is unique to this current daemon's process.  In other words, when a
``ceph-mds`` daemon is restarted, it runs as a new process and is assigned a
*new* ``GID`` that is different from that of the previous process.

Referring to MDS daemons
------------------------

Most administrative commands that refer to a ``ceph-mds`` daemon (MDS)
accept a flexible argument format that may specify a ``rank``, a ``GID``
or a ``name``.

Where a ``rank`` is used, it  may optionally be qualified by
a leading file system ``name`` or ``GID``.  If a daemon is a standby (i.e.
it is not currently assigned a ``rank``), then it may only be
referred to by ``GID`` or ``name``.

For example, say we have an MDS daemon with ``name`` 'myhost' and
``GID`` 5446, and which is assigned ``rank`` 0 for the file system 'myfs'
with ``FSCID`` 3.  Any of the following are suitable forms of the ``fail``
command:

::

    ceph mds fail 5446     # GID
    ceph mds fail myhost   # Daemon name
    ceph mds fail 0        # Unqualified rank
    ceph mds fail 3:0      # FSCID and rank
    ceph mds fail myfs:0   # File System name and rank

Managing failover
-----------------

If an MDS daemon stops communicating with the cluster's monitors, the monitors
will wait ``mds_beacon_grace`` seconds (default 15) before marking the daemon as
*laggy*.  If a standby MDS is available, the monitor will immediately replace the
laggy daemon.

Each file system may specify a minimum number of standby daemons in order to be
considered healthy. This number includes daemons in the ``standby-replay`` state
waiting for a ``rank`` to fail. (Note, the monitors will not assign a
``standby-replay`` daemon to take over a failure for another ``rank`` or a
failure in a different CephFS file system). The pool of standby daemons not in
``replay`` counts towards any file system count.  Each file system may set the
desired number of standby daemons by:

::

    ceph fs set <fs name> standby_count_wanted <count>

Setting ``count`` to 0 will disable the health check.


.. _mds-standby-replay:

Configuring standby-replay
--------------------------

Each CephFS file system may be configured to add ``standby-replay`` daemons.
These standby daemons follow the active MDS's metadata journal in order to
reduce failover time in the event that the active MDS becomes unavailable. Each
active MDS may have only one ``standby-replay`` daemon following it.

Configuration of ``standby-replay`` on a file system is done using the below:

::

    ceph fs set <fs name> allow_standby_replay <bool>

Once set, the monitors will assign available standby daemons to follow the
active MDSs in that file system.

Once an MDS has entered the ``standby-replay`` state, it will only be used as a
standby for the ``rank`` that it is following. If another ``rank`` fails, this
``standby-replay`` daemon will not be used as a replacement, even if no other
standbys are available. For this reason, it is advised that if ``standby-replay``
is used then *every* active MDS should have a ``standby-replay`` daemon.

.. _mds-join-fs:

Configuring MDS file system affinity
------------------------------------

You might elect to dedicate an MDS to a particular file system. Or, perhaps you
have MDSs that run on better hardware that should be preferred over a last-resort
standby on modest or over-provisioned systems. To configure this preference,
CephFS provides a configuration option for MDS called ``mds_join_fs`` which
enforces this affinity.

When failing over MDS daemons, a cluster's monitors will prefer standby daemons with
``mds_join_fs`` equal to the file system ``name`` with the failed ``rank``.  If no
standby exists with ``mds_join_fs`` equal to the file system ``name``, it will
choose an unqualified standby (no setting for ``mds_join_fs``) for the replacement.
As a last resort, a standby for another filesystem will be chosen, although this
behavior can be disabled:

::

    ceph fs set <fs name> refuse_standby_for_another_fs true

Note, configuring MDS file system affinity does not change the behavior that
``standby-replay`` daemons are always selected before other standbys.

Even further, the monitors will regularly examine the CephFS file systems even when
stable to check if a standby with stronger affinity is available to replace an
MDS with lower affinity. This process is also done for ``standby-replay`` daemons:
if a regular standby has stronger affinity than the ``standby-replay`` MDS, it will
replace the standby-replay MDS.

For example, given this stable and healthy file system:

::

    $ ceph fs dump
    dumped fsmap epoch 399
    ...
    Filesystem 'cephfs' (27)
    ...
    e399
    max_mds 1
    in      0
    up      {0=20384}
    failed
    damaged
    stopped
    ...
    [mds.a{0:20384} state up:active seq 239 addr [v2:127.0.0.1:6854/966242805,v1:127.0.0.1:6855/966242805]]

    Standby daemons:

    [mds.b{-1:10420} state up:standby seq 2 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]


You may set ``mds_join_fs`` on the standby to enforce your preference: ::

    $ ceph config set mds.b mds_join_fs cephfs

after automatic failover: ::

    $ ceph fs dump
    dumped fsmap epoch 405
    e405
    ...
    Filesystem 'cephfs' (27)
    ...
    max_mds 1
    in      0
    up      {0=10420}
    failed
    damaged
    stopped
    ...
    [mds.b{0:10420} state up:active seq 274 join_fscid=27 addr [v2:127.0.0.1:6856/2745199145,v1:127.0.0.1:6857/2745199145]]

    Standby daemons:

    [mds.a{-1:10720} state up:standby seq 2 addr [v2:127.0.0.1:6854/1340357658,v1:127.0.0.1:6855/1340357658]]

Note in the above example that ``mds.b`` now has ``join_fscid=27``. In this
output, the file system name from ``mds_join_fs`` is changed to the file system
identifier (27). If the file system is recreated with the same name, the
standby will follow the new file system as expected.

Finally, if the file system is degraded or undersized, no failover will occur
to enforce ``mds_join_fs``.