author | Sage Weil <sage@newdream.net> | 2012-03-07 00:27:02 +0100 |
---|---|---|
committer | Sage Weil <sage@newdream.net> | 2012-03-07 02:05:29 +0100 |
commit | d72b821741bad2ccb7a14903d9df46e8388bd802 (patch) | |
tree | de4cab6a29d93a67440a89127930ace854b4e30f /doc | |
parent | osd: list might_have_unfound locations in query result (diff) | |
download | ceph-d72b821741bad2ccb7a14903d9df46e8388bd802.tar.xz ceph-d72b821741bad2ccb7a14903d9df46e8388bd802.zip |
doc: document some osd failure recovery scenarios
- simple osd failure
- ceph health [detail]
- peering failure ('down') state
- unfound objects
Signed-off-by: Sage Weil <sage@newdream.net>
Diffstat (limited to 'doc')
-rw-r--r-- | doc/ops/manage/disk-failure.rst | 7
-rw-r--r-- | doc/ops/manage/failures/index.rst | 38
-rw-r--r-- | doc/ops/manage/failures/mds.rst | 4
-rw-r--r-- | doc/ops/manage/failures/mon.rst | 4
-rw-r--r-- | doc/ops/manage/failures/osd.rst | 196
-rw-r--r-- | doc/ops/manage/failures/radosgw.rst | 4
-rw-r--r-- | doc/ops/manage/index.rst | 2
7 files changed, 247 insertions, 8 deletions
diff --git a/doc/ops/manage/disk-failure.rst b/doc/ops/manage/disk-failure.rst
deleted file mode 100644
index 6eda1cb4d4c..00000000000
--- a/doc/ops/manage/disk-failure.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-.. _recover-osd:
-
-===============================
- Recovering from disk failures
-===============================
-
-.. todo:: Also cover OSD failures, filestore node failures.
diff --git a/doc/ops/manage/failures/index.rst b/doc/ops/manage/failures/index.rst
new file mode 100644
index 00000000000..47516a24358
--- /dev/null
+++ b/doc/ops/manage/failures/index.rst
@@ -0,0 +1,38 @@
+==========================
+ Recovering from failures
+==========================
+
+The current health of the Ceph cluster, as known by the monitors, can
+be checked with the ``ceph health`` command. If all is well, you get::
+
+  $ ceph health
+  HEALTH_OK
+
+and a success exit code. If there are problems, you will see
+something like::
+
+  $ ceph health
+  HEALTH_WARN short summary of problem(s)
+
+or::
+
+  $ ceph health
+  HEALTH_ERR short summary of very serious problem(s)
+
+To get more detail::
+
+  $ ceph health detail
+  HEALTH_WARN short description of problem
+
+  one problem
+  another problem
+  yet another problem
+  ...
+
+.. toctree::
+
+   mon
+   osd
+   mds
+   radosgw
+
diff --git a/doc/ops/manage/failures/mds.rst b/doc/ops/manage/failures/mds.rst
new file mode 100644
index 00000000000..961a9cd5614
--- /dev/null
+++ b/doc/ops/manage/failures/mds.rst
@@ -0,0 +1,4 @@
+==================================
+ Recovering from ceph-mds failure
+==================================
+
diff --git a/doc/ops/manage/failures/mon.rst b/doc/ops/manage/failures/mon.rst
new file mode 100644
index 00000000000..702d6fd88ca
--- /dev/null
+++ b/doc/ops/manage/failures/mon.rst
@@ -0,0 +1,4 @@
+==================================
+ Recovering from ceph-mon failure
+==================================
+
diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst
new file mode 100644
index 00000000000..739b38a93be
--- /dev/null
+++ b/doc/ops/manage/failures/osd.rst
@@ -0,0 +1,196 @@
+==================================
+ Recovering from ceph-osd failure
+==================================
+
+Single ceph-osd failure
+=======================
+
+When a ceph-osd process dies, the monitor will learn about the failure
+from its peers and report it via the ``ceph health`` command::
+
+  $ ceph health
+  HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are ceph-osd
+processes that are marked in and down. You can identify which
+ceph-osds are down with::
+
+  $ ceph health detail
+  HEALTH_WARN 1/3 in osds are down
+  osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+Under normal circumstances, simply restarting the ceph-osd daemon will
+allow it to rejoin the cluster and recover. If there is a disk
+failure or other fault preventing ceph-osd from functioning or
+restarting, an error message should be present in its log file in
+``/var/log/ceph``.
+
+
+Homeless placement groups (PGs)
+===============================
+
+It is possible for all OSDs that had copies of a given PG to fail. If
+that's the case, that subset of the object store is unavailable, and
+the monitor will receive no status updates for those PGs. To detect
+this situation, the monitor marks any PG whose primary OSD has failed
+as `stale`. For example::
+
+  $ ceph health
+  HEALTH_WARN 24 pgs stale; 3/3 in osds are down
+
+You can identify which PGs are stale, and what the last OSDs to store
+them were, with::
+
+  $ ceph health detail
+  HEALTH_WARN 24 pgs stale; 3/3 in osds are down
+  ...
+  pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+  ...
+  osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+  osd.1 is down since epoch 13, last address 192.168.106.220:6803/11539
+  osd.2 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+If we want to get PG 2.5 back online, for example, this tells us that
+it was last managed by ceph-osds 0 and 2. Restarting those ceph-osd
+daemons will allow the cluster to recover that PG (and, presumably,
+many others).
+
+
+
+PG down (peering failure)
+=========================
+
+In certain cases, the ceph-osd "peering" process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+  $ ceph health detail
+  HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+  ...
+  pg 0.5 is down+peering
+  pg 1.4 is down+peering
+  ...
+  osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+  $ ceph pg 0.5 query
+  { "state": "down+peering",
+    ...
+    "recovery_state": [
+        { "name": "Started\/Primary\/Peering\/GetInfo",
+          "enter_time": "2012-03-06 14:40:16.169679",
+          "requested_info_from": []},
+        { "name": "Started\/Primary\/Peering",
+          "enter_time": "2012-03-06 14:40:16.169659",
+          "probing_osds": [
+                0,
+                1],
+          "blocked": "peering is blocked due to down osds",
+          "down_osds_we_would_probe": [
+                1],
+          "peering_blocked_by": [
+                { "osd": 1,
+                  "current_lost_at": 0,
+                  "comment": "starting or marking this osd lost may let us proceed"}]},
+        { "name": "Started",
+          "enter_time": "2012-03-06 14:40:16.169513"}]}
+
+The ``recovery_state`` section tells us that peering is blocked due to
+down ceph-osd daemons, specifically osd.1. In this case, we can start that ceph-osd
+and things will recover.
+
+Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk
+failure), we can tell the cluster that it is "lost" and to cope as
+best it can. Note that this is dangerous in that the cluster cannot
+guarantee that the other copies of the data are consistent and up to
+date. To instruct Ceph to continue anyway::
+
+  $ ceph osd lost 1
+
+and recovery will proceed.
+
+
+Unfound objects
+===============
+
+Under certain combinations of failures Ceph may complain about
+"unfound" objects::
+
+  $ ceph health detail
+  HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+  pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer
+copies of existing objects) exist, but it hasn't found copies of them.
+
+First, you can identify which objects are unfound with::
+
+  $ ceph pg 2.4 list_missing [starting offset, in json]
+
+  { "offset": { "oid": "",
+        "key": "",
+        "snapid": 0,
+        "hash": 0,
+        "max": 0},
+    "num_missing": 0,
+    "num_unfound": 0,
+    "objects": [
+        { "oid": "object 1",
+          "key": "",
+          "hash": 0,
+          "max": 0 },
+        ...
+    ],
+    "more": 0}
+
+If there are too many objects to list in a single result, the ``more``
+field will be true and you can query for more. (Eventually the
+command line tool will hide this from you, but not yet.)
+
+Second, you can identify which OSDs have been probed or might contain
+data::
+
+  $ ceph pg 2.4 query
+  ...
+  "recovery_state": [
+      { "name": "Started\/Primary\/Active",
+        "enter_time": "2012-03-06 15:15:46.713212",
+        "might_have_unfound": [
+            { "osd": 1,
+              "status": "osd is down"}]},
+
+In this case, for example, the cluster knows that ``osd.1`` might have
+data, but it is down. The full range of possible states includes::
+
+  * already probed
+  * querying
+  * osd is down
+  * not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object can
+exist that are not listed. For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and due to some
+future set of failures ends up with an unfound object, it won't
+consider the long-departed ceph-osd as a potential location. (This
+scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible given unusual combinations of failures that allow the cluster
+to learn about writes that were performed before the writes themselves
+are recovered. To mark the "unfound" objects as "lost"::
+
+  $ ceph pg 2.4 mark_unfound_lost revert
+
+The final argument specifies how the cluster should deal with
+lost objects. Currently the only supported option is "revert", which
+will either roll back to a previous version of the object or (if it
+was a new object) forget about it entirely. Use this with caution, as
+it may confuse applications that expected the object to exist.
+
+
+
diff --git a/doc/ops/manage/failures/radosgw.rst b/doc/ops/manage/failures/radosgw.rst
new file mode 100644
index 00000000000..2c16dd03a9a
--- /dev/null
+++ b/doc/ops/manage/failures/radosgw.rst
@@ -0,0 +1,4 @@
+=================================
+ Recovering from radosgw failure
+=================================
+
diff --git a/doc/ops/manage/index.rst b/doc/ops/manage/index.rst
index 0841d63d39d..85ba12d3255 100644
--- a/doc/ops/manage/index.rst
+++ b/doc/ops/manage/index.rst
@@ -6,7 +6,7 @@
   key
   grow/index
-  disk-failure
+  failures/index
   pool
   cephfs
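The ``ceph health`` / ``ceph health detail`` workflow documented in the new osd.rst above lends itself to simple scripting. Below is a minimal sketch, in Python, of how a monitoring script might surface the down-OSD and stale-PG lines shown in the examples; it assumes only that the ``ceph`` CLI is on the PATH and that the detail output looks like the samples above (the ``health_detail`` helper is illustrative, not part of Ceph).

    # Minimal sketch: surface down OSDs and stale PGs from `ceph health detail`.
    # Assumes the `ceph` CLI is installed; the parsing is based only on the
    # example output shown above and may differ across Ceph versions.
    import subprocess

    def health_detail():
        """Return `ceph health detail` output as a list of lines."""
        out = subprocess.check_output(["ceph", "health", "detail"], text=True)
        return out.splitlines()

    if __name__ == "__main__":
        lines = health_detail()
        print("summary:", lines[0] if lines else "HEALTH_OK")
        for line in lines[1:]:
            # e.g. "osd.0 is down since epoch 23, last address ..." or
            #      "pg 2.5 is stuck stale+active+remapped, last acting [2,0]"
            if " is down " in line or "stale" in line:
                print("needs attention:", line.strip())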
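Similarly, the ``ceph pg <pgid> query`` output shown above is JSON, so the ``recovery_state`` section that explains a ``down`` PG or lists ``might_have_unfound`` locations can be inspected programmatically. A rough sketch, assuming the output parses as JSON the way the examples above do (the ``explain_pg`` helper is an illustration, not a Ceph API):

    # Sketch: report why a PG is stuck, using the `recovery_state` fields
    # ("peering_blocked_by", "might_have_unfound") shown in the examples above.
    import json
    import subprocess

    def pg_query(pgid):
        """Run `ceph pg <pgid> query` and parse its JSON output."""
        out = subprocess.check_output(["ceph", "pg", pgid, "query"], text=True)
        return json.loads(out)

    def explain_pg(pgid):
        info = pg_query(pgid)
        print("%s state: %s" % (pgid, info.get("state")))
        for stage in info.get("recovery_state", []):
            for blocker in stage.get("peering_blocked_by", []):
                # e.g. "starting or marking this osd lost may let us proceed"
                print("  peering blocked by osd.%s: %s"
                      % (blocker["osd"], blocker["comment"]))
            for loc in stage.get("might_have_unfound", []):
                # e.g. osd.1: "osd is down" / "already probed" / "querying"
                print("  might have unfound data on osd.%s: %s"
                      % (loc["osd"], loc["status"]))

    if __name__ == "__main__":
        explain_pg("0.5")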
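Likewise, the ``list_missing`` output described above can be consumed from a script. The sketch below fetches one batch of unfound objects and checks the ``more`` field; the offset-passing convention is an assumption based on the ``[starting offset, in json]`` placeholder in the documentation, so treat it as illustrative only.

    # Sketch: list one batch of unfound objects for a PG via
    # `ceph pg <pgid> list_missing` and report whether more remain
    # (the "more" field documented above).
    import json
    import subprocess

    def unfound_objects(pgid, offset=None):
        """Return (objects, more) from one `ceph pg <pgid> list_missing` call.

        `offset` is the optional JSON starting offset; passing one is assumed
        (not confirmed by the text above) to resume a previous listing.
        """
        cmd = ["ceph", "pg", pgid, "list_missing"]
        if offset is not None:
            cmd.append(json.dumps(offset))
        result = json.loads(subprocess.check_output(cmd, text=True))
        return result.get("objects", []), result.get("more", 0)

    if __name__ == "__main__":
        objects, more = unfound_objects("2.4")
        for obj in objects:
            print("unfound:", obj.get("oid"))
        if more:
            print("more unfound objects remain; query again with a later offset")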