author     Sage Weil <sage@newdream.net>  2012-03-07 00:27:02 +0100
committer  Sage Weil <sage@newdream.net>  2012-03-07 02:05:29 +0100
commit     d72b821741bad2ccb7a14903d9df46e8388bd802 (patch)
tree       de4cab6a29d93a67440a89127930ace854b4e30f /doc
parent     osd: list might_have_unfound locations in query result (diff)
doc: document some osd failure recovery scenarios
- simple osd failure
- ceph health [detail]
- peering failure ('down') state
- unfound objects

Signed-off-by: Sage Weil <sage@newdream.net>
Diffstat (limited to 'doc')
-rw-r--r--  doc/ops/manage/disk-failure.rst        7
-rw-r--r--  doc/ops/manage/failures/index.rst     38
-rw-r--r--  doc/ops/manage/failures/mds.rst        4
-rw-r--r--  doc/ops/manage/failures/mon.rst        4
-rw-r--r--  doc/ops/manage/failures/osd.rst      196
-rw-r--r--  doc/ops/manage/failures/radosgw.rst    4
-rw-r--r--  doc/ops/manage/index.rst               2
7 files changed, 247 insertions, 8 deletions
diff --git a/doc/ops/manage/disk-failure.rst b/doc/ops/manage/disk-failure.rst
deleted file mode 100644
index 6eda1cb4d4c..00000000000
--- a/doc/ops/manage/disk-failure.rst
+++ /dev/null
@@ -1,7 +0,0 @@
-.. _recover-osd:
-
-===============================
- Recovering from disk failures
-===============================
-
-.. todo:: Also cover OSD failures, filestore node failures.
diff --git a/doc/ops/manage/failures/index.rst b/doc/ops/manage/failures/index.rst
new file mode 100644
index 00000000000..47516a24358
--- /dev/null
+++ b/doc/ops/manage/failures/index.rst
@@ -0,0 +1,38 @@
+==========================
+ Recovering from failures
+==========================
+
+The current health of the Ceph cluster, as known by the monitors, can
+be checked with the ``ceph health`` command. If all is well, you get::
+
+ $ ceph health
+ HEALTH_OK
+
+and a successful (zero) exit code. If there are problems, you will see
+something like::
+
+ $ ceph health
+ HEALTH_WARN short summary of problem(s)
+
+or::
+
+ $ ceph health
+ HEALTH_ERR short summary of very serious problem(s)
+
+To get more detail::
+
+ $ ceph health detail
+ HEALTH_WARN short description of problem
+
+ one problem
+ another problem
+ yet another problem
+ ...
+
+.. toctree::
+
+ mon
+ osd
+ mds
+ radosgw
+
diff --git a/doc/ops/manage/failures/mds.rst b/doc/ops/manage/failures/mds.rst
new file mode 100644
index 00000000000..961a9cd5614
--- /dev/null
+++ b/doc/ops/manage/failures/mds.rst
@@ -0,0 +1,4 @@
+==================================
+ Recovering from ceph-mds failure
+==================================
+
diff --git a/doc/ops/manage/failures/mon.rst b/doc/ops/manage/failures/mon.rst
new file mode 100644
index 00000000000..702d6fd88ca
--- /dev/null
+++ b/doc/ops/manage/failures/mon.rst
@@ -0,0 +1,4 @@
+==================================
+ Recovering from ceph-mon failure
+==================================
+
diff --git a/doc/ops/manage/failures/osd.rst b/doc/ops/manage/failures/osd.rst
new file mode 100644
index 00000000000..739b38a93be
--- /dev/null
+++ b/doc/ops/manage/failures/osd.rst
@@ -0,0 +1,196 @@
+==================================
+ Recovering from ceph-osd failure
+==================================
+
+Single ceph-osd failure
+=======================
+
+When a ceph-osd process dies, the monitors will learn about the failure
+from surviving ceph-osd daemons and report it via the ``ceph health``
+command::
+
+ $ ceph health
+ HEALTH_WARN 1/3 in osds are down
+
+Specifically, you will get a warning whenever there are ceph-osd
+processes that are marked in and down. You can identify which
+ceph-osds are down with::
+
+ $ ceph health detail
+ HEALTH_WARN 1/3 in osds are down
+ osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+
+Under normal circumstances, simply restarting the ceph-osd daemon will
+allow it to rejoin the cluster and recover. If there is a disk
+failure or other fault preventing ceph-osd from functioning or
+restarting, an error message should be present in its log file in
+``/var/log/ceph``.
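+
+For example, with a sysvinit-style installation you could restart the
+daemon on its host as shown below (the service script path and the
+osd id are assumptions; adjust them for your distribution and
+cluster)::
+
+ # assumes the stock sysvinit script; substitute your init system and osd id
+ $ sudo /etc/init.d/ceph start osd.0
+
+and then re-run ``ceph health`` to confirm that the daemon has
+rejoined the cluster.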
+
+
+Homeless placement groups (PGs)
+===============================
+
+It is possible for all OSDs that had copies of a given PG to fail. If
+that's the case, that subset of the object store is unavailable, and
+the monitor will receive no status updates for those PGs. To detect
+this situation, the monitor marks any PG whose primary OSD has failed
+as `stale`. For example::
+
+ $ ceph health
+ HEALTH_WARN 24 pgs stale; 3/3 in osds are down
+
+You can identify which PGs are stale, and what the last OSDs to store
+them were, with::
+
+ $ ceph health detail
+ HEALTH_WARN 24 pgs stale; 3/3 in osds are down
+ ...
+ pg 2.5 is stuck stale+active+remapped, last acting [2,0]
+ ...
+ osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
+ osd.1 is down since epoch 13, last address 192.168.106.220:6803/11539
+ osd.2 is down since epoch 24, last address 192.168.106.220:6806/11861
+
+For example, if we want to get PG 2.5 back online, this tells us that
+it was last managed by osd.0 and osd.2. Restarting those ceph-osd
+daemons will allow the cluster to recover that PG (and, presumably,
+many others).
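+
+If you would rather see every stuck stale PG in one listing, the
+``ceph`` tool can dump them directly (whether this subcommand is
+available depends on your Ceph version)::
+
+ # may not exist on older releases
+ $ ceph pg dump_stuck stale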
+
+
+
+PG down (peering failure)
+=========================
+
+In certain cases, the ceph-osd "peering" process can run into
+problems, preventing a PG from becoming active and usable. For
+example, ``ceph health`` might report::
+
+ $ ceph health detail
+ HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
+ ...
+ pg 0.5 is down+peering
+ pg 1.4 is down+peering
+ ...
+ osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651
+
+We can query the cluster to determine exactly why the PG is marked ``down`` with::
+
+ $ ceph pg 0.5 query
+ { "state": "down+peering",
+ ...
+ "recovery_state": [
+ { "name": "Started\/Primary\/Peering\/GetInfo",
+ "enter_time": "2012-03-06 14:40:16.169679",
+ "requested_info_from": []},
+ { "name": "Started\/Primary\/Peering",
+ "enter_time": "2012-03-06 14:40:16.169659",
+ "probing_osds": [
+ 0,
+ 1],
+ "blocked": "peering is blocked due to down osds",
+ "down_osds_we_would_probe": [
+ 1],
+ "peering_blocked_by": [
+ { "osd": 1,
+ "current_lost_at": 0,
+ "comment": "starting or marking this osd lost may let us proceed"}]},
+ { "name": "Started",
+ "enter_time": "2012-03-06 14:40:16.169513"}]}
+
+The ``recovery_state`` section tells us that peering is blocked due to
+down ceph-osd daemons, specifically osd.1. In this case, we can start that ceph-osd
+and things will recover.
+
+Alternatively, if there is a catastrophic failure of osd.1 (e.g., disk
+failure), we can tell the cluster that it is "lost" and to cope as
+best it can. Note that this is dangerous, as the cluster cannot
+guarantee that the other copies of the data are consistent and up to
+date. To instruct Ceph to continue anyway::
+
+ $ ceph osd lost 1
+
+and recovery will proceed.
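+
+To follow the recovery as it happens, you can stream cluster status
+updates to your terminal::
+
+ $ ceph -w
+
+or simply re-run ``ceph health detail`` until the affected PGs leave
+the ``down`` state.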
+
+
+Unfound objects
+===============
+
+Under certain combinations of failures Ceph may complain about
+"unfound" objects::
+
+ $ ceph health detail
+ HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
+ pg 2.4 is active+degraded, 78 unfound
+
+This means that the storage cluster knows that some objects (or newer
+copies of existing objects) exist, but it hasn't found copies of them.
+
+First, you can identify which objects are unfound with::
+
+ $ ceph pg 2.4 list_missing [starting offset, in json]
+
+ { "offset": { "oid": "",
+ "key": "",
+ "snapid": 0,
+ "hash": 0,
+ "max": 0},
+ "num_missing": 0,
+ "num_unfound": 0,
+ "objects": [
+ { "oid": "object 1",
+ "key": "",
+ "hash": 0,
+ "max": 0 },
+ ...
+ ],
+ "more": 0}
+
+If there are too many objects to list in a single result, the ``more``
+field will be true and you can query for more. (Eventually the
+command line tool will hide this from you, but not yet.)
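+
+For example, to fetch the next batch you can pass the ``offset``
+object from the previous reply back in as the starting offset (the
+exact shape of this argument is an assumption based on the reply
+shown above)::
+
+ # illustrative offset; mirrors the "offset" object in the previous reply
+ $ ceph pg 2.4 list_missing '{"oid":"object 1","key":"","snapid":0,"hash":0,"max":0}'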
+
+Second, you can identify which OSDs have been probed or might contain
+data::
+
+ $ ceph pg 2.4 query
+ ...
+ "recovery_state": [
+ { "name": "Started\/Primary\/Active",
+ "enter_time": "2012-03-06 15:15:46.713212",
+ "might_have_unfound": [
+ { "osd": 1,
+ "status": "osd is down"}]},
+
+In this case, for example, the cluster knows that ``osd.1`` might have
+data, but it is down. The full range of possible states is::
+
+ * already probed
+ * querying
+ * osd is down
+ * not queried (yet)
+
+Sometimes it simply takes some time for the cluster to query possible
+locations.
+
+It is possible that there are other locations where the object might
+exist that are not listed. For example, if a ceph-osd is stopped and
+taken out of the cluster, the cluster fully recovers, and a later set
+of failures then results in an unfound object, the cluster will not
+consider the long-departed ceph-osd as a potential location to check.
+(This scenario, however, is unlikely.)
+
+If all possible locations have been queried and objects are still
+lost, you may have to give up on the lost objects. This, again, is
+possible only with unusual combinations of failures that allow the
+cluster to learn that writes were performed before the writes
+themselves can be recovered. To mark the "unfound" objects as "lost"::
+
+ $ ceph pg 2.4 mark_unfound_lost revert
+
+The final argument specifies how the cluster should deal with lost
+objects. Currently the only supported option is "revert", which will
+either roll back to a previous version of the object or (if it was a
+new object) forget about it entirely. Use this with caution, as it may
+confuse applications that expected the object to exist.
+
+
+
diff --git a/doc/ops/manage/failures/radosgw.rst b/doc/ops/manage/failures/radosgw.rst
new file mode 100644
index 00000000000..2c16dd03a9a
--- /dev/null
+++ b/doc/ops/manage/failures/radosgw.rst
@@ -0,0 +1,4 @@
+=================================
+ Recovering from radosgw failure
+=================================
+
diff --git a/doc/ops/manage/index.rst b/doc/ops/manage/index.rst
index 0841d63d39d..85ba12d3255 100644
--- a/doc/ops/manage/index.rst
+++ b/doc/ops/manage/index.rst
@@ -6,7 +6,7 @@
key
grow/index
- disk-failure
+ failures/index
pool
cephfs