Diffstat (limited to 'doc/monitoring/index.rst')
-rw-r--r-- doc/monitoring/index.rst 24
1 files changed, 24 insertions, 0 deletions
diff --git a/doc/monitoring/index.rst b/doc/monitoring/index.rst
index 794fdf84195..afccd9ab16a 100644
--- a/doc/monitoring/index.rst
+++ b/doc/monitoring/index.rst
@@ -64,6 +64,30 @@ in:
It is good to outline that the main tool allowing users to observe and monitor a Ceph cluster is the **Ceph dashboard**. It provides graphics where the most important cluster and service metrics are represented. Most of the examples in this document are extracted from the dashboard graphics or extrapolated from the metrics exposed by the Ceph dashboard.
+Ceph daemon health metrics
+==========================
+
+The Ceph exporter provides a metric called ``ceph_daemon_socket_up`` that reports the liveness status of each Ceph daemon that exposes an admin socket.
+
+The value of ``ceph_daemon_socket_up`` reflects whether the daemon responds on its admin socket: ``1`` means the daemon responded (healthy), and ``0`` means it did not (unhealthy). A daemon that reports ``ceph_daemon_socket_up=0`` might still be running, but an unresponsive admin socket indicates a significant functional problem. This makes the metric a useful tool for detecting problems in any of the main Ceph daemons.
+
+Labels:
+
+- ``ceph_daemon``: Identifier of the Ceph daemon exposing an admin socket on the host.
+- ``hostname``: Name of the host where the Ceph daemon is running.
+
+Example:
+
+.. code-block:: text
+
+ ceph_daemon_socket_up{ceph_daemon="mds.a",hostname="testhost"} 1
+ ceph_daemon_socket_up{ceph_daemon="osd.1",hostname="testhost"} 0
+
+To identify any Ceph daemons that were not responsive at any point in the last 12 hours, you can use the following PromQL expression:
+
+.. code-block:: promql
+
+ ceph_daemon_socket_up == 0 or min_over_time(ceph_daemon_socket_up[12h]) == 0
+
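+As a further sketch, not taken from the Ceph dashboard or its shipped alerts, the same metric can be aggregated on its ``hostname`` label to count the daemons on each host whose admin socket is currently unresponsive:
+
+.. code-block:: promql
+
+   # Hypothetical example: per-host count of daemons currently reporting 0
+   count by (hostname) (ceph_daemon_socket_up == 0)
+
+Hosts on which every daemon responds produce no series in the result, because the ``== 0`` filter drops those samples before they are counted.
+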
Performance metrics
===================