mgr/prometheus: Update rule format and enhance SNMP support

Rules now adhere to the format defined by Prometheus.io. This changes alert naming, and each alert now includes a summary description to provide a quick one-liner.

In addition to the reformatting, some missing alerts for MDS and cephadm have been added, along with corresponding tests.

The MIB has also been refactored so that it now passes standard lint tests, and a README is included to help developers understand the OID schema.

Fixes: https://tracker.ceph.com/issues/53111
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
author: Paul Cuzner <pcuzner@redhat.com> 2021-11-03 03:24:20 +0100
committer: Paul Cuzner <pcuzner@redhat.com> 2021-11-04 23:24:25 +0100
commit: 7ffcbd7f7955b443ffd4293a497b2a99180a3ad2 (patch)
tree: 6461bbaf0cb79b683459a6758350cb677ab4ec10 /monitoring/snmp
parent: Merge pull request #31877 from rosinL/wip-fix-dpdk-link (diff)
2 files changed, 385 insertions, 18 deletions
diff --git a/monitoring/snmp/CEPH-MIB.txt b/monitoring/snmp/CEPH-MIB.txt
new file mode 100644
index 00000000000..f54cb361037
--- /dev/null
+++ b/monitoring/snmp/CEPH-MIB.txt
@@ -0,0 +1,337 @@
+CEPH-MIB DEFINITIONS ::= BEGIN
+
+IMPORTS
+    MODULE-IDENTITY, NOTIFICATION-TYPE, enterprises
+        FROM SNMPv2-SMI
+    MODULE-COMPLIANCE, NOTIFICATION-GROUP
+        FROM SNMPv2-CONF
+;
+
+-- Linting information:
+--
+-- # smilint -l 6 -i notification-not-reversible ./CEPH-MIB.txt
+--
+-- ignore: notification-not-reversible since our SNMP gateway doesn't use SNMPv1
+--
+
+ceph MODULE-IDENTITY
+    LAST-UPDATED
+        "202111010000Z" -- Nov 01, 2021
+    ORGANIZATION
+        "The Ceph Project
+         https://ceph.io"
+    CONTACT-INFO
+        "Email: <dev@ceph.io>
+
+        Send comments to: <dev@ceph.io>"
+    DESCRIPTION
+        "The MIB module for Ceph. In its current form it only
+        supports Notifications, since Ceph itself doesn't provide
+        any SNMP agent functionality.
+
+        Notifications are provided through a Prometheus/Alertmanager
+        webhook passing alerts to an external gateway service that is
+        responsible for formatting, forwarding and authenticating to
+        the SNMP receiver.
+        "
+    REVISION
+        "202111010000Z" --Nov 01, 2021
+    DESCRIPTION
+        "Latest version including the following updates;
+
+        - MIB restructure to align with linting
+        - names shortened and simplified (less verbose)
+        - Simplified structure due to switch to https://github.com/maxwo/snmp_notifier
+          - objects removed
+          - notifications updated
+        - Added module compliance
+        - Updated to latest prometheus alert rule definitions
+        "
+    ::= { enterprises 50495 }
+
+cephCluster       OBJECT IDENTIFIER ::= { ceph 1 }
+cephConformance   OBJECT IDENTIFIER ::= { ceph 2 }
+
+-- cephMetadata is a placeholder for possible future expansion via an agent
+-- where we could provide an overview of the cluster's configuration
+cephMetadata      OBJECT IDENTIFIER ::= { cephCluster 1 }
+cephNotifications OBJECT IDENTIFIER ::= { cephCluster 2 }
+
+prometheus OBJECT IDENTIFIER ::= { cephNotifications 1 }
+
+--
+-- Notifications: first we define the notification 'branches' for the
+-- different categories of notifications / alerts
+promGeneric       OBJECT IDENTIFIER ::= { prometheus 1 }
+promHealthStatus  OBJECT IDENTIFIER ::= { prometheus 2 }
+promMon           OBJECT IDENTIFIER ::= { prometheus 3 }
+promOsd           OBJECT IDENTIFIER ::= { prometheus 4 }
+promMds           OBJECT IDENTIFIER ::= { prometheus 5 }
+promMgr           OBJECT IDENTIFIER ::= { prometheus 6 }
+promPGs           OBJECT IDENTIFIER ::= { prometheus 7 }
+promNode          OBJECT IDENTIFIER ::= { prometheus 8 }
+promPool          OBJECT IDENTIFIER ::= { prometheus 9 }
+promRados         OBJECT IDENTIFIER ::= { prometheus 10 }
+promCephadm       OBJECT IDENTIFIER ::= { prometheus 11 }
+promPrometheus    OBJECT IDENTIFIER ::= { prometheus 12 }
+
+promGenericNotification NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Generic alert issued when the Prometheus rule doesn't provide an OID."
+::= { promGeneric 1 }
+
+promGenericDaemonCrash NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more daemons have crashed recently, and are yet to be archived"
+::= { promGeneric 2 }
+
+promHealthStatusError NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph in health_error state for too long."
+::= { promHealthStatus 1 }
+
+promHealthStatusWarning NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph in health_warn for too long."
+::= { promHealthStatus 2 }
+
+promMonLowQuorum NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Monitor count in quorum is low."
+::= { promMon 1 }
+
+promMonDiskSpaceCritical NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Monitor diskspace is critically low."
+::= { promMon 2 }
+
+promOsdDownHigh NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A high number of OSDs are down."
+::= { promOsd 1 }
+
+promOsdDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more OSDs are down."
+::= { promOsd 2 }
+
+promOsdNearFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD is dangerously full."
+::= { promOsd 3 }
+
+promOsdFlapping NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD was marked down and back up at least once a minute for 5 minutes."
+::= { promOsd 4 }
+
+promOsdHighPgDeviation NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD deviates by more than 30% from average PG count."
+::= { promOsd 5 }
+
+promOsdFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD has reached its full threshold."
+::= { promOsd 6 }
+
+promOsdHighPredictedFailures NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Normal self healing unable to cope with the number of devices predicted to fail."
+::= { promOsd 7 }
+
+promOsdHostDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph OSD host is down."
+::= { promOsd 8 }
+
+promMdsDamaged NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is damaged."
+::= { promMds 1 }
+
+promMdsReadOnly NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem marked as READ-ONLY"
+::= { promMds 2 }
+
+promMdsOffline NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is unavailable/offline."
+::= { promMds 3 }
+
+promMdsDegraded NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is in a degraded state."
+::= { promMds 4 }
+
+promMdsNoStandby NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs MDS daemon failure, no standby available"
+::= { promMds 5 }
+
+promMgrModuleCrash NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph mgr module has crashed recently"
+::= { promMgr 1 }
+
+promMgrPrometheusInactive NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph mgr prometheus module not responding"
+::= { promMgr 2 }
+
+promPGsInactive NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are inactive for more than 5 minutes."
+::= { promPGs 1 }
+
+promPGsUnclean NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are not clean for more than 15 minutes."
+::= { promPGs 2 }
+
+promPGsUnavailable NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are unavailable, blocking I/O to those objects."
+::= { promPGs 3 }
+
+promPGsDamaged NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are damaged."
+::= { promPGs 4 }
+
+promPGsRecoveryFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "PG recovery is impaired due to full OSDs."
+::= { promPGs 5 }
+
+promPGsBackfillFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "PG backfill is impaired due to full OSDs."
+::= { promPGs 6 }
+
+promNodeRootVolumeFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Root volume (OSD and MON store) is dangerously full (< 5% free)."
+::= { promNode 1 }
+
+promNodeNetworkPacketDrops NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A node experiences packet drop > 1 packet/s on an interface."
+::= { promNode 2 }
+
+promNodeNetworkPacketErrors NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A node experiences packet errors > 1 packet/s on an interface."
+::= { promNode 3 }
+
+promNodeStorageFilling NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A mountpoint will be full in less than 5 days assuming the average fillup rate of the past 48 hours."
+::= { promNode 4 }
+
+promPoolFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A pool is at 90% capacity or over."
+::= { promPool 1 }
+
+promPoolFilling NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A pool will be full in less than 5 days assuming the average fillup rate of the past 48 hours."
+::= { promPool 2 }
+
+promRadosUnfound NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A RADOS object can not be found, even though all OSDs are online."
+::= { promRados 1 }
+
+promCephadmDaemonDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephadm has determined that a daemon is down."
+::= { promCephadm 1 }
+
+promCephadmUpgradeFailure NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephadm attempted to upgrade the cluster and encountered a problem."
+::= { promCephadm 2 }
+
+promPrometheusJobMissing NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "The prometheus scrape job is not defined."
+::= { promPrometheus 1 }
+-- ---------------------------------------------------------- --
+-- CEPH-MIB - Conformance Information
+-- ---------------------------------------------------------- --
+
+cephAlertGroups   OBJECT IDENTIFIER ::= { cephConformance 1 }
+cephCompliances   OBJECT IDENTIFIER ::= { cephConformance 2 }
+
+-- ---------------------------------------------------------- --
+-- units of conformance
+-- ---------------------------------------------------------- --
+
+-- ---------------------------------------------------------- --
+-- The Trap Notification Group
+-- ---------------------------------------------------------- --
+
+cephNotificationGroup NOTIFICATION-GROUP
+    NOTIFICATIONS {
+        promGenericNotification,
+        promGenericDaemonCrash,
+        promHealthStatusError,
+        promHealthStatusWarning,
+        promMonLowQuorum,
+        promMonDiskSpaceCritical,
+        promOsdDownHigh,
+        promOsdDown,
+        promOsdNearFull,
+        promOsdFlapping,
+        promOsdHighPgDeviation,
+        promOsdFull,
+        promOsdHighPredictedFailures,
+        promOsdHostDown,
+        promMdsDamaged,
+        promMdsReadOnly,
+        promMdsOffline,
+        promMdsDegraded,
+        promMdsNoStandby,
+        promMgrModuleCrash,
+        promMgrPrometheusInactive,
+        promPGsInactive,
+        promPGsUnclean,
+        promPGsUnavailable,
+        promPGsDamaged,
+        promPGsRecoveryFull,
+        promPGsBackfillFull,
+        promNodeRootVolumeFull,
+        promNodeNetworkPacketDrops,
+        promNodeNetworkPacketErrors,
+        promNodeStorageFilling,
+        promPoolFull,
+        promPoolFilling,
+        promRadosUnfound,
+        promCephadmDaemonDown,
+        promCephadmUpgradeFailure,
+        promPrometheusJobMissing
+    }
+    STATUS current
+    DESCRIPTION
+        "A collection of notifications triggered by the Prometheus
+        rules to convey Ceph cluster state"
+    ::= { cephAlertGroups 2 }
+
+-- ---------------------------------------------------------- --
+-- compliance statements
+-- ---------------------------------------------------------- --
+
+cephCompliance MODULE-COMPLIANCE
+    STATUS current
+    DESCRIPTION
+        "The Compliance statement for the Ceph MIB"
+    MODULE
+        MANDATORY-GROUPS {
+            cephNotificationGroup
+        }
+    ::= { cephCompliances 1 }
+
+END
diff --git a/monitoring/snmp/README.md b/monitoring/snmp/README.md
index dccef1908f8..1a5b609556d 100644
--- a/monitoring/snmp/README.md
+++ b/monitoring/snmp/README.md
@@ -1,24 +1,54 @@
 # SNMP schema
+To show the [OIDs](https://en.wikipedia.org/wiki/Object_identifier) supported by the MIB, use the `snmptranslate` command. Here's an example:
+```
+snmptranslate -Pu -Tz -M ~/git/ceph/monitoring/snmp:/usr/share/snmp/mibs -m CEPH-MIB
+```
+*The `snmptranslate` command is in the net-snmp-utils package*
 
-## Traps
+The MIB provides a NOTIFICATION-only implementation, since Ceph doesn't have an SNMP
+agent feature.
 
-| OID | Description |
-| :--- | :--- |
-| 1.3.6.1.4.1.50495.15.1.2.1 | The default trap. This is used if no OID is specified in the alert labels. |
-| 1.3.6.1.4.1.50495.15.1.2.[2...N] | Custom traps. |
+## Integration
+The SNMP MIB has been aligned to the Prometheus rules. Any rule that defines a
+critical alert should have a corresponding OID in the CEPH-MIB.txt file. To generate
+an SNMP notification, you must use an SNMP gateway that the Prometheus Alertmanager
+service can forward alerts to via its webhooks feature.
 
-## Objects
+&nbsp;
 
-The following objects are appended as variable binds to an SNMP trap.
+## SNMP Gateway
+The recommended SNMP gateway is https://github.com/maxwo/snmp_notifier. This is a widely
+used and generic SNMP gateway implementation written in Go. Its usage (syntax and
+parameters) is very similar to Prometheus, Alertmanager and even node-exporter.
 
-| OID | Type | Description |
-| :--- | :---: | :--- |
-| 1.3.6.1.4.1.50495.15.1.1.1 | String | The name of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.2 | String | The status of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.3 | String | The severity of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.4 | String | Unique identifier for the Prometheus instance. |
-| 1.3.6.1.4.1.50495.15.1.1.5 | String | The name of the Prometheus job. |
-| 1.3.6.1.4.1.50495.15.1.1.6 | String | The Prometheus alert description field. |
-| 1.3.6.1.4.1.50495.15.1.1.7 | String | Additional Prometheus alert labels as JSON string. |
-| 1.3.6.1.4.1.50495.15.1.1.8 | Unix timestamp | The time when the Prometheus alert occurred. |
-| 1.3.6.1.4.1.50495.15.1.1.9 | String | The raw Prometheus alert as JSON string. |
-\ No newline at end of file
+&nbsp;
+## SNMP OIDs
+The main components of the Ceph MIB can be broken down into discrete areas:
+
+
+```
+internet private enterprise   ceph   ceph    Notifications   Prometheus  Notification
+                               org  cluster   (alerts)         source      Category
+1.3.6.1   .4     .1          .50495   .1        .2               .1         .2  (Ceph Health)
+                                                                            .3  (MON)
+                                                                            .4  (OSD)
+                                                                            .5  (MDS)
+                                                                            .6  (MGR)
+                                                                            .7  (PGs)
+                                                                            .8  (Nodes)
+                                                                            .9  (Pools)
+                                                                            .10  (Rados)
+                                                                            .11 (cephadm)
+                                                                            .12 (prometheus)
+
+```
+Individual alerts are placed within the appropriate alert category. For example, to add
+a notification relating to a MGR issue, you would use the OID 1.3.6.1.4.1.50495.1.2.1.6.x
+
+The SNMP gateway also adds the following components to the SNMP notification:
+
+| Suffix | Description |
+|--------|-------------|
+| .1 | The OID |
+| .2 | Severity of the alert. When an alert is resolved, the severity is 'info' and the description is set to Status:OK |
+| .3 | Text of the alert(s) |
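
The OID layout described in the README can be sketched in a few lines of Python. This is an illustration only and not part of the commit: the names `PROMETHEUS_BASE`, `CATEGORIES`, `notification_oid` and `gateway_varbinds` are hypothetical. The category numbers are taken from the `prom*` branches in CEPH-MIB.txt, and the varbind suffixes from the README table. The code assumes the suffixes are appended directly to the notification OID.

```python
# Hypothetical helpers for reasoning about the CEPH-MIB OID schema.
# Nothing here exists in the Ceph tree; names are for illustration only.

# enterprises(1.3.6.1.4.1) . ceph(50495) . cephCluster(1)
#   . cephNotifications(2) . prometheus(1)
PROMETHEUS_BASE = "1.3.6.1.4.1.50495.1.2.1"

# Category branch numbers as defined by the prom* OBJECT IDENTIFIERs in the MIB.
CATEGORIES = {
    "generic": 1,
    "health": 2,
    "mon": 3,
    "osd": 4,
    "mds": 5,
    "mgr": 6,
    "pgs": 7,
    "node": 8,
    "pool": 9,
    "rados": 10,
    "cephadm": 11,
    "prometheus": 12,
}


def notification_oid(category: str, index: int) -> str:
    """Return the OID of the index-th notification within a category."""
    if category not in CATEGORIES:
        raise KeyError(f"unknown category: {category}")
    if index < 1:
        raise ValueError("notification index must be >= 1")
    return f"{PROMETHEUS_BASE}.{CATEGORIES[category]}.{index}"


def gateway_varbinds(oid: str) -> dict:
    """Return the OIDs of the components the gateway adds to a notification.

    Assumes the suffixes from the README table (.1 OID, .2 severity,
    .3 alert text) are appended to the notification OID.
    """
    return {
        "oid": f"{oid}.1",
        "severity": f"{oid}.2",
        "text": f"{oid}.3",
    }
```

For instance, `notification_oid("mgr", 1)` yields the OID that the MIB assigns to `promMgrModuleCrash`, and `gateway_varbinds` applied to that value lists the three suffixed OIDs under it.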