author | Paul Cuzner <pcuzner@redhat.com> | 2021-11-03 03:24:20 +0100 |
---|---|---|
committer | Paul Cuzner <pcuzner@redhat.com> | 2021-11-04 23:24:25 +0100 |
commit | 7ffcbd7f7955b443ffd4293a497b2a99180a3ad2 (patch) | |
tree | 6461bbaf0cb79b683459a6758350cb677ab4ec10 /monitoring/snmp | |
parent | Merge pull request #31877 from rosinL/wip-fix-dpdk-link (diff) | |
mgr/prometheus: Update rule format and enhance SNMP support
Rules now adhere to the format defined by Prometheus.io.
This changes alert naming, and each alert now includes a
summary description to provide a quick one-liner.
In addition to reformatting, some missing alerts for MDS and
cephadm have been added, along with corresponding tests.
The MIB has also been refactored so that it now passes standard
lint tests, and a README is included for devs to understand the
OID schema.
Fixes: https://tracker.ceph.com/issues/53111
Signed-off-by: Paul Cuzner <pcuzner@redhat.com>
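
The rule conventions described in the commit message can be checked mechanically. The following is a minimal sketch, not code from this commit: `check_rule` is a hypothetical helper, and it assumes each rule has been loaded into a Python dict with the standard Prometheus alerting-rule keys (`alert`, `labels`, `annotations`). The `severity` and `oid` label names are assumptions based on the README text in this change, which says that critical alerts should have a corresponding OID.

```python
from typing import List


def check_rule(rule: dict) -> List[str]:
    """Return a list of problems found in one alerting rule (empty if none).

    Assumed conventions (not taken verbatim from the commit):
      * every alert carries 'summary' and 'description' annotations
      * every alert with labels.severity == 'critical' carries an 'oid' label
    """
    problems = []
    name = rule.get("alert", "<unnamed>")
    annotations = rule.get("annotations", {})
    labels = rule.get("labels", {})

    # Each alert should provide a one-line summary plus a longer description.
    for key in ("summary", "description"):
        if not annotations.get(key):
            problems.append(f"{name}: missing '{key}' annotation")

    # Critical alerts are expected to map to a notification in the MIB.
    if labels.get("severity") == "critical" and not labels.get("oid"):
        problems.append(f"{name}: critical alert without an 'oid' label")

    return problems
```

A rule that satisfies both conventions yields an empty list, so the helper could be run over every rule in a rules file as part of a unit test.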
Diffstat (limited to 'monitoring/snmp')
-rw-r--r-- | monitoring/snmp/CEPH-MIB.txt | 337 | ||||
-rw-r--r-- | monitoring/snmp/README.md | 66 |
2 files changed, 385 insertions, 18 deletions
diff --git a/monitoring/snmp/CEPH-MIB.txt b/monitoring/snmp/CEPH-MIB.txt
new file mode 100644
index 00000000000..f54cb361037
--- /dev/null
+++ b/monitoring/snmp/CEPH-MIB.txt
@@ -0,0 +1,337 @@
+CEPH-MIB DEFINITIONS ::= BEGIN
+
+IMPORTS
+    MODULE-IDENTITY, NOTIFICATION-TYPE, enterprises
+        FROM SNMPv2-SMI
+    MODULE-COMPLIANCE, NOTIFICATION-GROUP
+        FROM SNMPv2-CONF
+;
+
+-- Linting information:
+--
+-- # smilint -l 6 -i notification-not-reversible ./CEPH-MIB.txt
+--
+-- ignore: notification-not-reversible since our SNMP gateway doesn't use SNMPv1
+--
+
+ceph MODULE-IDENTITY
+    LAST-UPDATED
+        "202111010000Z" -- Nov 01, 2021
+    ORGANIZATION
+        "The Ceph Project
+         https://ceph.io"
+    CONTACT-INFO
+        "Email: <dev@ceph.io>
+
+        Send comments to: <dev@ceph.io>"
+    DESCRIPTION
+        "The MIB module for Ceph. In its current form it only
+        supports Notifications, since Ceph itself doesn't provide
+        any SNMP agent functionality.
+
+        Notifications are provided through a Prometheus/Alertmanager
+        webhook passing alerts to an external gateway service that is
+        responsible for formatting, forwarding and authenticating to
+        the SNMP receiver.
+        "
+    REVISION
+        "202111010000Z" --Nov 01, 2021
+    DESCRIPTION
+        "Latest version including the following updates:
+
+        - MIB restructure to align with linting
+        - names shortened and simplified (less verbose)
+        - Simplified structure due to switch to https://github.com/maxwo/snmp_notifier
+          - objects removed
+          - notifications updated
+        - Added module compliance
+        - Updated to latest prometheus alert rule definitions
+        "
+    ::= { enterprises 50495 }
+
+cephCluster       OBJECT IDENTIFIER ::= { ceph 1 }
+cephConformance   OBJECT IDENTIFIER ::= { ceph 2 }
+
+-- cephMetadata is a placeholder for possible future expansion via an agent
+-- where we could provide an overview of the cluster's configuration
+cephMetadata      OBJECT IDENTIFIER ::= { cephCluster 1 }
+cephNotifications OBJECT IDENTIFIER ::= { cephCluster 2 }
+
+prometheus OBJECT IDENTIFIER ::= { cephNotifications 1 }
+
+--
+-- Notifications: first we define the notification 'branches' for the
+-- different categories of notifications / alerts
+promGeneric       OBJECT IDENTIFIER ::= { prometheus 1 }
+promHealthStatus  OBJECT IDENTIFIER ::= { prometheus 2 }
+promMon           OBJECT IDENTIFIER ::= { prometheus 3 }
+promOsd           OBJECT IDENTIFIER ::= { prometheus 4 }
+promMds           OBJECT IDENTIFIER ::= { prometheus 5 }
+promMgr           OBJECT IDENTIFIER ::= { prometheus 6 }
+promPGs           OBJECT IDENTIFIER ::= { prometheus 7 }
+promNode          OBJECT IDENTIFIER ::= { prometheus 8 }
+promPool          OBJECT IDENTIFIER ::= { prometheus 9 }
+promRados         OBJECT IDENTIFIER ::= { prometheus 10 }
+promCephadm       OBJECT IDENTIFIER ::= { prometheus 11 }
+promPrometheus    OBJECT IDENTIFIER ::= { prometheus 12 }
+
+promGenericNotification NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Generic alert issued when the Prometheus rule doesn't provide an OID."
+::= { promGeneric 1 }
+
+promGenericDaemonCrash NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more daemons have crashed recently, and are yet to be archived"
+::= { promGeneric 2 }
+
+promHealthStatusError NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph in health_error state for too long."
+::= { promHealthStatus 1 }
+
+promHealthStatusWarning NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph in health_warn for too long."
+::= { promHealthStatus 2 }
+
+promMonLowQuorum NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Monitor count in quorum is low."
+::= { promMon 1 }
+
+promMonDiskSpaceCritical NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Monitor diskspace is critically low."
+::= { promMon 2 }
+
+promOsdDownHigh NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A high number of OSDs are down."
+::= { promOsd 1 }
+
+promOsdDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more OSDs down."
+::= { promOsd 2 }
+
+promOsdNearFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD is dangerously full."
+::= { promOsd 3 }
+
+promOsdFlapping NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD was marked down and back up at least once a minute for 5 minutes."
+::= { promOsd 4 }
+
+promOsdHighPgDeviation NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD deviates by more than 30% from average PG count."
+::= { promOsd 5 }
+
+promOsdFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "An OSD has reached its full threshold."
+::= { promOsd 6 }
+
+promOsdHighPredictedFailures NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Normal self healing unable to cope with the number of devices predicted to fail."
+::= { promOsd 7 }
+
+promOsdHostDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph OSD host is down."
+::= { promOsd 8 }
+
+promMdsDamaged NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is damaged."
+::= { promMds 1 }
+
+promMdsReadOnly NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem marked as READ-ONLY"
+::= { promMds 2 }
+
+promMdsOffline NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is unavailable/offline."
+::= { promMds 3 }
+
+promMdsDegraded NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs filesystem is in a degraded state."
+::= { promMds 4 }
+
+promMdsNoStandby NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephfs MDS daemon failure, no standby available"
+::= { promMds 5 }
+
+promMgrModuleCrash NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph mgr module has crashed recently"
+::= { promMgr 1 }
+
+promMgrPrometheusInactive NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Ceph mgr prometheus module not responding"
+::= { promMgr 2 }
+
+promPGsInactive NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are inactive for more than 5 minutes."
+::= { promPGs 1 }
+
+promPGsUnclean NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are not clean for more than 15 minutes."
+::= { promPGs 2 }
+
+promPGsUnavailable NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are unavailable, blocking I/O to those objects."
+::= { promPGs 3 }
+
+promPGsDamaged NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "One or more PGs are damaged."
+::= { promPGs 4 }
+
+promPGsRecoveryFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "PG recovery is impaired due to full OSDs."
+::= { promPGs 5 }
+
+promPGsBackfillFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "PG backfill is impaired due to full OSDs."
+::= { promPGs 6 }
+
+promNodeRootVolumeFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Root volume (OSD and MON store) is dangerously full (< 5% free)."
+::= { promNode 1 }
+
+promNodeNetworkPacketDrops NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A node experiences packet drop > 1 packet/s on an interface."
+::= { promNode 2 }
+
+promNodeNetworkPacketErrors NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A node experiences packet errors > 1 packet/s on an interface."
+::= { promNode 3 }
+
+promNodeStorageFilling NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A mountpoint will be full in less than 5 days assuming the average fillup rate of the past 48 hours."
+::= { promNode 4 }
+
+promPoolFull NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A pool is at 90% capacity or over."
+::= { promPool 1 }
+
+promPoolFilling NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A pool will be full in less than 5 days assuming the average fillup rate of the past 48 hours."
+::= { promPool 2 }
+
+promRadosUnfound NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "A RADOS object can not be found, even though all OSDs are online."
+::= { promRados 1 }
+
+promCephadmDaemonDown NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephadm has determined that a daemon is down."
+::= { promCephadm 1 }
+
+promCephadmUpgradeFailure NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "Cephadm attempted to upgrade the cluster and encountered a problem."
+::= { promCephadm 2 }
+
+promPrometheusJobMissing NOTIFICATION-TYPE
+    STATUS      current
+    DESCRIPTION "The prometheus scrape job is not defined."
+::= { promPrometheus 1 }
+-- ---------------------------------------------------------- --
+-- CEPH MIB - Conformance Information
+-- ---------------------------------------------------------- --
+
+cephAlertGroups   OBJECT IDENTIFIER ::= { cephConformance 1 }
+cephCompliances   OBJECT IDENTIFIER ::= { cephConformance 2 }
+
+-- ---------------------------------------------------------- --
+-- units of conformance
+-- ---------------------------------------------------------- --
+
+-- ---------------------------------------------------------- --
+-- The Trap Notification Group
+-- ---------------------------------------------------------- --
+
+cephNotificationGroup NOTIFICATION-GROUP
+    NOTIFICATIONS {
+        promGenericNotification,
+        promGenericDaemonCrash,
+        promHealthStatusError,
+        promHealthStatusWarning,
+        promMonLowQuorum,
+        promMonDiskSpaceCritical,
+        promOsdDownHigh,
+        promOsdDown,
+        promOsdNearFull,
+        promOsdFlapping,
+        promOsdHighPgDeviation,
+        promOsdFull,
+        promOsdHighPredictedFailures,
+        promOsdHostDown,
+        promMdsDamaged,
+        promMdsReadOnly,
+        promMdsOffline,
+        promMdsDegraded,
+        promMdsNoStandby,
+        promMgrModuleCrash,
+        promMgrPrometheusInactive,
+        promPGsInactive,
+        promPGsUnclean,
+        promPGsUnavailable,
+        promPGsDamaged,
+        promPGsRecoveryFull,
+        promPGsBackfillFull,
+        promNodeRootVolumeFull,
+        promNodeNetworkPacketDrops,
+        promNodeNetworkPacketErrors,
+        promNodeStorageFilling,
+        promPoolFull,
+        promPoolFilling,
+        promRadosUnfound,
+        promCephadmDaemonDown,
+        promCephadmUpgradeFailure,
+        promPrometheusJobMissing
+    }
+    STATUS      current
+    DESCRIPTION
+        "A collection of notifications triggered by the Prometheus
+        rules to convey Ceph cluster state"
+    ::= { cephAlertGroups 2 }
+
+-- ---------------------------------------------------------- --
+-- compliance statements
+-- ---------------------------------------------------------- --
+
+cephCompliance MODULE-COMPLIANCE
+    STATUS      current
+    DESCRIPTION
+        "The Compliance statement for the Ceph MIB"
+    MODULE
+        MANDATORY-GROUPS {
+            cephNotificationGroup
+        }
+    ::= { cephCompliances 1 }
+
+END
diff --git a/monitoring/snmp/README.md b/monitoring/snmp/README.md
index dccef1908f8..1a5b609556d 100644
--- a/monitoring/snmp/README.md
+++ b/monitoring/snmp/README.md
@@ -1,24 +1,54 @@
 # SNMP schema
+To show the [OIDs](https://en.wikipedia.org/wiki/Object_identifier) supported by the MIB, use the snmptranslate command. Here's an example:
+```
+snmptranslate -Pu -Tz -M ~/git/ceph/monitoring/snmp:/usr/share/snmp/mibs -m CEPH-MIB
+```
+*The `snmptranslate` command is in the net-snmp-utils package*
 
-## Traps
+The MIB provides a NOTIFICATION only implementation since ceph doesn't have an SNMP
+agent feature.
 
-| OID | Description |
-| :--- | :--- |
-| 1.3.6.1.4.1.50495.15.1.2.1 | The default trap. This is used if no OID is specified in the alert labels. |
-| 1.3.6.1.4.1.50495.15.1.2.[2...N] | Custom traps. |
+## Integration
+The SNMP MIB has been aligned to the Prometheus rules. Any rule that defines a
+critical alert should have a corresponding oid in the CEPH-MIB.txt file. To generate
+an SNMP notification, you must use an SNMP gateway that the Prometheus Alertmanager
+service can forward alerts through to, via its webhooks feature.
 
-## Objects
+
 
-The following objects are appended as variable binds to an SNMP trap.
+## SNMP Gateway
+The recommended SNMP gateway is https://github.com/maxwo/snmp_notifier. This is a widely
+used and generic SNMP gateway implementation written in Go. Its usage (syntax and
+parameters) is very similar to Prometheus, AlertManager and even node-exporter.
 
-| OID | Type | Description |
-| :--- | :---: | :--- |
-| 1.3.6.1.4.1.50495.15.1.1.1 | String | The name of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.2 | String | The status of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.3 | String | The severity of the Prometheus alert. |
-| 1.3.6.1.4.1.50495.15.1.1.4 | String | Unique identifier for the Prometheus instance. |
-| 1.3.6.1.4.1.50495.15.1.1.5 | String | The name of the Prometheus job. |
-| 1.3.6.1.4.1.50495.15.1.1.6 | String | The Prometheus alert description field. |
-| 1.3.6.1.4.1.50495.15.1.1.7 | String | Additional Prometheus alert labels as JSON string. |
-| 1.3.6.1.4.1.50495.15.1.1.8 | Unix timestamp | The time when the Prometheus alert occurred. |
-| 1.3.6.1.4.1.50495.15.1.1.9 | String | The raw Prometheus alert as JSON string. |
\ No newline at end of file
+
+## SNMP OIDs
+The main components of the Ceph MIB can be broken down into discrete areas
+
+
+```
+internet  private  enterprise  ceph    ceph     Notifications  Prometheus  Notification
+                               org     cluster  (alerts)       source      Category
+1.3.6.1   .4       .1          .50495  .1       .2             .1          .2  (Ceph Health)
+                                                                           .3  (MON)
+                                                                           .4  (OSD)
+                                                                           .5  (MDS)
+                                                                           .6  (MGR)
+                                                                           .7  (PGs)
+                                                                           .8  (Nodes)
+                                                                           .9  (Pools)
+                                                                           .10 (Rados)
+                                                                           .11 (cephadm)
+                                                                           .12 (prometheus)
+
+```
+Individual alerts are placed within the appropriate alert category. For example, to add
+a notification relating to a MGR issue, you would use the oid 1.3.6.1.4.1.50495.1.2.1.6.x
+
+The SNMP gateway also adds additional components to the SNMP notification:
+
+| Suffix | Description |
+|--------|-------------|
+| .1 | The oid |
+| .2 | Severity of the alert. When an alert is resolved, severity is 'info', and the description is set to Status:OK|
+| .3 | Text of the alert(s) |
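
The OID layout described in the README can be sketched in Python. This is an illustrative helper, not part of the commit: the names `notification_oid` and `varbind_oids` are hypothetical, and the category numbers are copied from the `prom*` branches defined in CEPH-MIB.txt. Appending the gateway suffixes directly to the notification OID is an assumption; the README lists the suffixes but does not spell out where they attach.

```python
# enterprises (1.3.6.1.4.1) . ceph (50495) . cephCluster (1)
#   . cephNotifications (2) . prometheus (1)
PROMETHEUS_ROOT = "1.3.6.1.4.1.50495.1.2.1"

# Category branch numbers as defined in CEPH-MIB.txt.
CATEGORIES = {
    "promGeneric": 1,
    "promHealthStatus": 2,
    "promMon": 3,
    "promOsd": 4,
    "promMds": 5,
    "promMgr": 6,
    "promPGs": 7,
    "promNode": 8,
    "promPool": 9,
    "promRados": 10,
    "promCephadm": 11,
    "promPrometheus": 12,
}

# Components the README says the SNMP gateway adds to each notification.
# ASSUMPTION: each suffix is appended to the notification's own OID.
VARBIND_SUFFIXES = {"oid": 1, "severity": 2, "text": 3}


def notification_oid(category: str, index: int) -> str:
    """Build the OID for the index-th notification in a category."""
    if category not in CATEGORIES:
        raise KeyError(f"unknown category: {category}")
    if index < 1:
        raise ValueError("notification index must be >= 1")
    return f"{PROMETHEUS_ROOT}.{CATEGORIES[category]}.{index}"


def varbind_oids(oid: str) -> dict:
    """Return the OIDs of the gateway-added components for a notification."""
    return {name: f"{oid}.{suffix}" for name, suffix in VARBIND_SUFFIXES.items()}


if __name__ == "__main__":
    mgr_crash = notification_oid("promMgr", 1)  # promMgrModuleCrash
    print(mgr_crash)
    print(varbind_oids(mgr_crash))
```

Keeping the category table in one place mirrors how the MIB defines the branches once and hangs each notification off its category, so adding a new MGR alert only requires picking the next free index under `promMgr`.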