From ae9277398840bf8576ea5a8c4a2ba4e23f8b9613 Mon Sep 17 00:00:00 2001
From: Dhairya Parmar
Date: Fri, 22 Nov 2024 03:40:33 +0530
Subject: doc/cephfs: document purge queue and its perf counters

Fixes: https://tracker.ceph.com/issues/68571
Signed-off-by: Dhairya Parmar
---
 doc/cephfs/index.rst       |   1 +
 doc/cephfs/purge-queue.rst | 106 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 107 insertions(+)
 create mode 100644 doc/cephfs/purge-queue.rst

diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 57ea336c00b..a8a991b01b0 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -93,6 +93,7 @@ Administration
     CephFS Top Utility <cephfs-top>
     Scheduled Snapshots <snap-schedule>
     CephFS Snapshot Mirroring <cephfs-mirroring>
+    Purge Queue <purge-queue>
 
 .. raw:: html
 
diff --git a/doc/cephfs/purge-queue.rst b/doc/cephfs/purge-queue.rst
new file mode 100644
index 00000000000..d7a68e7fa55
--- /dev/null
+++ b/doc/cephfs/purge-queue.rst
@@ -0,0 +1,106 @@
+============
+Purge Queue
+============
+
+The MDS maintains a data structure known as the **Purge Queue**, which is
+responsible for managing and executing the parallel deletion of files.
+There is one purge queue for every MDS rank. A purge queue consists of purge
+items, each of which holds only the nominal information needed from an inode,
+such as its size and layout (all other metadata is discarded, which makes the
+purge queue independent of the metadata structures).
+
+Deletion process
+================
+
+When a client requests deletion of a directory (for example ``rm -rf``):
+
+- The MDS records the files and subdirectories to be removed as purge items
+  in the purge queue journal.
+- The MDS processes the deletion of the corresponding inodes in the
+  background, in small, manageable chunks.
+- The MDS instructs the underlying OSDs to remove the associated objects
+  from the data pool.
+- The MDS updates the purge queue journal as purge items are completed.
+
+.. note:: If clients delete files faster than the purge queue can process
+          them, data pool usage might increase substantially over time. In
+          extreme scenarios the purge queue backlog can become so large that
+          capacity reclamation is noticeably slowed, and the Linux ``du``
+          command run on a CephFS mount might report usage inconsistent with
+          that of the CephFS data pool.
+
+There are a few tunable configuration options that the MDS uses internally
+to throttle purge queue processing:
+
+.. confval:: filer_max_purge_ops
+.. confval:: mds_max_purge_files
+.. confval:: mds_max_purge_ops
+.. confval:: mds_max_purge_ops_per_pg
+
+Generally, the defaults are adequate for most clusters. However, on very
+large clusters, if ``pq_item_in_journal`` (the count of items pending
+deletion) grows to an enormous figure, these options can be tuned to 4-5
+times their default values as a starting point; further increases should be
+driven by observed need.
+
+Start with the simplest option, ``filer_max_purge_ops``, which should help
+reclaim space more quickly::
+
+    $ ceph config set mds filer_max_purge_ops 40
+
+Increasing ``filer_max_purge_ops`` is usually sufficient, but if it is not,
+continue by tuning the other options::
+
+    $ ceph config set mds mds_max_purge_files 256
+    $ ceph config set mds mds_max_purge_ops 32768
+    $ ceph config set mds mds_max_purge_ops_per_pg 2
+
+.. note:: Setting these values will not immediately break anything, but
+          because they control how many delete operations the MDS issues to
+          the underlying RADOS cluster, excessively high values can consume
+          a noticeable amount of cluster performance.
+
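+Before raising these limits, it can help to check the values currently in
+effect. The commands below are only a sketch of one way to do this with the
+generic ``ceph config`` interface; ``mds.<name>`` is a placeholder for an
+actual MDS daemon name::
+
+    # centrally configured value for all MDS daemons
+    $ ceph config get mds mds_max_purge_ops
+    # value in use by a specific running daemon
+    $ ceph config show mds.<name> mds_max_purge_ops
+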
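+Whether tuning is needed at all, and whether it is having the desired effect,
+is best judged by watching the purge queue counters described in the next
+section. One possible way to fetch them, assuming access to the admin socket
+on the host running the daemon (``mds.<name>`` is again a placeholder)::
+
+    $ ceph daemon mds.<name> perf dump purge_queue
+
+On recent releases the same counters can usually also be retrieved remotely
+with ``ceph tell mds.<name> perf dump``.
+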
+.. note:: The purge queue does not automatically tune its work limits in
+          response to the amount of work outstanding. Make a deliberate
+          decision when tuning these options, based on the cluster size and
+          workload.
+
+Examining purge queue perf counters
+===================================
+
+When analysing an MDS perf dump, the purge queue statistics look like this::
+
+    "purge_queue": {
+        "pq_executing_ops": 56655,
+        "pq_executing_ops_high_water": 65350,
+        "pq_executing": 1,
+        "pq_executing_high_water": 3,
+        "pq_executed": 25,
+        "pq_item_in_journal": 6567004
+    }
+
+The counters have the following meanings:
+
+.. list-table::
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Name
+     - Description
+   * - pq_executing_ops
+     - Purge queue operations currently in flight
+   * - pq_executing_ops_high_water
+     - Maximum number of in-flight purge operations recorded
+   * - pq_executing
+     - Purge queue files currently being deleted
+   * - pq_executing_high_water
+     - Maximum number of concurrently executing file purges recorded
+   * - pq_executed
+     - Purge queue files that have been deleted
+   * - pq_item_in_journal
+     - Purge items (files) remaining in the journal
+
+.. note:: ``pq_executing`` and ``pq_executing_ops`` might look similar, but
+          there is a subtle difference: ``pq_executing`` tracks the number of
+          files currently being purged, while ``pq_executing_ops`` counts the
+          underlying RADOS object operations for all of those files.
--
cgit v1.2.3