Diffstat (limited to 'doc/cephfs')
-rw-r--r-- | doc/cephfs/disaster-recovery-experts.rst |  41
-rw-r--r-- | doc/cephfs/health-messages.rst           |   8
-rw-r--r-- | doc/cephfs/index.rst                     |   2
-rw-r--r-- | doc/cephfs/purge-queue.rst               | 106
-rw-r--r-- | doc/cephfs/snap-schedule.rst             |   9
-rw-r--r-- | doc/cephfs/snapshots.rst                 |  85
6 files changed, 232 insertions, 19 deletions
diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
index 7677b42f47e..b01a3dfde6a 100644
--- a/doc/cephfs/disaster-recovery-experts.rst
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -21,43 +21,46 @@ Advanced: Metadata repair tools
 Journal export
 --------------
 
-Before attempting dangerous operations, make a copy of the journal like so:
+Before attempting any dangerous operation, make a copy of the journal by
+running the following command:
 
-::
+.. prompt:: bash #
 
-    cephfs-journal-tool journal export backup.bin
+   cephfs-journal-tool journal export backup.bin
 
-Note that this command may not always work if the journal is badly corrupted,
-in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
+If the journal is badly corrupted, this command might not work. In that case,
+make a RADOS-level copy
+(http://tracker.ceph.com/issues/9902).
 
 Dentry recovery from journal
 ----------------------------
 
 If a journal is damaged or for any reason an MDS is incapable of replaying it,
-attempt to recover what file metadata we can like so:
+attempt to recover file metadata by running the following command:
 
-::
+.. prompt:: bash #
 
-    cephfs-journal-tool event recover_dentries summary
+   cephfs-journal-tool event recover_dentries summary
 
-This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks.
+By default, this command acts on MDS rank ``0``. Pass the option ``--rank=<n>``
+to the ``cephfs-journal-tool`` command to operate on other ranks.
 
-This command will write any inodes/dentries recoverable from the journal
-into the backing store, if these inodes/dentries are higher-versioned
-than the previous contents of the backing store. If any regions of the journal
-are missing/damaged, they will be skipped.
+This command writes all inodes and dentries recoverable from the journal into
+the backing store, but only if these inodes and dentries are higher-versioned
+than the existing contents of the backing store. Any regions of the journal
+that are missing or damaged will be skipped.
 
-Note that in addition to writing out dentries and inodes, this command will update
-the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
-are now in use. In simple cases, this will result in an entirely valid backing
+In addition to writing out dentries and inodes, this command updates the
+InoTables of each ``in`` MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
 store state.
 
 .. warning::
 
-    The resulting state of the backing store is not guaranteed to be self-consistent,
-    and an online MDS scrub will be required afterwards. The journal contents
-    will not be modified by this command, you should truncate the journal
+    The resulting state of the backing store is not guaranteed to be
+    self-consistent, and an online MDS scrub will be required afterwards. The
+    journal contents will not be modified by this command. Truncate the journal
     separately after recovering what you can.
 
 Journal truncation
diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
index 0f171c6ccc9..7aa1f2e44ee 100644
--- a/doc/cephfs/health-messages.rst
+++ b/doc/cephfs/health-messages.rst
@@ -269,3 +269,11 @@ other daemons, please see :ref:`health-checks`.
 
     To evict and permanently block broken clients from connecting to the
    cluster, set the ``required_client_feature`` bit ``client_mds_auth_caps``.
+
+``MDS_ESTIMATED_REPLAY_TIME``
+-----------------------------
+  Message
+    "HEALTH_WARN Replay: x% complete. Estimated time remaining *x* seconds"
+
+  Description
+    When an MDS journal replay takes more than 30 seconds, this message
+    indicates the estimated time to completion.
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 57ea336c00b..630d29f1956 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -93,6 +93,7 @@ Administration
    CephFS Top Utility <cephfs-top>
    Scheduled Snapshots <snap-schedule>
    CephFS Snapshot Mirroring <cephfs-mirroring>
+   Purge Queue <purge-queue>
 
 .. raw:: html
 
@@ -147,6 +148,7 @@ CephFS Concepts
    LazyIO <lazyio>
    Directory fragmentation <dirfrags>
    Multiple active MDS daemons <multimds>
+   Snapshots <snapshots>
 
 .. raw:: html
 
diff --git a/doc/cephfs/purge-queue.rst b/doc/cephfs/purge-queue.rst
new file mode 100644
index 00000000000..d7a68e7fa55
--- /dev/null
+++ b/doc/cephfs/purge-queue.rst
@@ -0,0 +1,106 @@
+============
+Purge Queue
+============
+
+The MDS maintains a data structure known as the **Purge Queue**, which is
+responsible for managing and executing the parallel deletion of files.
+There is a purge queue for every MDS rank. A purge queue consists of purge
+items, which contain only nominal information from the inodes, such as the
+size and the layout (that is, all other unneeded metadata is discarded,
+making the purge queue independent of all metadata structures).
+
+Deletion process
+================
+
+When a client requests deletion of a directory (say ``rm -rf``), the MDS:
+
+- queues the files and subdirectories (purge items) in the purge queue
+  journal;
+- processes the deletion of the inodes in the background, in small and
+  manageable chunks;
+- instructs the underlying OSDs to clean up the associated objects in the
+  data pool;
+- updates the journal.
+
+.. note:: If users delete files more quickly than the purge queue can
+          process them, then data pool usage might increase substantially
+          over time. In extreme scenarios, the purge queue backlog can become
+          so large that it slows down capacity reclamation, and the Linux
+          ``du`` command for CephFS might report data that is inconsistent
+          with the usage of the CephFS data pool.
+
+There are a few tunable configuration options that the MDS uses internally
+to throttle purge queue processing:
+
+.. confval:: filer_max_purge_ops
+.. confval:: mds_max_purge_files
+.. confval:: mds_max_purge_ops
+.. confval:: mds_max_purge_ops_per_pg
+
+Generally, the defaults are adequate for most clusters. However, on very
+large clusters, if ``pq_item_in_journal`` (the counter of items pending
+deletion) reaches a very large figure, then these options can be tuned to
+4-5 times their default values as a starting point; further increases
+depend on requirements.
+
+Start with the simplest option, ``filer_max_purge_ops``, which should help
+reclaim space more quickly::
+
+    $ ceph config set mds filer_max_purge_ops 40
+
+Increasing ``filer_max_purge_ops`` is enough for most clusters, but if it is
+not, move on to tuning the other options::
+
+    $ ceph config set mds mds_max_purge_files 256
+    $ ceph config set mds mds_max_purge_ops 32768
+    $ ceph config set mds mds_max_purge_ops_per_pg 2
+
+.. note:: Setting these values will not immediately break anything: they
+          only control how many delete operations are issued to the
+          underlying RADOS cluster. However, they might consume some cluster
+          performance if they are set very high.
+
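+A quick way to keep an eye on these settings and on the backlog itself is to
+read the values back and to watch the purge queue counters while deletions
+are in progress. A minimal sketch (``mds.<id>`` is a placeholder for one of
+your MDS daemons; the trailing ``purge_queue`` argument, which filters the
+perf dump output, can be omitted to see all counters)::
+
+    $ ceph config get mds filer_max_purge_ops
+    $ ceph config get mds mds_max_purge_ops
+    $ ceph tell mds.<id> perf dump purge_queue
+
+The counters reported by the last command are described in the next section.
+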
+.. note:: The purge queue does not auto-tune its work limits against the
+          amount of work that is pending. It is therefore advised to make a
+          conscious decision when tuning these options, based on the cluster
+          size and the workload.
+
+Examining purge queue perf counters
+===================================
+
+When analysing MDS perf dumps, the purge queue statistics look like this::
+
+    "purge_queue": {
+        "pq_executing_ops": 56655,
+        "pq_executing_ops_high_water": 65350,
+        "pq_executing": 1,
+        "pq_executing_high_water": 3,
+        "pq_executed": 25,
+        "pq_item_in_journal": 6567004
+    }
+
+These counters have the following meanings:
+
+.. list-table::
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Name
+     - Description
+   * - pq_executing_ops
+     - Purge queue operations in flight
+   * - pq_executing_ops_high_water
+     - Maximum number of executing purge operations recorded
+   * - pq_executing
+     - Purge queue files being deleted
+   * - pq_executing_high_water
+     - Maximum number of file purges executing at one time
+   * - pq_executed
+     - Purge queue files deleted
+   * - pq_item_in_journal
+     - Purge items (files) left in the journal
+
+.. note:: ``pq_executing`` and ``pq_executing_ops`` might look similar, but
+          there is a small nuance: ``pq_executing`` tracks the number of
+          files in the purge queue, while ``pq_executing_ops`` is the count
+          of RADOS objects across all the files in the purge queue.
diff --git a/doc/cephfs/snap-schedule.rst b/doc/cephfs/snap-schedule.rst
index a94d938040f..48e79047864 100644
--- a/doc/cephfs/snap-schedule.rst
+++ b/doc/cephfs/snap-schedule.rst
@@ -197,6 +197,15 @@ this happens, the next snapshot will be schedule as if the previous one was not
 delayed, i.e. one or more delayed snapshots will not cause drift in the overall
 schedule.
 
+If a volume is deleted while snapshot schedules are active on it, then Python
+tracebacks might appear in the log file or on the command line when commands
+are executed on such volumes. Although measures have been taken to take note
+of fs_map changes, delete active timers and close database connections in
+order to avoid Python tracebacks, it is not possible to suppress them
+completely due to the inherent nature of the problem. If such tracebacks are
+seen, the only way to bring the system back to a stable state is to disable
+and then re-enable the snap_schedule Manager Module.
+
 In order to somewhat limit the overall number of snapshots in a file system, the
 module will only keep a maximum of 50 snapshots per directory. If the retention
 policy results in more then 50 retained snapshots, the retention list will be
diff --git a/doc/cephfs/snapshots.rst b/doc/cephfs/snapshots.rst
new file mode 100644
index 00000000000..a60be96ed53
--- /dev/null
+++ b/doc/cephfs/snapshots.rst
@@ -0,0 +1,85 @@
+================
+CephFS Snapshots
+================
+
+CephFS snapshots create an immutable view of the file system at the point in
+time at which they are taken. CephFS snapshots are managed in a special
+hidden subdirectory named ``.snap``. Snapshots are created with ``mkdir``
+inside this directory.
+
+Snapshots can be exposed under a different name by changing the following
+client configuration options, as shown in the example below:
+
+- ``snapdirname``, which is a mount option for kernel clients
+- ``client_snapdir``, which is a mount option for ceph-fuse
+
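+For example, a kernel client mount and a ceph-fuse mount that expose the
+snapshot directory as ``.snapshots`` might look roughly like this (this is a
+sketch; the mount point and the chosen directory name are placeholders, and
+the usual authentication options still apply):
+
+.. code-block:: bash
+
+    # kernel client: snapdirname is passed as a mount option
+    $ mount -t ceph :/ /mnt/cephfs -o name=admin,snapdirname=.snapshots
+
+    # ceph-fuse: client_snapdir is passed as a client configuration option
+    $ ceph-fuse /mnt/cephfs --client_snapdir=.snapshots
+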
+Snapshot Creation
+=================
+
+The CephFS snapshot feature is enabled by default on new file systems. To
+enable it on existing file systems, use the command below.
+
+.. code-block:: bash
+
+    $ ceph fs set <fs_name> allow_new_snaps true
+
+When snapshots are enabled, all directories in CephFS will have a special
+``.snap`` directory. (You may configure a different name with the client
+snapdir setting if you wish.)
+To create a CephFS snapshot, create a subdirectory under ``.snap`` with a
+name of your choice. For example, to create a snapshot of the directory
+``/mnt/cephfs``, invoke ``mkdir /mnt/cephfs/.snap/snapshot-name``. The
+following commands do the same from inside the mounted directory:
+
+.. code-block:: bash
+
+    $ touch file1
+    $ cd .snap
+    $ mkdir my_snapshot
+
+Using snapshots to recover data
+===============================
+
+Snapshots can also be used to recover deleted files.
+
+- Create ``file1`` and then create snapshot ``snap1``:
+
+.. code-block:: bash
+
+    $ touch /mnt/cephfs/file1
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap1
+
+- Create ``file2`` and then create snapshot ``snap2``:
+
+.. code-block:: bash
+
+    $ touch /mnt/cephfs/file2
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap2
+
+- Delete ``file1`` and then create a new snapshot ``snap3``:
+
+.. code-block:: bash
+
+    $ rm /mnt/cephfs/file1
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap3
+
+- Recover ``file1`` from snapshot ``snap2`` using the ``cp`` command:
+
+.. code-block:: bash
+
+    $ cd /mnt/cephfs/.snap/snap2
+    $ cp file1 /mnt/cephfs/
+
+Snapshot Deletion
+=================
+
+Snapshots are deleted by invoking ``rmdir`` on the snapshot's directory under
+``.snap``. (An attempt to delete a directory that still roots snapshots will
+fail; you must delete the snapshots first.)
+
+.. code-block:: bash
+
+    $ cd .snap
+    $ rmdir my_snapshot
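+
+The snapshots created in the recovery example above can be removed in the
+same way. For example, assuming they are still present under
+``/mnt/cephfs/.snap``:
+
+.. code-block:: bash
+
+    $ cd /mnt/cephfs/.snap
+    $ rmdir snap1 snap2 snap3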