Diffstat (limited to 'doc/cephfs')
-rw-r--r--  doc/cephfs/disaster-recovery-experts.rst |  41
-rw-r--r--  doc/cephfs/health-messages.rst           |   8
-rw-r--r--  doc/cephfs/index.rst                     |   2
-rw-r--r--  doc/cephfs/purge-queue.rst               | 106
-rw-r--r--  doc/cephfs/snap-schedule.rst             |   9
-rw-r--r--  doc/cephfs/snapshots.rst                 |  85
6 files changed, 232 insertions, 19 deletions
diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
index 7677b42f47e..b01a3dfde6a 100644
--- a/doc/cephfs/disaster-recovery-experts.rst
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -21,43 +21,46 @@ Advanced: Metadata repair tools
Journal export
--------------
-Before attempting dangerous operations, make a copy of the journal like so:
+Before attempting any dangerous operation, make a copy of the journal by
+running the following command:
-::
+.. prompt:: bash #
- cephfs-journal-tool journal export backup.bin
+ cephfs-journal-tool journal export backup.bin
-Note that this command may not always work if the journal is badly corrupted,
-in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
+If the journal is badly corrupted, this command might not work. If the journal
+is badly corrupted, make a RADOS-level copy
+(http://tracker.ceph.com/issues/9902).
Dentry recovery from journal
----------------------------
If a journal is damaged or for any reason an MDS is incapable of replaying it,
-attempt to recover what file metadata we can like so:
+attempt to recover file metadata by running the following command:
-::
+.. prompt:: bash #
- cephfs-journal-tool event recover_dentries summary
+ cephfs-journal-tool event recover_dentries summary
-This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks.
+By default, this command acts on MDS rank ``0``. Pass the option ``--rank=<n>``
+to the ``cephfs-journal-tool`` command to operate on other ranks.
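+
+For example, to operate on rank 1 of a file system named ``cephfs`` (the file
+system name here is only an illustrative placeholder; depending on the
+release, the bare rank, as in ``--rank=1``, may also be accepted):
+
+.. prompt:: bash #
+
+   cephfs-journal-tool --rank=cephfs:1 event recover_dentries summary
+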
-This command will write any inodes/dentries recoverable from the journal
-into the backing store, if these inodes/dentries are higher-versioned
-than the previous contents of the backing store. If any regions of the journal
-are missing/damaged, they will be skipped.
+This command writes all inodes and dentries recoverable from the journal into
+the backing store, but only if these inodes and dentries are higher-versioned
+than the existing contents of the backing store. Any regions of the journal
+that are missing or damaged will be skipped.
-Note that in addition to writing out dentries and inodes, this command will update
-the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
-are now in use. In simple cases, this will result in an entirely valid backing
+In addition to writing out dentries and inodes, this command updates the
+InoTables of each ``in`` MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
store state.
.. warning::
- The resulting state of the backing store is not guaranteed to be self-consistent,
- and an online MDS scrub will be required afterwards. The journal contents
- will not be modified by this command, you should truncate the journal
+ The resulting state of the backing store is not guaranteed to be
+ self-consistent, and an online MDS scrub will be required afterwards. The
+ journal contents will not be modified by this command. Truncate the journal
separately after recovering what you can.
Journal truncation
diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
index 0f171c6ccc9..7aa1f2e44ee 100644
--- a/doc/cephfs/health-messages.rst
+++ b/doc/cephfs/health-messages.rst
@@ -269,3 +269,11 @@ other daemons, please see :ref:`health-checks`.
To evict and permanently block broken clients from connecting to the
cluster, set the ``required_client_feature`` bit ``client_mds_auth_caps``.
+
+``MDS_ESTIMATED_REPLAY_TIME``
+-----------------------------
+  Message
+    "HEALTH_WARN Replay: x% complete. Estimated time remaining *x* seconds"
+
+  Description
+    When an MDS journal replay takes longer than 30 seconds, this message
+    indicates the estimated time to completion.
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 57ea336c00b..630d29f1956 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -93,6 +93,7 @@ Administration
CephFS Top Utility <cephfs-top>
Scheduled Snapshots <snap-schedule>
CephFS Snapshot Mirroring <cephfs-mirroring>
+ Purge Queue <purge-queue>
.. raw:: html
@@ -147,6 +148,7 @@ CephFS Concepts
LazyIO <lazyio>
Directory fragmentation <dirfrags>
Multiple active MDS daemons <multimds>
+ Snapshots <snapshots>
.. raw:: html
diff --git a/doc/cephfs/purge-queue.rst b/doc/cephfs/purge-queue.rst
new file mode 100644
index 00000000000..d7a68e7fa55
--- /dev/null
+++ b/doc/cephfs/purge-queue.rst
@@ -0,0 +1,106 @@
+============
+Purge Queue
+============
+
+The MDS maintains a data structure known as the **Purge Queue**, which is
+responsible for managing and executing the parallel deletion of files.
+There is one purge queue for every MDS rank. Purge queues consist of purge
+items, which contain nominal information from the inodes, such as the size and
+the layout (that is, all other unneeded metadata information is discarded,
+which makes the purge queue independent of all metadata structures).
+
+Deletion process
+================
+
+When a client requests the deletion of a directory (for example, with
+``rm -rf``), the following steps take place (a short sketch follows this
+list):
+
+- The MDS queues the files and subdirectories (purge items) from the purge
+  queue (pq) journal into the purge queue.
+- The MDS processes the deletion of the inodes in the background, in small and
+  manageable chunks.
+- The MDS instructs the underlying OSDs to clean up the associated objects in
+  the data pool.
+- The MDS updates the journal.
+
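+For example, after a large recursive delete, space is reclaimed
+asynchronously: the data pool usage reported by ``ceph df`` shrinks only as
+the purge queue drains (``/mnt/cephfs`` is an assumed mount point)::
+
+    $ rm -rf /mnt/cephfs/big-directory
+    $ ceph df    # data pool usage decreases gradually as purges complete
+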
+.. note:: If files are deleted more quickly than the purge queue can process
+          them, then data pool usage might increase substantially over time.
+          In extreme scenarios, the purge queue backlog can become so large
+          that it slows the reclamation of capacity, and the Linux ``du``
+          command run against CephFS might report usage that is inconsistent
+          with the actual usage of the CephFS data pool.
+
+There are a few tunable configs that MDS uses internally to throttle purge
+queue processing:
+
+.. confval:: filer_max_purge_ops
+.. confval:: mds_max_purge_files
+.. confval:: mds_max_purge_ops
+.. confval:: mds_max_purge_ops_per_pg
+
+Generally, the defaults are adequate for most clusters. However, on very large
+clusters, if ``pq_item_in_journal`` (the counter of items pending deletion)
+reaches a very large value, then the configs can be tuned to 4-5 times their
+default values as a starting point. Further increases should be driven by the
+observed workload.
+
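+To inspect the current values before changing them, the same ``ceph config``
+interface can be used (a sketch; ``get`` mirrors the ``set`` commands shown
+below)::
+
+    $ ceph config get mds filer_max_purge_ops
+    $ ceph config get mds mds_max_purge_ops
+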
+Start with the simplest config, ``filer_max_purge_ops``, which should help
+reclaim space more quickly::
+
+ $ ceph config set mds filer_max_purge_ops 40
+
+Increasing ``filer_max_purge_ops`` should be sufficient for most clusters. If
+it is not, continue by tuning the other configs::
+
+ $ ceph config set mds mds_max_purge_files 256
+ $ ceph config set mds mds_max_purge_ops 32768
+ $ ceph config set mds mds_max_purge_ops_per_pg 2
+
+.. note:: Setting these values will not immediately break anything. However,
+          they control how many delete operations are issued to the underlying
+          RADOS cluster, so setting them excessively high can consume cluster
+          performance.
+
+.. note:: The purge queue does not automatically tune its work limits against
+          the current workload. Make a conscious decision when tuning these
+          configs, taking the cluster size and the workload into account.
+
+Examining purge queue perf counters
+===================================
+
+When analysing MDS perf dumps, the purge queue statistics look like::
+
+ "purge_queue": {
+ "pq_executing_ops": 56655,
+ "pq_executing_ops_high_water": 65350,
+ "pq_executing": 1,
+ "pq_executing_high_water": 3,
+ "pq_executed": 25,
+ "pq_item_in_journal": 6567004
+ }
+
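+Such a dump can be obtained from the MDS admin socket. In this sketch,
+``mds.a`` is an assumed daemon name, and the optional ``purge_queue`` argument
+limits the output to this section::
+
+    $ ceph daemon mds.a perf dump purge_queue
+
+On recent releases, the same counters are also available remotely via
+``ceph tell mds.a perf dump``.
+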
+The following table describes what each of these counters means:
+
+.. list-table::
+ :widths: 50 50
+ :header-rows: 1
+
+ * - Name
+ - Description
+ * - pq_executing_ops
+ - Purge queue operations in flight
+ * - pq_executing_ops_high_water
+ - Maximum number of executing purge operations recorded
+ * - pq_executing
+ - Purge queue files being deleted
+ * - pq_executing_high_water
+ - Maximum number of executing file purges
+ * - pq_executed
+ - Purge queue files deleted
+ * - pq_item_in_journal
+ - Purge items (files) left in journal
+
+.. note:: ``pq_executing`` and ``pq_executing_ops`` might look similar, but
+          there is a small nuance: ``pq_executing`` tracks the number of files
+          in the purge queue, while ``pq_executing_ops`` is the count of RADOS
+          objects from all of the files in the purge queue.
diff --git a/doc/cephfs/snap-schedule.rst b/doc/cephfs/snap-schedule.rst
index a94d938040f..48e79047864 100644
--- a/doc/cephfs/snap-schedule.rst
+++ b/doc/cephfs/snap-schedule.rst
@@ -197,6 +197,15 @@ this happens, the next snapshot will be scheduled as if the previous one was not
delayed, i.e. one or more delayed snapshots will not cause drift in the overall
schedule.
+If a volume is deleted while snapshot schedules are active on it, then Python
+tracebacks might appear in the log file or on the command line when commands
+are run against that volume. Although measures have been taken to note the
+fs_map changes, delete active timers, and close database connections in order
+to avoid such tracebacks, they cannot be silenced completely due to the
+inherent nature of the problem. If such tracebacks are seen, the only way to
+return the system to a stable state is to disable and then re-enable the
+snap_schedule Manager module.
+
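+For example, the module can be disabled and re-enabled with the standard
+Manager module commands::
+
+    ceph mgr module disable snap_schedule
+    ceph mgr module enable snap_schedule
+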
In order to somewhat limit the overall number of snapshots in a file system, the
module will only keep a maximum of 50 snapshots per directory. If the retention
policy results in more than 50 retained snapshots, the retention list will be
diff --git a/doc/cephfs/snapshots.rst b/doc/cephfs/snapshots.rst
new file mode 100644
index 00000000000..a60be96ed53
--- /dev/null
+++ b/doc/cephfs/snapshots.rst
@@ -0,0 +1,85 @@
+================
+CephFS Snapshots
+================
+
+CephFS snapshots create an immutable view of the file system at the point in
+time at which they are taken. CephFS supports snapshots, which are managed in
+a special hidden subdirectory named ``.snap``. Snapshots are created by
+running ``mkdir`` inside this directory.
+
+Snapshots can be exposed under a different name by changing the following
+client configurations (a usage sketch follows this list):
+
+- ``snapdirname``, which is a mount option for kernel clients
+- ``client_snapdir``, which is a mount option for ceph-fuse
+
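+A sketch of how these might be set (the mount point, the credentials, and the
+alternative name ``.snapshots`` are illustrative placeholders):
+
+.. code-block:: bash
+
+   # Kernel client: expose snapshots under ".snapshots" instead of ".snap"
+   $ mount -t ceph :/ /mnt/cephfs -o name=admin,snapdirname=.snapshots
+
+   # ceph-fuse: the equivalent client option
+   $ ceph-fuse /mnt/cephfs --client_snapdir=.snapshots
+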
+Snapshot Creation
+==================
+
+The CephFS snapshot feature is enabled by default on new file systems. To
+enable it on an existing file system, use the command below.
+
+.. code-block:: bash
+
+ $ ceph fs set <fs_name> allow_new_snaps true
+
+When snapshots are enabled, all directories in CephFS have a special ``.snap``
+directory. (You can configure a different name with the client snapdir
+settings described above.)
+To create a CephFS snapshot, create a subdirectory under ``.snap`` with a name
+of your choice. For example, to create a snapshot of the directory
+``/file1/``, run ``mkdir /file1/.snap/snapshot-name``. The following sequence
+creates a file in the current directory and then takes a snapshot of that
+directory:
+
+.. code-block:: bash
+
+ $ touch file1
+ $ cd .snap
+ $ mkdir my_snapshot
+
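+The snapshots that exist for a directory can be listed by reading its
+``.snap`` subdirectory (continuing the example above; each entry is one
+snapshot of the directory):
+
+.. code-block:: bash
+
+   $ ls .snap
+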
+Using snapshots to recover data
+===============================
+
+Snapshots can also be used to recover some deleted files.
+
+- Create a file ``file1`` and then create snapshot ``snap1``:
+
+.. code-block:: bash
+
+   $ touch /mnt/cephfs/file1
+   $ cd /mnt/cephfs/.snap
+   $ mkdir snap1
+
+- Create a second file ``file2`` and then create snapshot ``snap2``:
+
+.. code-block:: bash
+
+   $ touch /mnt/cephfs/file2
+   $ cd /mnt/cephfs/.snap
+   $ mkdir snap2
+
+- Delete ``file1`` and create a new snapshot ``snap3``:
+
+.. code-block:: bash
+
+   $ rm /mnt/cephfs/file1
+   $ cd /mnt/cephfs/.snap
+   $ mkdir snap3
+
+- Recover ``file1`` from snapshot ``snap2`` by using the ``cp`` command:
+
+.. code-block:: bash
+
+   $ cd /mnt/cephfs/.snap/snap2
+   $ cp file1 /mnt/cephfs/
+
+Snapshot Deletion
+==================
+
+Snapshots are deleted by invoking ``rmdir`` on the snapshot directory under
+``.snap``. (Attempts to delete a directory that roots snapshots will fail; you
+must delete the snapshots first.)
+
+.. code-block:: bash
+
+ $ cd .snap
+ $ rmdir my_snapshot