Diffstat (limited to 'doc/cephfs')
-rw-r--r-- | doc/cephfs/disaster-recovery-experts.rst |  41
-rw-r--r-- | doc/cephfs/health-messages.rst           |   8
-rw-r--r-- | doc/cephfs/index.rst                     |   2
-rw-r--r-- | doc/cephfs/purge-queue.rst               | 106
-rw-r--r-- | doc/cephfs/snap-schedule.rst             |   9
-rw-r--r-- | doc/cephfs/snapshots.rst                 |  85
6 files changed, 232 insertions, 19 deletions
diff --git a/doc/cephfs/disaster-recovery-experts.rst b/doc/cephfs/disaster-recovery-experts.rst
index 7677b42f47e..b01a3dfde6a 100644
--- a/doc/cephfs/disaster-recovery-experts.rst
+++ b/doc/cephfs/disaster-recovery-experts.rst
@@ -21,43 +21,46 @@ Advanced: Metadata repair tools
 Journal export
 --------------
 
-Before attempting dangerous operations, make a copy of the journal like so:
+Before attempting any dangerous operation, make a copy of the journal by
+running the following command:
 
-::
+.. prompt:: bash #
 
-    cephfs-journal-tool journal export backup.bin
+   cephfs-journal-tool journal export backup.bin
 
-Note that this command may not always work if the journal is badly corrupted,
-in which case a RADOS-level copy should be made (http://tracker.ceph.com/issues/9902).
+If the journal is badly corrupted, this command might not work. In that case,
+make a RADOS-level copy
+(http://tracker.ceph.com/issues/9902).
 
 Dentry recovery from journal
 ----------------------------
 
 If a journal is damaged or for any reason an MDS is incapable of replaying it,
-attempt to recover what file metadata we can like so:
+attempt to recover file metadata by running the following command:
 
-::
+.. prompt:: bash #
 
-    cephfs-journal-tool event recover_dentries summary
+   cephfs-journal-tool event recover_dentries summary
 
-This command by default acts on MDS rank 0, pass --rank=<n> to operate on other ranks.
+By default, this command acts on MDS rank ``0``. Pass the option ``--rank=<n>``
+to the ``cephfs-journal-tool`` command to operate on other ranks.
 
-This command will write any inodes/dentries recoverable from the journal
-into the backing store, if these inodes/dentries are higher-versioned
-than the previous contents of the backing store. If any regions of the journal
-are missing/damaged, they will be skipped.
+This command writes all inodes and dentries recoverable from the journal into
+the backing store, but only if these inodes and dentries are higher-versioned
+than the existing contents of the backing store. Any regions of the journal
+that are missing or damaged will be skipped.
 
-Note that in addition to writing out dentries and inodes, this command will update
-the InoTables of each 'in' MDS rank, to indicate that any written inodes' numbers
-are now in use. In simple cases, this will result in an entirely valid backing
+In addition to writing out dentries and inodes, this command updates the
+InoTables of each ``in`` MDS rank, to indicate that any written inodes' numbers
+are now in use. In simple cases, this will result in an entirely valid backing
 store state.
 
 .. warning::
 
-    The resulting state of the backing store is not guaranteed to be self-consistent,
-    and an online MDS scrub will be required afterwards. The journal contents
-    will not be modified by this command, you should truncate the journal
+    The resulting state of the backing store is not guaranteed to be
+    self-consistent, and an online MDS scrub will be required afterwards. The
+    journal contents will not be modified by this command. Truncate the journal
     separately after recovering what you can.
 
 Journal truncation
diff --git a/doc/cephfs/health-messages.rst b/doc/cephfs/health-messages.rst
index 0f171c6ccc9..7aa1f2e44ee 100644
--- a/doc/cephfs/health-messages.rst
+++ b/doc/cephfs/health-messages.rst
@@ -269,3 +269,11 @@ other daemons, please see :ref:`health-checks`.
 
     To evict and permanently block broken clients from connecting to the
    cluster, set the ``required_client_feature`` bit ``client_mds_auth_caps``.
+
+``MDS_ESTIMATED_REPLAY_TIME``
+-----------------------------
+  Message
+    "HEALTH_WARN Replay: x% complete. Estimated time remaining *x* seconds"
+
+  Description
+    When an MDS journal replay takes more than 30 seconds, this message
+    indicates the estimated time to completion.
diff --git a/doc/cephfs/index.rst b/doc/cephfs/index.rst
index 57ea336c00b..630d29f1956 100644
--- a/doc/cephfs/index.rst
+++ b/doc/cephfs/index.rst
@@ -93,6 +93,7 @@ Administration
    CephFS Top Utility <cephfs-top>
    Scheduled Snapshots <snap-schedule>
    CephFS Snapshot Mirroring <cephfs-mirroring>
+   Purge Queue <purge-queue>
 
 .. raw:: html
 
@@ -147,6 +148,7 @@ CephFS Concepts
    LazyIO <lazyio>
    Directory fragmentation <dirfrags>
    Multiple active MDS daemons <multimds>
+   Snapshots <snapshots>
 
 .. raw:: html
 
diff --git a/doc/cephfs/purge-queue.rst b/doc/cephfs/purge-queue.rst
new file mode 100644
index 00000000000..d7a68e7fa55
--- /dev/null
+++ b/doc/cephfs/purge-queue.rst
@@ -0,0 +1,106 @@
+============
+Purge Queue
+============
+
+The MDS maintains a data structure known as the **Purge Queue**, which is
+responsible for managing and executing the parallel deletion of files.
+There is a purge queue for every MDS rank. A purge queue consists of purge
+items, which contain only nominal information from the inodes, such as the
+size and the layout (that is, all other unneeded metadata is discarded,
+making the purge queue independent of all metadata structures).
+
+Deletion process
+================
+
+When a client requests deletion of a directory (say ``rm -rf``), the MDS:
+
+- queues the files and subdirectories (purge items) in the purge queue
+  journal;
+- processes the deletion of the inodes in the background, in small and
+  manageable chunks;
+- instructs the underlying OSDs to clean up the associated objects in the
+  data pool;
+- updates the journal.
+
+.. note:: If users delete files more quickly than the purge queue can
+          process them, then data pool usage might increase substantially
+          over time. In extreme scenarios, the purge queue backlog can become
+          so large that it slows down capacity reclamation, and the Linux
+          ``du`` command for CephFS might report data that is inconsistent
+          with the usage of the CephFS data pool.
+
+There are a few tunable configuration options that the MDS uses internally
+to throttle purge queue processing:
+
+.. confval:: filer_max_purge_ops
+.. confval:: mds_max_purge_files
+.. confval:: mds_max_purge_ops
+.. confval:: mds_max_purge_ops_per_pg
+
+Generally, the defaults are adequate for most clusters. However, on very
+large clusters, if ``pq_item_in_journal`` (the counter of items pending
+deletion) reaches a very large figure, then these options can be tuned to
+4-5 times their default values as a starting point; further increases
+depend on requirements.
+
+Start with the simplest option, ``filer_max_purge_ops``, which should help
+reclaim space more quickly::
+
+    $ ceph config set mds filer_max_purge_ops 40
+
+Increasing ``filer_max_purge_ops`` is enough for most clusters, but if it is
+not, move on to tuning the other options::
+
+    $ ceph config set mds mds_max_purge_files 256
+    $ ceph config set mds mds_max_purge_ops 32768
+    $ ceph config set mds mds_max_purge_ops_per_pg 2
+
+.. note:: Setting these values will not immediately break anything: they
+          only control how many delete operations are issued to the
+          underlying RADOS cluster. However, they might consume some cluster
+          performance if they are set very high.
+
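+A quick way to keep an eye on these settings and on the backlog itself is to
+read the values back and to watch the purge queue counters while deletions
+are in progress. A minimal sketch (``mds.<id>`` is a placeholder for one of
+your MDS daemons; the trailing ``purge_queue`` argument, which filters the
+perf dump output, can be omitted to see all counters)::
+
+    $ ceph config get mds filer_max_purge_ops
+    $ ceph config get mds mds_max_purge_ops
+    $ ceph tell mds.<id> perf dump purge_queue
+
+The counters reported by the last command are described in the next section.
+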
+.. note:: The purge queue does not auto-tune its work limits against the
+          amount of work that is pending. It is therefore advised to make a
+          conscious decision when tuning these options, based on the cluster
+          size and the workload.
+
+Examining purge queue perf counters
+===================================
+
+When analysing MDS perf dumps, the purge queue statistics look like this::
+
+    "purge_queue": {
+        "pq_executing_ops": 56655,
+        "pq_executing_ops_high_water": 65350,
+        "pq_executing": 1,
+        "pq_executing_high_water": 3,
+        "pq_executed": 25,
+        "pq_item_in_journal": 6567004
+    }
+
+These counters have the following meanings:
+
+.. list-table::
+   :widths: 50 50
+   :header-rows: 1
+
+   * - Name
+     - Description
+   * - pq_executing_ops
+     - Purge queue operations in flight
+   * - pq_executing_ops_high_water
+     - Maximum number of executing purge operations recorded
+   * - pq_executing
+     - Purge queue files being deleted
+   * - pq_executing_high_water
+     - Maximum number of file purges executing at one time
+   * - pq_executed
+     - Purge queue files deleted
+   * - pq_item_in_journal
+     - Purge items (files) left in the journal
+
+.. note:: ``pq_executing`` and ``pq_executing_ops`` might look similar, but
+          there is a small nuance: ``pq_executing`` tracks the number of
+          files in the purge queue, while ``pq_executing_ops`` is the count
+          of RADOS objects across all the files in the purge queue.
diff --git a/doc/cephfs/snap-schedule.rst b/doc/cephfs/snap-schedule.rst
index a94d938040f..48e79047864 100644
--- a/doc/cephfs/snap-schedule.rst
+++ b/doc/cephfs/snap-schedule.rst
@@ -197,6 +197,15 @@ this happens, the next snapshot will be schedule as if the previous one was not
 delayed, i.e. one or more delayed snapshots will not cause drift in the overall
 schedule.
 
+If a volume is deleted while snapshot schedules are active on it, then Python
+tracebacks might appear in the log file or on the command line when commands
+are executed on such volumes. Although measures have been taken to take note
+of fs_map changes, delete active timers and close database connections in
+order to avoid Python tracebacks, it is not possible to suppress them
+completely due to the inherent nature of the problem. If such tracebacks are
+seen, the only way to bring the system back to a stable state is to disable
+and then re-enable the snap_schedule Manager Module.
+
 In order to somewhat limit the overall number of snapshots in a file system, the
 module will only keep a maximum of 50 snapshots per directory. If the retention
 policy results in more then 50 retained snapshots, the retention list will be
diff --git a/doc/cephfs/snapshots.rst b/doc/cephfs/snapshots.rst
new file mode 100644
index 00000000000..a60be96ed53
--- /dev/null
+++ b/doc/cephfs/snapshots.rst
@@ -0,0 +1,85 @@
+================
+CephFS Snapshots
+================
+
+CephFS snapshots create an immutable view of the file system at the point in
+time at which they are taken. CephFS snapshots are managed in a special
+hidden subdirectory named ``.snap``. Snapshots are created with ``mkdir``
+inside this directory.
+
+Snapshots can be exposed under a different name by changing the following
+client configuration options, as shown in the example below:
+
+- ``snapdirname``, which is a mount option for kernel clients
+- ``client_snapdir``, which is a mount option for ceph-fuse
+
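+For example, a kernel client mount and a ceph-fuse mount that expose the
+snapshot directory as ``.snapshots`` might look roughly like this (this is a
+sketch; the mount point and the chosen directory name are placeholders, and
+the usual authentication options still apply):
+
+.. code-block:: bash
+
+    # kernel client: snapdirname is passed as a mount option
+    $ mount -t ceph :/ /mnt/cephfs -o name=admin,snapdirname=.snapshots
+
+    # ceph-fuse: client_snapdir is passed as a client configuration option
+    $ ceph-fuse /mnt/cephfs --client_snapdir=.snapshots
+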
+Snapshot Creation
+=================
+
+The CephFS snapshot feature is enabled by default on new file systems. To
+enable it on existing file systems, use the command below.
+
+.. code-block:: bash
+
+    $ ceph fs set <fs_name> allow_new_snaps true
+
+When snapshots are enabled, all directories in CephFS will have a special
+``.snap`` directory. (You may configure a different name with the client
+snapdir setting if you wish.)
+To create a CephFS snapshot, create a subdirectory under ``.snap`` with a
+name of your choice. For example, to create a snapshot of the directory
+``/mnt/cephfs``, invoke ``mkdir /mnt/cephfs/.snap/snapshot-name``. The
+following commands do the same from inside the mounted directory:
+
+.. code-block:: bash
+
+    $ touch file1
+    $ cd .snap
+    $ mkdir my_snapshot
+
+Using snapshots to recover data
+===============================
+
+Snapshots can also be used to recover deleted files.
+
+- Create ``file1`` and then create snapshot ``snap1``:
+
+.. code-block:: bash
+
+    $ touch /mnt/cephfs/file1
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap1
+
+- Create ``file2`` and then create snapshot ``snap2``:
+
+.. code-block:: bash
+
+    $ touch /mnt/cephfs/file2
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap2
+
+- Delete ``file1`` and then create a new snapshot ``snap3``:
+
+.. code-block:: bash
+
+    $ rm /mnt/cephfs/file1
+    $ cd /mnt/cephfs/.snap
+    $ mkdir snap3
+
+- Recover ``file1`` from snapshot ``snap2`` using the ``cp`` command:
+
+.. code-block:: bash
+
+    $ cd /mnt/cephfs/.snap/snap2
+    $ cp file1 /mnt/cephfs/
+
+Snapshot Deletion
+=================
+
+Snapshots are deleted by invoking ``rmdir`` on the snapshot's directory under
+``.snap``. (An attempt to delete a directory that still roots snapshots will
+fail; you must delete the snapshots first.)
+
+.. code-block:: bash
+
+    $ cd .snap
+    $ rmdir my_snapshot
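+
+The snapshots created in the recovery example above can be removed in the
+same way. For example, assuming they are still present under
+``/mnt/cephfs/.snap``:
+
+.. code-block:: bash
+
+    $ cd /mnt/cephfs/.snap
+    $ rmdir snap1 snap2 snap3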