Diffstat (limited to 'Documentation/filesystems')
-rw-r--r--  Documentation/filesystems/caching/backend-api.rst        |   2
-rw-r--r--  Documentation/filesystems/ext2.rst                       |   2
-rw-r--r--  Documentation/filesystems/ext4/blockmap.rst              |   2
-rw-r--r--  Documentation/filesystems/ext4/super.rst                 |   6
-rw-r--r--  Documentation/filesystems/f2fs.rst                       |  23
-rw-r--r--  Documentation/filesystems/fscrypt.rst                    |  22
-rw-r--r--  Documentation/filesystems/fsverity.rst                   |  53
-rw-r--r--  Documentation/filesystems/fuse.rst                       |  29
-rw-r--r--  Documentation/filesystems/idmappings.rst                 |   2
-rw-r--r--  Documentation/filesystems/locking.rst                    |   9
-rw-r--r--  Documentation/filesystems/netfs_library.rst              |   8
-rw-r--r--  Documentation/filesystems/overlayfs.rst                  |   2
-rw-r--r--  Documentation/filesystems/porting.rst                    |  19
-rw-r--r--  Documentation/filesystems/proc.rst                       |  11
-rw-r--r--  Documentation/filesystems/qnx6.rst                       |   2
-rw-r--r--  Documentation/filesystems/spufs/spufs.rst                |   2
-rw-r--r--  Documentation/filesystems/vfs.rst                        |  65
-rw-r--r--  Documentation/filesystems/xfs-delayed-logging-design.rst | 371
 18 files changed, 495 insertions(+), 135 deletions(-)
diff --git a/Documentation/filesystems/caching/backend-api.rst b/Documentation/filesystems/caching/backend-api.rst
index d7507becf674..3a199fc50828 100644
--- a/Documentation/filesystems/caching/backend-api.rst
+++ b/Documentation/filesystems/caching/backend-api.rst
@@ -122,7 +122,7 @@ volumes, calling::
to tell fscache that a volume has been withdrawn. This waits for all
outstanding accesses on the volume to complete before returning.
-When the the cache is completely withdrawn, fscache should be notified by
+When the cache is completely withdrawn, fscache should be notified by
calling::
void fscache_relinquish_cache(struct fscache_cache *cache);
diff --git a/Documentation/filesystems/ext2.rst b/Documentation/filesystems/ext2.rst
index 154101cf0e4f..92aae683e16a 100644
--- a/Documentation/filesystems/ext2.rst
+++ b/Documentation/filesystems/ext2.rst
@@ -59,8 +59,6 @@ acl Enable POSIX Access Control Lists support
(requires CONFIG_EXT2_FS_POSIX_ACL).
noacl Don't support POSIX ACLs.
-nobh Do not attach buffer_heads to file pagecache.
-
quota, usrquota Enable user disk quota support
(requires CONFIG_QUOTA).
diff --git a/Documentation/filesystems/ext4/blockmap.rst b/Documentation/filesystems/ext4/blockmap.rst
index 2bd990402a5c..cc596541ce79 100644
--- a/Documentation/filesystems/ext4/blockmap.rst
+++ b/Documentation/filesystems/ext4/blockmap.rst
@@ -1,7 +1,7 @@
.. SPDX-License-Identifier: GPL-2.0
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
-| i.i_block Offset | Where It Points |
+| i.i_block Offset | Where It Points |
+=====================+==============================================================================================================================================================================================================================+
| 0 to 11 | Direct map to file blocks 0 to 11. |
+---------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
diff --git a/Documentation/filesystems/ext4/super.rst b/Documentation/filesystems/ext4/super.rst
index 268888522e35..0152888cac29 100644
--- a/Documentation/filesystems/ext4/super.rst
+++ b/Documentation/filesystems/ext4/super.rst
@@ -456,15 +456,15 @@ The ext4 superblock is laid out as follows in
* - 0x277
- __u8
- s_lastcheck_hi
- - Upper 8 bits of the s_lastcheck_hi field.
+ - Upper 8 bits of the s_lastcheck field.
* - 0x278
- __u8
- s_first_error_time_hi
- - Upper 8 bits of the s_first_error_time_hi field.
+ - Upper 8 bits of the s_first_error_time field.
* - 0x279
- __u8
- s_last_error_time_hi
- - Upper 8 bits of the s_last_error_time_hi field.
+ - Upper 8 bits of the s_last_error_time field.
* - 0x27A
- __u8
- s_pad[2]
diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst
index ad8dc8c040a2..17df9a02ccff 100644
--- a/Documentation/filesystems/f2fs.rst
+++ b/Documentation/filesystems/f2fs.rst
@@ -286,9 +286,8 @@ compress_algorithm=%s:%d Control compress algorithm and its compress level, now,
algorithm level range
lz4 3 - 16
zstd 1 - 22
-compress_log_size=%u Support configuring compress cluster size, the size will
- be 4KB * (1 << %u), 16KB is minimum size, also it's
- default size.
+compress_log_size=%u Support configuring compress cluster size. The size will
+ be 4KB * (1 << %u). The default and minimum sizes are 16KB.
compress_extension=%s Support adding specified extension, so that f2fs can enable
compression on those corresponding files, e.g. if all files
with '.ext' has high compression rate, we can set the '.ext'
@@ -336,6 +335,11 @@ discard_unit=%s Control discard unit, the argument can be "block", "segment"
default, it is helpful for large sized SMR or ZNS devices to
reduce memory cost by getting rid of fs metadata supports small
discard.
+memory=%s Control memory mode. This supports "normal" and "low" modes.
+ "low" mode is introduced to support low memory devices.
+ Because of the nature of low memory devices, in this mode, f2fs
+ will sometimes sacrifice performance in order to save memory.
+ "normal" mode is the default mode and same as before.
======================== ============================================================
Debugfs Entries
@@ -818,10 +822,11 @@ Compression implementation
Instead, the main goal is to reduce data writes to flash disk as much as
possible, resulting in extending disk life time as well as relaxing IO
congestion. Alternatively, we've added ioctl(F2FS_IOC_RELEASE_COMPRESS_BLOCKS)
- interface to reclaim compressed space and show it to user after putting the
- immutable bit. Immutable bit, after release, it doesn't allow writing/mmaping
- on the file, until reserving compressed space via
- ioctl(F2FS_IOC_RESERVE_COMPRESS_BLOCKS) or truncating filesize to zero.
+ interface to reclaim compressed space and show it to the user after setting a
+ special flag on the inode. Once the compressed space is released, the flag
+ will block writing data to the file until either the compressed space is
+ reserved via ioctl(F2FS_IOC_RESERVE_COMPRESS_BLOCKS) or the file size is
+ truncated to zero.
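As a purely illustrative sketch of the release/reserve flow described above
(assuming the ``F2FS_IOC_RELEASE_COMPRESS_BLOCKS`` and
``F2FS_IOC_RESERVE_COMPRESS_BLOCKS`` ioctls exported by ``<linux/f2fs.h>``;
error handling kept minimal), a user-space caller might do::

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>
    #include <sys/ioctl.h>
    #include <linux/types.h>
    #include <linux/f2fs.h>

    int main(int argc, char **argv)
    {
            __u64 blocks = 0;
            int fd;

            if (argc < 2)
                    return 1;
            fd = open(argv[1], O_RDONLY);
            if (fd < 0)
                    return 1;

            /* Release the saved compressed space; writes are now blocked. */
            if (ioctl(fd, F2FS_IOC_RELEASE_COMPRESS_BLOCKS, &blocks) == 0)
                    printf("released %llu blocks\n", (unsigned long long)blocks);

            /* Reserve the space again so the file becomes writable. */
            if (ioctl(fd, F2FS_IOC_RESERVE_COMPRESS_BLOCKS, &blocks) == 0)
                    printf("reserved %llu blocks\n", (unsigned long long)blocks);

            close(fd);
            return 0;
    }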
Compress metadata layout::
@@ -830,12 +835,12 @@ Compress metadata layout::
| cluster 1 | cluster 2 | ......... | cluster N |
+-----------------------------------------------+
. . . .
- . . . .
+ . . . .
. Compressed Cluster . . Normal Cluster .
+----------+---------+---------+---------+ +---------+---------+---------+---------+
|compr flag| block 1 | block 2 | block 3 | | block 1 | block 2 | block 3 | block 4 |
+----------+---------+---------+---------+ +---------+---------+---------+---------+
- . .
+ . .
. .
. .
+-------------+-------------+----------+----------------------------+
diff --git a/Documentation/filesystems/fscrypt.rst b/Documentation/filesystems/fscrypt.rst
index 2e9aaa295125..5ba5817c17c2 100644
--- a/Documentation/filesystems/fscrypt.rst
+++ b/Documentation/filesystems/fscrypt.rst
@@ -337,6 +337,7 @@ Currently, the following pairs of encryption modes are supported:
- AES-256-XTS for contents and AES-256-CTS-CBC for filenames
- AES-128-CBC for contents and AES-128-CTS-CBC for filenames
- Adiantum for both contents and filenames
+- AES-256-XTS for contents and AES-256-HCTR2 for filenames (v2 policies only)
If unsure, you should use the (AES-256-XTS, AES-256-CTS-CBC) pair.
@@ -357,6 +358,17 @@ To use Adiantum, CONFIG_CRYPTO_ADIANTUM must be enabled. Also, fast
implementations of ChaCha and NHPoly1305 should be enabled, e.g.
CONFIG_CRYPTO_CHACHA20_NEON and CONFIG_CRYPTO_NHPOLY1305_NEON for ARM.
+AES-256-HCTR2 is another true wide-block encryption mode that is intended for
+use on CPUs with dedicated crypto instructions. AES-256-HCTR2 has the property
+that a bitflip in the plaintext changes the entire ciphertext. This property
+makes it desirable for filename encryption since initialization vectors are
+reused within a directory. For more details on AES-256-HCTR2, see the paper
+"Length-preserving encryption with HCTR2"
+(https://eprint.iacr.org/2021/1441.pdf). To use AES-256-HCTR2,
+CONFIG_CRYPTO_HCTR2 must be enabled. Also, fast implementations of XCTR and
+POLYVAL should be enabled, e.g. CRYPTO_POLYVAL_ARM64_CE and
+CRYPTO_AES_ARM64_CE_BLK for ARM64.
+
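As a rough user-space sketch of selecting this mode pair (assuming a
``<linux/fscrypt.h>`` new enough to define ``FSCRYPT_MODE_AES_256_HCTR2``,
and a ``key_id`` previously returned by ``FS_IOC_ADD_ENCRYPTION_KEY``), a v2
policy could be applied like this::

    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/fscrypt.h>

    static int set_xts_hctr2_policy(int dirfd, const __u8 *key_id)
    {
            struct fscrypt_policy_v2 policy = {
                    .version = FSCRYPT_POLICY_V2,
                    .contents_encryption_mode = FSCRYPT_MODE_AES_256_XTS,
                    .filenames_encryption_mode = FSCRYPT_MODE_AES_256_HCTR2,
                    .flags = FSCRYPT_POLICY_FLAGS_PAD_32,
            };

            memcpy(policy.master_key_identifier, key_id,
                   FSCRYPT_KEY_IDENTIFIER_SIZE);
            /* Apply the policy to an (empty) directory opened as dirfd. */
            return ioctl(dirfd, FS_IOC_SET_ENCRYPTION_POLICY, &policy);
    }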
New encryption modes can be added relatively easily, without changes
to individual filesystems. However, authenticated encryption (AE)
modes are not currently supported because of the difficulty of dealing
@@ -404,11 +416,11 @@ alternatively has the file's nonce (for `DIRECT_KEY policies`_) or
inode number (for `IV_INO_LBLK_64 policies`_) included in the IVs.
Thus, IV reuse is limited to within a single directory.
-With CTS-CBC, the IV reuse means that when the plaintext filenames
-share a common prefix at least as long as the cipher block size (16
-bytes for AES), the corresponding encrypted filenames will also share
-a common prefix. This is undesirable. Adiantum does not have this
-weakness, as it is a wide-block encryption mode.
+With CTS-CBC, the IV reuse means that when the plaintext filenames share a
+common prefix at least as long as the cipher block size (16 bytes for AES), the
+corresponding encrypted filenames will also share a common prefix. This is
+undesirable. Adiantum and HCTR2 do not have this weakness, as they are
+wide-block encryption modes.
All supported filenames encryption modes accept any plaintext length
>= 16 bytes; cipher block alignment is not required. However,
diff --git a/Documentation/filesystems/fsverity.rst b/Documentation/filesystems/fsverity.rst
index 756f2c215ba1..cb8e7573882a 100644
--- a/Documentation/filesystems/fsverity.rst
+++ b/Documentation/filesystems/fsverity.rst
@@ -11,9 +11,9 @@ Introduction
fs-verity (``fs/verity/``) is a support layer that filesystems can
hook into to support transparent integrity and authenticity protection
-of read-only files. Currently, it is supported by the ext4 and f2fs
-filesystems. Like fscrypt, not too much filesystem-specific code is
-needed to support fs-verity.
+of read-only files. Currently, it is supported by the ext4, f2fs, and
+btrfs filesystems. Like fscrypt, not too much filesystem-specific
+code is needed to support fs-verity.
fs-verity is similar to `dm-verity
<https://www.kernel.org/doc/Documentation/device-mapper/verity.txt>`_
@@ -473,9 +473,9 @@ files being swapped around.
Filesystem support
==================
-fs-verity is currently supported by the ext4 and f2fs filesystems.
-The CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity
-on either filesystem.
+fs-verity is supported by several filesystems, described below. The
+CONFIG_FS_VERITY kconfig option must be enabled to use fs-verity on
+any of these filesystems.
``include/linux/fsverity.h`` declares the interface between the
``fs/verity/`` support layer and filesystems. Briefly, filesystems
@@ -544,6 +544,13 @@ Currently, f2fs verity only supports a Merkle tree block size of 4096.
Also, f2fs doesn't support enabling verity on files that currently
have atomic or volatile writes pending.
+btrfs
+-----
+
+btrfs supports fs-verity since Linux v5.15. Verity-enabled inodes are
+marked with a RO_COMPAT inode flag, and the verity metadata is stored
+in separate btree items.
+
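On any of these filesystems, user space turns verity on through the same
``FS_IOC_ENABLE_VERITY`` ioctl covered earlier in this document; a minimal
sketch (assuming ``<linux/fsverity.h>`` and the common SHA-256 / 4K
parameters) looks roughly like::

    #include <sys/ioctl.h>
    #include <linux/fsverity.h>

    /* The file must be fully written and opened O_RDONLY with no writers. */
    static int enable_verity(int fd)
    {
            struct fsverity_enable_arg arg = {
                    .version = 1,
                    .hash_algorithm = FS_VERITY_HASH_ALG_SHA256,
                    .block_size = 4096,
            };

            return ioctl(fd, FS_IOC_ENABLE_VERITY, &arg);
    }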
Implementation details
======================
@@ -622,14 +629,14 @@ workqueue, and then the workqueue work does the decryption or
verification. Finally, pages where no decryption or verity error
occurred are marked Uptodate, and the pages are unlocked.
-Files on ext4 and f2fs may contain holes. Normally, ``->readahead()``
-simply zeroes holes and sets the corresponding pages Uptodate; no bios
-are issued. To prevent this case from bypassing fs-verity, these
-filesystems use fsverity_verify_page() to verify hole pages.
+On many filesystems, files can contain holes. Normally,
+``->readahead()`` simply zeroes holes and sets the corresponding pages
+Uptodate; no bios are issued. To prevent this case from bypassing
+fs-verity, these filesystems use fsverity_verify_page() to verify hole
+pages.
-ext4 and f2fs disable direct I/O on verity files, since otherwise
-direct I/O would bypass fs-verity. (They also do the same for
-encrypted files.)
+Filesystems also disable direct I/O on verity files, since otherwise
+direct I/O would bypass fs-verity.
Userspace utility
=================
@@ -648,7 +655,7 @@ Tests
To test fs-verity, use xfstests. For example, using `kvm-xfstests
<https://github.com/tytso/xfstests-bld/blob/master/Documentation/kvm-quickstart.md>`_::
- kvm-xfstests -c ext4,f2fs -g verity
+ kvm-xfstests -c ext4,f2fs,btrfs -g verity
FAQ
===
@@ -771,15 +778,15 @@ weren't already directly answered in other parts of this document.
e.g. magically trigger construction of a Merkle tree.
:Q: Does fs-verity support remote filesystems?
-:A: Only ext4 and f2fs support is implemented currently, but in
- principle any filesystem that can store per-file verity metadata
- can support fs-verity, regardless of whether it's local or remote.
- Some filesystems may have fewer options of where to store the
- verity metadata; one possibility is to store it past the end of
- the file and "hide" it from userspace by manipulating i_size. The
- data verification functions provided by ``fs/verity/`` also assume
- that the filesystem uses the Linux pagecache, but both local and
- remote filesystems normally do so.
+:A: So far all filesystems that have implemented fs-verity support are
+ local filesystems, but in principle any filesystem that can store
+ per-file verity metadata can support fs-verity, regardless of
+ whether it's local or remote. Some filesystems may have fewer
+ options of where to store the verity metadata; one possibility is
+ to store it past the end of the file and "hide" it from userspace
+ by manipulating i_size. The data verification functions provided
+ by ``fs/verity/`` also assume that the filesystem uses the Linux
+ pagecache, but both local and remote filesystems normally do so.
:Q: Why is anything filesystem-specific at all? Shouldn't fs-verity
be implemented entirely at the VFS level?
diff --git a/Documentation/filesystems/fuse.rst b/Documentation/filesystems/fuse.rst
index 8120c3c0cb4e..1e31e87aee68 100644
--- a/Documentation/filesystems/fuse.rst
+++ b/Documentation/filesystems/fuse.rst
@@ -279,7 +279,7 @@ How are requirements fulfilled?
the filesystem or not.
Note that the *ptrace* check is not strictly necessary to
- prevent B/2/i, it is enough to check if mount owner has enough
+ prevent C/2/i, it is enough to check if mount owner has enough
privilege to send signal to the process accessing the
filesystem, since *SIGSTOP* can be used to get a similar effect.
@@ -288,10 +288,29 @@ I think these limitations are unacceptable?
If a sysadmin trusts the users enough, or can ensure through other
measures, that system processes will never enter non-privileged
-mounts, it can relax the last limitation with a 'user_allow_other'
-config option. If this config option is set, the mounting user can
-add the 'allow_other' mount option which disables the check for other
-users' processes.
+mounts, it can relax the last limitation in several ways:
+
+ - With the 'user_allow_other' config option. If this config option is
+ set, the mounting user can add the 'allow_other' mount option which
+ disables the check for other users' processes.
+
+ User namespaces have an unintuitive interaction with 'allow_other':
+ an unprivileged user - normally restricted from mounting with
+ 'allow_other' - could do so in a user namespace where they're
+ privileged. If any process could access such an 'allow_other' mount
+ this would give the mounting user the ability to manipulate
+ processes in user namespaces where they're unprivileged. For this
+ reason 'allow_other' restricts access to users in the same userns
+ or a descendant.
+
+ - With the 'allow_sys_admin_access' module option. If this option is
+ set, super user's processes have unrestricted access to mounts
+ irrespective of allow_other setting or user namespace of the
+ mounting user.
+
+Note that both of these relaxations expose the system to potential
+information leak or *DoS* as described in points B and C/2/i-ii in the
+preceding section.
Kernel - userspace interface
============================
diff --git a/Documentation/filesystems/idmappings.rst b/Documentation/filesystems/idmappings.rst
index c1db8748389c..b9b31066aef2 100644
--- a/Documentation/filesystems/idmappings.rst
+++ b/Documentation/filesystems/idmappings.rst
@@ -661,7 +661,7 @@ idmappings::
mount idmapping: u0:k10000:r10000
Assume a file owned by ``u1000`` is read from disk. The filesystem maps this id
-to ``k21000`` according to it's idmapping. This is what is stored in the
+to ``k21000`` according to its idmapping. This is what is stored in the
inode's ``i_uid`` and ``i_gid`` fields.
When the caller queries the ownership of this file via ``stat()`` the kernel
diff --git a/Documentation/filesystems/locking.rst b/Documentation/filesystems/locking.rst
index c0fe711f14d3..4bb2627026ec 100644
--- a/Documentation/filesystems/locking.rst
+++ b/Documentation/filesystems/locking.rst
@@ -252,9 +252,8 @@ prototypes::
bool (*release_folio)(struct folio *, gfp_t);
void (*free_folio)(struct folio *);
int (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- bool (*isolate_page) (struct page *, isolate_mode_t);
- int (*migratepage)(struct address_space *, struct page *, struct page *);
- void (*putback_page) (struct page *);
+ int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
int (*launder_folio)(struct folio *);
bool (*is_partially_uptodate)(struct folio *, size_t from, size_t count);
int (*error_remove_page)(struct address_space *, struct page *);
@@ -280,9 +279,7 @@ invalidate_folio: yes exclusive
release_folio: yes
free_folio: yes
direct_IO:
-isolate_page: yes
-migratepage: yes (both)
-putback_page: yes
+migrate_folio: yes (both)
launder_folio: yes
is_partially_uptodate: yes
error_remove_page: yes
diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst
index 4d19b19bcc08..73a4176144b3 100644
--- a/Documentation/filesystems/netfs_library.rst
+++ b/Documentation/filesystems/netfs_library.rst
@@ -301,7 +301,7 @@ through which it can issue requests and negotiate::
void (*issue_read)(struct netfs_io_subrequest *subreq);
bool (*is_still_valid)(struct netfs_io_request *rreq);
int (*check_write_begin)(struct file *file, loff_t pos, unsigned len,
- struct folio *folio, void **_fsdata);
+ struct folio **foliop, void **_fsdata);
void (*done)(struct netfs_io_request *rreq);
};
@@ -381,8 +381,10 @@ The operations are as follows:
allocated/grabbed the folio to be modified to allow the filesystem to flush
conflicting state before allowing it to be modified.
- It should return 0 if everything is now fine, -EAGAIN if the folio should be
- regrabbed and any other error code to abort the operation.
+ It may unlock and discard the folio it was given and set the caller's folio
+ pointer to NULL. It should return 0 if everything is now fine (``*foliop``
+ left set) or the operation should be retried (``*foliop`` cleared), and any
+ other error code to abort the operation.
* ``done``
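For the ``check_write_begin`` change above, a hypothetical implementation of
the new retry convention might look like this (``myfs_has_conflicting_state()``
is an invented helper, not a netfs API)::

    static int myfs_check_write_begin(struct file *file, loff_t pos,
                                      unsigned len, struct folio **foliop,
                                      void **_fsdata)
    {
            if (!myfs_has_conflicting_state(file, pos, len))
                    return 0;               /* *foliop left set: proceed */

            /* Drop the folio and ask netfs to retry the write_begin. */
            folio_unlock(*foliop);
            folio_put(*foliop);
            *foliop = NULL;
            return 0;
    }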
diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst
index 7da6c30ed596..4c76fda07645 100644
--- a/Documentation/filesystems/overlayfs.rst
+++ b/Documentation/filesystems/overlayfs.rst
@@ -607,7 +607,7 @@ can be removed.
User xattr
----------
-The the "-o userxattr" mount option forces overlayfs to use the
+The "-o userxattr" mount option forces overlayfs to use the
"user.overlay." xattr namespace instead of "trusted.overlay.". This is
useful for unprivileged mounting of overlayfs.
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 2e0e4f0e0c6f..e8f370d9ce9c 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -914,3 +914,22 @@ Calling conventions for file_open_root() changed; now it takes struct path *
instead of passing mount and dentry separately. For callers that used to
pass <mnt, mnt->mnt_root> pair (i.e. the root of given mount), a new helper
is provided - file_open_root_mnt(). In-tree users adjusted.
+
+---
+
+**mandatory**
+
+no_llseek is gone; don't set .llseek to that - just leave it NULL instead.
+Checks for "does that file have llseek(2), or should it fail with ESPIPE"
+should be done by looking at FMODE_LSEEK in file->f_mode.
+
+---
+
+*mandatory*
+
+filldir_t (readdir callbacks) calling conventions have changed. Instead of
+returning 0 or -E... it returns bool now. false means "no more" (as -E... used
+to) and true - "keep going" (as 0 in old calling conventions). Rationale:
+callers never looked at specific -E... values anyway. ->iterate() and
+->iterate_shared() instances require no changes at all; all filldir_t instances
+in the tree have been converted.
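Purely for illustration, a filldir_t instance under the new calling convention
might look like the following (``struct my_getdents_ctx`` and
``my_copy_dirent()`` are hypothetical)::

    static bool my_fill_dirent(struct dir_context *ctx, const char *name,
                               int namlen, loff_t offset, u64 ino,
                               unsigned int d_type)
    {
            struct my_getdents_ctx *mctx =
                    container_of(ctx, struct my_getdents_ctx, ctx);

            /* false == "no more", as -E... used to mean */
            if (!my_copy_dirent(mctx, name, namlen, offset, ino, d_type))
                    return false;

            /* true == "keep going", as 0 used to mean */
            return true;
    }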
diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst
index 1bc91fb8c321..e7aafc82be99 100644
--- a/Documentation/filesystems/proc.rst
+++ b/Documentation/filesystems/proc.rst
@@ -448,6 +448,7 @@ Memory Area, or VMA) there is a series of lines such as the following::
MMUPageSize: 4 kB
Rss: 892 kB
Pss: 374 kB
+ Pss_Dirty: 0 kB
Shared_Clean: 892 kB
Shared_Dirty: 0 kB
Private_Clean: 0 kB
@@ -479,7 +480,9 @@ dirty shared and private pages in the mapping.
The "proportional set size" (PSS) of a process is the count of pages it has
in memory, where each page is divided by the number of processes sharing it.
So if a process has 1000 pages all to itself, and 1000 shared with one other
-process, its PSS will be 1500.
+process, its PSS will be 1500. "Pss_Dirty" is the portion of PSS which
+consists of dirty pages. ("Pss_Clean" is not included, but it can be
+calculated by subtracting "Pss_Dirty" from "Pss".)
Note that even a page which is part of a MAP_SHARED mapping, but has only
a single pte mapped, i.e. is currently used by only one process, is accounted
@@ -514,8 +517,10 @@ replaced by copy-on-write) part of the underlying shmem object out on swap.
"SwapPss" shows proportional swap share of this mapping. Unlike "Swap", this
does not take into account swapped out page of underlying shmem objects.
"Locked" indicates whether the mapping is locked in memory or not.
+
"THPeligible" indicates whether the mapping is eligible for allocating THP
-pages - 1 if true, 0 otherwise. It just shows the current status.
+pages as well as whether the THP is PMD mappable - 1 if true, 0 otherwise.
+It just shows the current status.
"VmFlags" field deserves a separate description. This member represents the
kernel flags associated with the particular virtual memory area in two letter
@@ -1109,7 +1114,7 @@ CommitLimit
yield a CommitLimit of 7.3G.
For more details, see the memory overcommit documentation
- in vm/overcommit-accounting.
+ in mm/overcommit-accounting.
Committed_AS
The amount of memory presently allocated on the system.
The committed memory is a sum of all of the memory which
diff --git a/Documentation/filesystems/qnx6.rst b/Documentation/filesystems/qnx6.rst
index fd13433d362c..523b798f04e7 100644
--- a/Documentation/filesystems/qnx6.rst
+++ b/Documentation/filesystems/qnx6.rst
@@ -176,7 +176,7 @@ Then userspace.
The requirement for a static, fixed preallocated system area comes from how
qnx6fs deals with writes.
-Each superblock got it's own half of the system area. So superblock #1
+Each superblock got its own half of the system area. So superblock #1
always uses blocks from the lower half while superblock #2 just writes to
blocks represented by the upper half bitmap system area bits.
diff --git a/Documentation/filesystems/spufs/spufs.rst b/Documentation/filesystems/spufs/spufs.rst
index 8a42859bb100..ca0441cbe37e 100644
--- a/Documentation/filesystems/spufs/spufs.rst
+++ b/Documentation/filesystems/spufs/spufs.rst
@@ -227,7 +227,7 @@ Files
from the data buffer, updating the value of the specified signal
notification register. The signal notification register will
either be replaced with the input data or will be updated to the
- bitwise OR or the old value and the input data, depending on the
+ bitwise OR of the old value and the input data, depending on the
contents of the signal1_type, or signal2_type respectively,
file.
diff --git a/Documentation/filesystems/vfs.rst b/Documentation/filesystems/vfs.rst
index 08069ecd49a6..6cd6953e175b 100644
--- a/Documentation/filesystems/vfs.rst
+++ b/Documentation/filesystems/vfs.rst
@@ -737,12 +737,8 @@ cache in your filesystem. The following members are defined:
bool (*release_folio)(struct folio *, gfp_t);
void (*free_folio)(struct folio *);
ssize_t (*direct_IO)(struct kiocb *, struct iov_iter *iter);
- /* isolate a page for migration */
- bool (*isolate_page) (struct page *, isolate_mode_t);
- /* migrate the contents of a page to the specified target */
- int (*migratepage) (struct page *, struct page *);
- /* put migration-failed page back to right list */
- void (*putback_page) (struct page *);
+ int (*migrate_folio)(struct address_space *, struct folio *dst,
+ struct folio *src, enum migrate_mode);
int (*launder_folio) (struct folio *);
bool (*is_partially_uptodate) (struct folio *, size_t from,
@@ -774,13 +770,38 @@ cache in your filesystem. The following members are defined:
See the file "Locking" for more details.
``read_folio``
- called by the VM to read a folio from backing store. The folio
- will be locked when read_folio is called, and should be unlocked
- and marked uptodate once the read completes. If ->read_folio
- discovers that it cannot perform the I/O at this time, it can
- unlock the folio and return AOP_TRUNCATED_PAGE. In this case,
- the folio will be looked up again, relocked and if that all succeeds,
- ->read_folio will be called again.
+ Called by the page cache to read a folio from the backing store.
+ The 'file' argument supplies authentication information to network
+ filesystems, and is generally not used by block based filesystems.
+ It may be NULL if the caller does not have an open file (eg if
+ the kernel is performing a read for itself rather than on behalf
+ of a userspace process with an open file).
+
+ If the mapping does not support large folios, the folio will
+ contain a single page. The folio will be locked when read_folio
+ is called. If the read completes successfully, the folio should
+ be marked uptodate. The filesystem should unlock the folio
+ once the read has completed, whether it was successful or not.
+ The filesystem does not need to modify the refcount on the folio;
+ the page cache holds a reference count and that will not be
+ released until the folio is unlocked.
+
+ Filesystems may implement ->read_folio() synchronously.
+ In normal operation, folios are read through the ->readahead()
+ method. Only if this fails, or if the caller needs to wait for
+ the read to complete, will the page cache call ->read_folio().
+ Filesystems should not attempt to perform their own readahead
+ in the ->read_folio() operation.
+
+ If the filesystem cannot perform the read at this time, it can
+ unlock the folio, do whatever action it needs to ensure that the
+ read will succeed in the future and return AOP_TRUNCATED_PAGE.
+ In this case, the caller should look up the folio, lock it,
+ and call ->read_folio again.
+
+ Callers may invoke the ->read_folio() method directly, but using
+ read_mapping_folio() will take care of locking, waiting for the
+ read to complete and handle cases such as AOP_TRUNCATED_PAGE.
``writepages``
called by the VM to write out pages associated with the
@@ -905,20 +926,12 @@ cache in your filesystem. The following members are defined:
data directly between the storage and the application's address
space.
-``isolate_page``
- Called by the VM when isolating a movable non-lru page. If page
- is successfully isolated, VM marks the page as PG_isolated via
- __SetPageIsolated.
-
-``migrate_page``
+``migrate_folio``
This is used to compact the physical memory usage. If the VM
- wants to relocate a page (maybe off a memory card that is
- signalling imminent failure) it will pass a new page and an old
- page to this function. migrate_page should transfer any private
- data across and update any references that it has to the page.
-
-``putback_page``
- Called by the VM when isolated page's migration fails.
+ wants to relocate a folio (maybe from a memory device that is
+ signalling imminent failure) it will pass a new folio and an old
+ folio to this function. migrate_folio should transfer any private
+ data across and update any references that it has to the folio.
``launder_folio``
Called before freeing a folio - it writes back the dirty folio.
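To make the ``read_folio`` and ``migrate_folio`` entries above more concrete,
a simple block-backed filesystem could plausibly wire them up as follows
(``myfs_get_block`` is a hypothetical get_block_t; filesystems without
folio-private data can often just use the generic ``filemap_migrate_folio()``
helper)::

    static int myfs_read_folio(struct file *file, struct folio *folio)
    {
            /* Delegate to the generic buffer-head implementation. */
            return block_read_full_folio(folio, myfs_get_block);
    }

    const struct address_space_operations myfs_aops = {
            .read_folio     = myfs_read_folio,
            .migrate_folio  = filemap_migrate_folio,
            /* other methods elided */
    };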
diff --git a/Documentation/filesystems/xfs-delayed-logging-design.rst b/Documentation/filesystems/xfs-delayed-logging-design.rst
index 464405d2801e..6402ab8e370c 100644
--- a/Documentation/filesystems/xfs-delayed-logging-design.rst
+++ b/Documentation/filesystems/xfs-delayed-logging-design.rst
@@ -1,29 +1,314 @@
.. SPDX-License-Identifier: GPL-2.0
-==========================
-XFS Delayed Logging Design
-==========================
-
-Introduction to Re-logging in XFS
-=================================
-
-XFS logging is a combination of logical and physical logging. Some objects,
-such as inodes and dquots, are logged in logical format where the details
-logged are made up of the changes to in-core structures rather than on-disk
-structures. Other objects - typically buffers - have their physical changes
-logged. The reason for these differences is to reduce the amount of log space
-required for objects that are frequently logged. Some parts of inodes are more
-frequently logged than others, and inodes are typically more frequently logged
-than any other object (except maybe the superblock buffer) so keeping the
-amount of metadata logged low is of prime importance.
-
-The reason that this is such a concern is that XFS allows multiple separate
-modifications to a single object to be carried in the log at any given time.
-This allows the log to avoid needing to flush each change to disk before
-recording a new change to the object. XFS does this via a method called
-"re-logging". Conceptually, this is quite simple - all it requires is that any
-new change to the object is recorded with a *new copy* of all the existing
-changes in the new transaction that is written to the log.
+==================
+XFS Logging Design
+==================
+
+Preamble
+========
+
+This document describes the design and algorithms that the XFS journalling
+subsystem is based on, so that readers may familiarize themselves with the
+general concepts of how transaction processing in XFS works.
+
+We begin with an overview of transactions in XFS, followed by a description of
+how transaction reservations are structured and accounted, and then move on to
+how we guarantee forwards progress for long running transactions with finite
+initial reservation bounds. At that point we explain how relogging works. With
+the basic concepts covered, the design of the delayed logging mechanism is
+documented.
+
+
+Introduction
+============
+
+XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
+are atomic and recoverable. For reasons of space and time efficiency, the
+logging mechanisms are varied and complex, combining intents, logical and
+physical logging mechanisms to provide the necessary recovery guarantees the
+filesystem requires.
+
+Some objects, such as inodes and dquots, are logged in logical format where the
+details logged are made up of the changes to in-core structures rather than
+on-disk structures. Other objects - typically buffers - have their physical
+changes logged. Long running atomic modifications have individual changes
+chained together by intents, ensuring that journal recovery can restart and
+finish an operation that was only partially done when the system stopped
+functioning.
+
+The reason for these differences is to keep the amount of log space and CPU time
+required to process objects being modified as small as possible and hence the
+logging overhead as low as possible. Some items are very frequently modified,
+and some parts of objects are more frequently modified than others, so keeping
+the overhead of metadata logging low is of prime importance.
+
+The method used to log an item or chain modifications together isn't
+particularly important in the scope of this document. It suffices to know that
+the methods used for logging a particular object or chaining modifications
+together are different and are dependent on the object and/or modification being
+performed. The logging subsystem only cares that certain specific rules are
+followed to guarantee forwards progress and prevent deadlocks.
+
+
+Transactions in XFS
+===================
+
+XFS has two types of high level transactions, defined by the type of log space
+reservation they take. These are known as "one shot" and "permanent"
+transactions. Permanent transactions can take reservations that span commit
+boundaries, whilst "one shot" transactions are for a single atomic
+modification.
+
+The type and size of reservation must be matched to the modification taking
+place. This means that permanent transactions can be used for one-shot
+modifications, but one-shot reservations cannot be used for permanent
+transactions.
+
+In the code, a one-shot transaction pattern looks somewhat like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ <lock items>
+ <join item to transaction>
+ <do modification>
+ xfs_trans_commit(tp);
+
+As items are modified in the transaction, the dirty regions in those items are
+tracked via the transaction handle. Once the transaction is committed, all
+resources joined to it are released, along with the remaining unused reservation
+space that was taken at the transaction allocation time.
+
+In contrast, a permanent transaction is made up of multiple linked individual
+transactions, and the pattern looks like this::
+
+ tp = xfs_trans_alloc(<reservation>)
+ xfs_ilock(ip, XFS_ILOCK_EXCL)
+
+ loop {
+ xfs_trans_ijoin(tp, ip, 0);
+ <do modification>
+ xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);
+ xfs_trans_roll(&tp);
+ }
+
+ xfs_trans_commit(tp);
+ xfs_iunlock(ip, XFS_ILOCK_EXCL);
+
+While this might look similar to a one-shot transaction, there is an important
+difference: xfs_trans_roll() performs a specific operation that links two
+transactions together::
+
+ ntp = xfs_trans_dup(tp);
+ xfs_trans_commit(tp);
+ xfs_trans_reserve(ntp);
+
+This results in a series of "rolling transactions" where the inode is locked
+across the entire chain of transactions. Hence while this series of rolling
+transactions is running, nothing else can read from or write to the inode and
+this provides a mechanism for complex changes to appear atomic from an external
+observer's point of view.
+
+It is important to note that a series of rolling transactions in a permanent
+transaction does not form an atomic change in the journal. While each
+individual modification is atomic, the chain is *not atomic*. If we crash half
+way through, then recovery will only replay up to the last transactional
+modification the loop made that was committed to the journal.
+
+This affects long running permanent transactions in that it is not possible to
+predict how much of a long running operation will actually be recovered because
+there is no guarantee of how much of the operation reached stale storage. Hence
+if a long running operation requires multiple transactions to fully complete,
+the high level operation must use intents and deferred operations to guarantee
+recovery can complete the operation once the first transaction is persisted in
+the on-disk journal.
+
+
+Transactions are Asynchronous
+=============================
+
+In XFS, all high level transactions are asynchronous by default. This means that
+xfs_trans_commit() does not guarantee that the modification has been committed
+to stable storage when it returns. Hence when a system crashes, not all the
+completed transactions will be replayed during recovery.
+
+However, the logging subsystem does provide global ordering guarantees, such
+that if a specific change is seen after recovery, all metadata modifications
+that were committed prior to that change will also be seen.
+
+For single shot operations that need to reach stable storage immediately, or to
+ensure that a long running permanent transaction is fully committed once it is
+complete, we can explicitly tag a transaction as synchronous. This will trigger
+a "log force" to flush the outstanding committed transactions to stable storage
+in the journal and wait for that to complete.
+
+Synchronous transactions are rarely used, however, because they limit logging
+throughput to the IO latency limitations of the underlying storage. Instead, we
+tend to use log forces to ensure modifications are on stable storage only when
+a user operation requires a synchronisation point to occur (e.g. fsync).
+
+
+Transaction Reservations
+========================
+
+It has been mentioned a number of times now that the logging subsystem needs to
+provide a forwards progress guarantee so that no modification ever stalls
+because it can't be written to the journal due to a lack of space in the
+journal. This is achieved by the transaction reservations that are made when
+a transaction is first allocated. For permanent transactions, these reservations
+are maintained as part of the transaction rolling mechanism.
+
+A transaction reservation provides a guarantee that there is physical log space
+available to write the modification into the journal before we start making
+modifications to objects and items. As such, the reservation needs to be large
+enough to take into account the amount of metadata that the change might need to
+log in the worst case. This means that if we are modifying a btree in the
+transaction, we have to reserve enough space to record a full leaf-to-root split
+of the btree. As such, the reservations are quite complex because we have to
+take into account all the hidden changes that might occur.
+
+For example, a user data extent allocation involves allocating an extent from
+free space, which modifies the free space trees. That's two btrees. Inserting
+the extent into the inode's extent map might require a split of the extent map
+btree, which requires another allocation that can modify the free space trees
+again. Then we might have to update reverse mappings, which modifies yet
+another btree which might require more space. And so on. Hence the amount of
+metadata that a "simple" operation can modify can be quite large.
+
+This "worst case" calculation provides us with the static "unit reservation"
+for the transaction that is calculated at mount time. We must guarantee that the
+log has this much space available before the transaction is allowed to proceed
+so that when we come to write the dirty metadata into the log we don't run out
+of log space half way through the write.
+
+For one-shot transactions, a single unit space reservation is all that is
+required for the transaction to proceed. For permanent transactions, however, we
+also have a "log count" that affects the size of the reservation that is to be
+made.
+
+While a permanent transaction can get by with a single unit of space
+reservation, it is somewhat inefficient to do this as it requires the
+transaction rolling mechanism to re-reserve space on every transaction roll. We
+know from the implementation of the permanent transactions how many transaction
+rolls are likely for the common modifications that need to be made.
+
+For example, an inode allocation is typically two transactions - one to
+physically allocate a free inode chunk on disk, and another to allocate an inode
+from an inode chunk that has free inodes in it. Hence for an inode allocation
+transaction, we might set the reservation log count to a value of 2 to indicate
+that the common/fast path transaction will commit two linked transactions in a
+chain. Each time a permanent transaction rolls, it consumes an entire unit
+reservation.
+
+Hence when the permanent transaction is first allocated, the log space
+reservation is increased from a single unit reservation to multiple unit
+reservations. That multiple is defined by the reservation log count, and this
+means we can roll the transaction multiple times before we have to re-reserve
+log space when we roll the transaction. This ensures that the common
+modifications we make only need to reserve log space once.
+
+If the log count for a permanent transaction reaches zero, then it needs to
+re-reserve physical space in the log. This is somewhat complex, and requires
+an understanding of how the log accounts for space that has been reserved.
+
+
+Log Space Accounting
+====================
+
+The position in the log is typically referred to as a Log Sequence Number (LSN).
+The log is circular, so the positions in the log are defined by the combination
+of a cycle number - the number of times the log has been overwritten - and the
+offset into the log. A LSN carries the cycle in the upper 32 bits and the
+offset in the lower 32 bits. The offset is in units of "basic blocks" (512
+bytes). Hence we can do relatively simple LSN based math to keep track of
+available space in the log.
+
+Log space accounting is done via a pair of constructs called "grant heads". The
+position of the grant heads is an absolute value, so the amount of space
+available in the log is defined by the distance between the position of the
+grant head and the current log tail. That is, how much space can be
+reserved/consumed before the grant heads would fully wrap the log and overtake
+the tail position.
+
+The first grant head is the "reserve" head. This tracks the byte count of the
+reservations currently held by active transactions. It is a purely in-memory
+accounting of the space reservation and, as such, actually tracks byte offsets
+into the log rather than basic blocks. Hence it technically isn't using LSNs to
+represent the log position, but it is still treated like a split {cycle,offset}
+tuple for the purposes of tracking reservation space.
+
+The reserve grant head is used to accurately account for exact transaction
+reservation amounts and the exact byte count that modifications actually make
+and need to write into the log. The reserve head is used to prevent new
+transactions from taking new reservations when the head reaches the current
+tail. It will block new reservations in a FIFO queue and as the log tail moves
+forward it will wake them in order once sufficient space is available. This FIFO
+mechanism ensures no transaction is starved of resources when log space
+shortages occur.
+
+The other grant head is the "write" head. Unlike the reserve head, this grant
+head contains an LSN and it tracks the physical space usage in the log. While
+this might sound like it is accounting the same state as the reserve grant head
+- and it mostly does track exactly the same location as the reserve grant head -
+there are critical differences in behaviour between them that provide the
+forwards progress guarantees that rolling permanent transactions require.
+
+These differences become apparent when a permanent transaction is rolled and the
+internal "log count" reaches zero, exhausting the initial set of unit
+reservations. At this point, we still require a log space reservation to continue
+the next transaction in the sequence, but we have none remaining. We cannot
+sleep during the transaction commit process waiting for new log space to become
+available, as we may end up on the end of the FIFO queue and the items we have
+locked while we sleep could end up pinning the tail of the log before there is
+enough free space in the log to fulfill all of the pending reservations and
+then wake up transaction commit in progress.
+
+To take a new reservation without sleeping requires us to be able to take a
+reservation even if there is no reservation space currently available. That is,
+we need to be able to *overcommit* the log reservation space. As has already
+been detailed, we cannot overcommit physical log space. However, the reserve
+grant head does not track physical space - it only accounts for the amount of
+reservations we currently have outstanding. Hence if the reserve head passes
+over the tail of the log all it means is that new reservations will be throttled
+immediately and remain throttled until the log tail is moved forward far enough
+to remove the overcommit and start taking new reservations. In other words, we
+can overcommit the reserve head without violating the physical log head and tail
+rules.
+
+As a result, permanent transactions only "regrant" reservation space during
+xfs_trans_commit() calls, while the physical log space reservation - tracked by
+the write head - is then reserved separately by a call to xfs_log_reserve()
+after the commit completes. Once the commit completes, we can sleep waiting for
+physical log space to be reserved from the write grant head, but only if one
+critical rule has been observed::
+
+ Code using permanent reservations must always log the items they hold
+ locked across each transaction they roll in the chain.
+
+"Re-logging" the locked items on every transaction roll ensures that the items
+attached to the transaction chain being rolled are always relocated to the
+physical head of the log and so do not pin the tail of the log. If a locked item
+pins the tail of the log when we sleep on the write reservation, then we will
+deadlock the log as we cannot take the locks needed to write back that item and
+move the tail of the log forwards to free up write grant space. Re-logging the
+locked items avoids this deadlock and guarantees that the log reservation we are
+making cannot self-deadlock.
+
+If all rolling transactions obey this rule, then they can all make forwards
+progress independently because nothing will block the progress of the log
+tail moving forwards and hence ensuring that write grant space is always
+(eventually) made available to permanent transactions no matter how many times
+they roll.
+
+
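As a small illustrative sketch of the cycle/offset split described in the Log
Space Accounting section above (helper names are invented, not the kernel's
internal macros)::

    #include <stdint.h>

    /* Cycle number in the high 32 bits, basic-block offset in the low 32. */
    static inline uint64_t lsn_pack(uint32_t cycle, uint32_t block)
    {
            return ((uint64_t)cycle << 32) | block;
    }

    static inline uint32_t lsn_cycle(uint64_t lsn) { return lsn >> 32; }
    static inline uint32_t lsn_block(uint64_t lsn) { return lsn & 0xffffffffu; }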
+Re-logging Explained
+====================
+
+XFS allows multiple separate modifications to a single object to be carried in
+the log at any given time. This allows the log to avoid needing to flush each
+change to disk before recording a new change to the object. XFS does this via a
+method called "re-logging". Conceptually, this is quite simple - all it requires
+is that any new change to the object is recorded with a *new copy* of all the
+existing changes in the new transaction that is written to the log.
That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
@@ -42,16 +327,13 @@ transaction::
In other words, each time an object is relogged, the new transaction contains
the aggregation of all the previous changes currently held only in the log.
-This relogging technique also allows objects to be moved forward in the log so
-that an object being relogged does not prevent the tail of the log from ever
-moving forward. This can be seen in the table above by the changing
-(increasing) LSN of each subsequent transaction - the LSN is effectively a
-direct encoding of the location in the log of the transaction.
+This relogging technique allows objects to be moved forward in the log so that
+an object being relogged does not prevent the tail of the log from ever moving
+forward. This can be seen in the table above by the changing (increasing) LSN
+of each subsequent transaction, and it's the technique that allows us to
+implement long-running, multiple-commit permanent transactions.
-This relogging is also used to implement long-running, multiple-commit
-transactions. These transaction are known as rolling transactions, and require
-a special log reservation known as a permanent transaction reservation. A
-typical example of a rolling transaction is the removal of extents from an
+A typical example of a rolling transaction is the removal of extents from an
inode which can only be done at a rate of two extents per transaction because
of reservation size limitations. Hence a rolling extent removal transaction
keeps relogging the inode and btree buffers as they get modified in each
@@ -67,12 +349,13 @@ the log over and over again. Worse is the fact that objects tend to get
dirtier as they get relogged, so each subsequent transaction is writing more
metadata into the log.
-Another feature of the XFS transaction subsystem is that most transactions are
-asynchronous. That is, they don't commit to disk until either a log buffer is
-filled (a log buffer can hold multiple transactions) or a synchronous operation
-forces the log buffers holding the transactions to disk. This means that XFS is
-doing aggregation of transactions in memory - batching them, if you like - to
-minimise the impact of the log IO on transaction throughput.
+It should now also be obvious how relogging and asynchronous transactions go
+hand in hand. That is, transactions don't get written to the physical journal
+until either a log buffer is filled (a log buffer can hold multiple
+transactions) or a synchronous operation forces the log buffers holding the
+transactions to disk. This means that XFS is doing aggregation of transactions
+in memory - batching them, if you like - to minimise the impact of the log IO on
+transaction throughput.
The limitation on asynchronous transaction throughput is the number and size of
log buffers made available by the log manager. By default there are 8 log
@@ -268,14 +551,14 @@ Essentially, this shows that an item that is in the AIL can still be modified
and relogged, so any tracking must be separate to the AIL infrastructure. As
such, we cannot reuse the AIL list pointers for tracking committed items, nor
can we store state in any field that is protected by the AIL lock. Hence the
-committed item tracking needs it's own locks, lists and state fields in the log
+committed item tracking needs its own locks, lists and state fields in the log
item.
Similar to the AIL, tracking of committed items is done through a new list
called the Committed Item List (CIL). The list tracks log items that have been
committed and have formatted memory buffers attached to them. It tracks objects
in transaction commit order, so when an object is relogged it is removed from
-it's place in the list and re-inserted at the tail. This is entirely arbitrary
+its place in the list and re-inserted at the tail. This is entirely arbitrary
and done to make it easy for debugging - the last items in the list are the
ones that are most recently modified. Ordering of the CIL is not necessary for
transactional integrity (as discussed in the next section) so the ordering is
@@ -332,7 +615,7 @@ those changes into the current checkpoint context. We then initialise a new
context and attach that to the CIL for aggregation of new transactions.
This allows us to unlock the CIL immediately after transfer of all the
-committed items and effectively allow new transactions to be issued while we
+committed items and effectively allows new transactions to be issued while we
are formatting the checkpoint into the log. It also allows concurrent
checkpoints to be written into the log buffers in the case of log force heavy
workloads, just like the existing transaction commit code does. This, however,
@@ -601,9 +884,9 @@ pin the object the first time it is inserted into the CIL - if it is already in
the CIL during a transaction commit, then we do not pin it again. Because there
can be multiple outstanding checkpoint contexts, we can still see elevated pin
counts, but as each checkpoint completes the pin count will retain the correct
-value according to it's context.
+value according to its context.
-Just to make matters more slightly more complex, this checkpoint level context
+Just to make matters slightly more complex, this checkpoint level context
for the pin count means that the pinning of an item must take place under the
CIL commit/flush lock. If we pin the object outside this lock, we cannot
guarantee which context the pin count is associated with. This is because of