diff options
Diffstat (limited to 'Documentation/filesystems')
-rw-r--r-- | Documentation/filesystems/caching/cachefiles.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/f2fs.rst | 44 | ||||
-rw-r--r-- | Documentation/filesystems/index.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/iomap/operations.rst | 17 | ||||
-rw-r--r-- | Documentation/filesystems/multigrain-ts.rst | 125 | ||||
-rw-r--r-- | Documentation/filesystems/netfs_library.rst | 1 | ||||
-rw-r--r-- | Documentation/filesystems/nfs/exporting.rst | 7 | ||||
-rw-r--r-- | Documentation/filesystems/overlayfs.rst | 17 | ||||
-rw-r--r-- | Documentation/filesystems/porting.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/proc.rst | 2 | ||||
-rw-r--r-- | Documentation/filesystems/tmpfs.rst | 24 |
11 files changed, 230 insertions, 12 deletions
diff --git a/Documentation/filesystems/caching/cachefiles.rst b/Documentation/filesystems/caching/cachefiles.rst index e04a27bdbe19..b3ccc782cb3b 100644 --- a/Documentation/filesystems/caching/cachefiles.rst +++ b/Documentation/filesystems/caching/cachefiles.rst @@ -115,7 +115,7 @@ set up cache ready for use. The following script commands are available: This mask can also be set through sysfs, eg:: - echo 5 >/sys/modules/cachefiles/parameters/debug + echo 5 > /sys/module/cachefiles/parameters/debug Starting the Cache diff --git a/Documentation/filesystems/f2fs.rst b/Documentation/filesystems/f2fs.rst index 68a0885fb5e6..fb7d2ee022bc 100644 --- a/Documentation/filesystems/f2fs.rst +++ b/Documentation/filesystems/f2fs.rst @@ -943,3 +943,47 @@ NVMe Zoned Namespace devices can start before the zone-capacity and span across zone-capacity boundary. Such spanning segments are also considered as usable segments. All blocks past the zone-capacity are considered unusable in these segments. + +Device aliasing feature +----------------------- + +f2fs can utilize a special file called a "device aliasing file." This file allows +the entire storage device to be mapped with a single, large extent, not using +the usual f2fs node structures. This mapped area is pinned and primarily intended +for holding the space. + +Essentially, this mechanism allows a portion of the f2fs area to be temporarily +reserved and used by another filesystem or for different purposes. Once that +external usage is complete, the device aliasing file can be deleted, releasing +the reserved space back to F2FS for its own use. + +<use-case> + +# ls /dev/vd* +/dev/vdb (32GB) /dev/vdc (32GB) +# mkfs.ext4 /dev/vdc +# mkfs.f2fs -c /dev/vdc@vdc.file /dev/vdb +# mount /dev/vdb /mnt/f2fs +# ls -l /mnt/f2fs +vdc.file +# df -h +/dev/vdb 64G 33G 32G 52% /mnt/f2fs + +# mount -o loop /dev/vdc /mnt/ext4 +# df -h +/dev/vdb 64G 33G 32G 52% /mnt/f2fs +/dev/loop7 32G 24K 30G 1% /mnt/ext4 +# umount /mnt/ext4 + +# f2fs_io getflags /mnt/f2fs/vdc.file +get a flag on /mnt/f2fs/vdc.file ret=0, flags=nocow(pinned),immutable +# f2fs_io setflags noimmutable /mnt/f2fs/vdc.file +get a flag on noimmutable ret=0, flags=800010 +set a flag on /mnt/f2fs/vdc.file ret=0, flags=noimmutable +# rm /mnt/f2fs/vdc.file +# df -h +/dev/vdb 64G 753M 64G 2% /mnt/f2fs + +So, the key idea is, user can do any file operations on /dev/vdc, and +reclaim the space after the use, while the space is counted as /data. +That doesn't require modifying partition size and filesystem format. diff --git a/Documentation/filesystems/index.rst b/Documentation/filesystems/index.rst index e8e496d23e1d..44e9e77ffe0d 100644 --- a/Documentation/filesystems/index.rst +++ b/Documentation/filesystems/index.rst @@ -29,6 +29,7 @@ algorithms work. fiemap files locks + multigrain-ts mount_api quota seq_file diff --git a/Documentation/filesystems/iomap/operations.rst b/Documentation/filesystems/iomap/operations.rst index 8e6c721d2330..ef082e5a4e0c 100644 --- a/Documentation/filesystems/iomap/operations.rst +++ b/Documentation/filesystems/iomap/operations.rst @@ -208,7 +208,7 @@ The filesystem must arrange to `cancel such `reservations <https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/>`_ because writeback will not consume the reservation. -The ``iomap_file_buffered_write_punch_delalloc`` can be called from a +The ``iomap_write_delalloc_release`` can be called from a ``->iomap_end`` function to find all the clean areas of the folios caching a fresh (``IOMAP_F_NEW``) delalloc mapping. It takes the ``invalidate_lock``. @@ -513,6 +513,21 @@ IOMAP_WRITE`` with any combination of the following enhancements: if the mapping is unwritten and the filesystem cannot handle zeroing the unaligned regions without exposing stale contents. + * ``IOMAP_ATOMIC``: This write is being issued with torn-write + protection. + Only a single bio can be created for the write, and the write must + not be split into multiple I/O requests, i.e. flag REQ_ATOMIC must be + set. + The file range to write must be aligned to satisfy the requirements + of both the filesystem and the underlying block device's atomic + commit capabilities. + If filesystem metadata updates are required (e.g. unwritten extent + conversion or copy on write), all updates for the entire file range + must be committed atomically as well. + Only one space mapping is allowed per untorn write. + Untorn writes must be aligned to, and must not be longer than, a + single file block. + Callers commonly hold ``i_rwsem`` in shared or exclusive mode before calling this function. diff --git a/Documentation/filesystems/multigrain-ts.rst b/Documentation/filesystems/multigrain-ts.rst new file mode 100644 index 000000000000..c779e47284e8 --- /dev/null +++ b/Documentation/filesystems/multigrain-ts.rst @@ -0,0 +1,125 @@ +.. SPDX-License-Identifier: GPL-2.0 + +===================== +Multigrain Timestamps +===================== + +Introduction +============ +Historically, the kernel has always used coarse time values to stamp inodes. +This value is updated every jiffy, so any change that happens within that jiffy +will end up with the same timestamp. + +When the kernel goes to stamp an inode (due to a read or write), it first gets +the current time and then compares it to the existing timestamp(s) to see +whether anything will change. If nothing changed, then it can avoid updating +the inode's metadata. + +Coarse timestamps are therefore good from a performance standpoint, since they +reduce the need for metadata updates, but bad from the standpoint of +determining whether anything has changed, since a lot of things can happen in a +jiffy. + +They are particularly troublesome with NFSv3, where unchanging timestamps can +make it difficult to tell whether to invalidate caches. NFSv4 provides a +dedicated change attribute that should always show a visible change, but not +all filesystems implement this properly, causing the NFS server to substitute +the ctime in many cases. + +Multigrain timestamps aim to remedy this by selectively using fine-grained +timestamps when a file has had its timestamps queried recently, and the current +coarse-grained time does not cause a change. + +Inode Timestamps +================ +There are currently 3 timestamps in the inode that are updated to the current +wallclock time on different activity: + +ctime: + The inode change time. This is stamped with the current time whenever + the inode's metadata is changed. Note that this value is not settable + from userland. + +mtime: + The inode modification time. This is stamped with the current time + any time a file's contents change. + +atime: + The inode access time. This is stamped whenever an inode's contents are + read. Widely considered to be a terrible mistake. Usually avoided with + options like noatime or relatime. + +Updating the mtime always implies a change to the ctime, but updating the +atime due to a read request does not. + +Multigrain timestamps are only tracked for the ctime and the mtime. atimes are +not affected and always use the coarse-grained value (subject to the floor). + +Inode Timestamp Ordering +======================== + +In addition to just providing info about changes to individual files, file +timestamps also serve an important purpose in applications like "make". These +programs measure timestamps in order to determine whether source files might be +newer than cached objects. + +Userland applications like make can only determine ordering based on +operational boundaries. For a syscall those are the syscall entry and exit +points. For io_uring or nfsd operations, that's the request submission and +response. In the case of concurrent operations, userland can make no +determination about the order in which things will occur. + +For instance, if a single thread modifies one file, and then another file in +sequence, the second file must show an equal or later mtime than the first. The +same is true if two threads are issuing similar operations that do not overlap +in time. + +If however, two threads have racing syscalls that overlap in time, then there +is no such guarantee, and the second file may appear to have been modified +before, after or at the same time as the first, regardless of which one was +submitted first. + +Note that the above assumes that the system doesn't experience a backward jump +of the realtime clock. If that occurs at an inopportune time, then timestamps +can appear to go backward, even on a properly functioning system. + +Multigrain Timestamp Implementation +=================================== +Multigrain timestamps are aimed at ensuring that changes to a single file are +always recognizable, without violating the ordering guarantees when multiple +different files are modified. This affects the mtime and the ctime, but the +atime will always use coarse-grained timestamps. + +It uses an unused bit in the i_ctime_nsec field to indicate whether the mtime +or ctime has been queried. If either or both have, then the kernel takes +special care to ensure the next timestamp update will display a visible change. +This ensures tight cache coherency for use-cases like NFS, without sacrificing +the benefits of reduced metadata updates when files aren't being watched. + +The Ctime Floor Value +===================== +It's not sufficient to simply use fine or coarse-grained timestamps based on +whether the mtime or ctime has been queried. A file could get a fine grained +timestamp, and then a second file modified later could get a coarse-grained one +that appears earlier than the first, which would break the kernel's timestamp +ordering guarantees. + +To mitigate this problem, maintain a global floor value that ensures that +this can't happen. The two files in the above example may appear to have been +modified at the same time in such a case, but they will never show the reverse +order. To avoid problems with realtime clock jumps, the floor is managed as a +monotonic ktime_t, and the values are converted to realtime clock values as +needed. + +Implementation Notes +==================== +Multigrain timestamps are intended for use by local filesystems that get +ctime values from the local clock. This is in contrast to network filesystems +and the like that just mirror timestamp values from a server. + +For most filesystems, it's sufficient to just set the FS_MGTIME flag in the +fstype->fs_flags in order to opt-in, providing the ctime is only ever set via +inode_set_ctime_current(). If the filesystem has a ->getattr routine that +doesn't call generic_fillattr, then it should call fill_mg_cmtime() to +fill those values. For setattr, it should use setattr_copy() to update the +timestamps, or otherwise mimic its behavior. diff --git a/Documentation/filesystems/netfs_library.rst b/Documentation/filesystems/netfs_library.rst index f0d2cb257bb8..73f0bfd7e903 100644 --- a/Documentation/filesystems/netfs_library.rst +++ b/Documentation/filesystems/netfs_library.rst @@ -592,4 +592,3 @@ API Function Reference .. kernel-doc:: include/linux/netfs.h .. kernel-doc:: fs/netfs/buffered_read.c -.. kernel-doc:: fs/netfs/io.c diff --git a/Documentation/filesystems/nfs/exporting.rst b/Documentation/filesystems/nfs/exporting.rst index f04ce1215a03..de64d2d002a2 100644 --- a/Documentation/filesystems/nfs/exporting.rst +++ b/Documentation/filesystems/nfs/exporting.rst @@ -238,10 +238,3 @@ following flags are defined: all of an inode's dirty data on last close. Exports that behave this way should set EXPORT_OP_FLUSH_ON_CLOSE so that NFSD knows to skip waiting for writeback when closing such files. - - EXPORT_OP_ASYNC_LOCK - Indicates a capable filesystem to do async lock - requests from lockd. Only set EXPORT_OP_ASYNC_LOCK if the filesystem has - it's own ->lock() functionality as core posix_lock_file() implementation - has no async lock request handling yet. For more information about how to - indicate an async lock request from a ->lock() file_operations struct, see - fs/locks.c and comment for the function vfs_lock_file(). diff --git a/Documentation/filesystems/overlayfs.rst b/Documentation/filesystems/overlayfs.rst index 343644712340..4c8387e1c880 100644 --- a/Documentation/filesystems/overlayfs.rst +++ b/Documentation/filesystems/overlayfs.rst @@ -440,6 +440,23 @@ For example:: fsconfig(fs_fd, FSCONFIG_SET_STRING, "datadir+", "/do2", 0); +Specifying layers via file descriptors +-------------------------------------- + +Since kernel v6.13, overlayfs supports specifying layers via file descriptors in +addition to specifying them as paths. This feature is available for the +"datadir+", "lowerdir+", "upperdir", and "workdir+" mount options with the +fsconfig syscall from the new mount api:: + + fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower1); + fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower2); + fsconfig(fs_fd, FSCONFIG_SET_FD, "lowerdir+", NULL, fd_lower3); + fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data1); + fsconfig(fs_fd, FSCONFIG_SET_FD, "datadir+", NULL, fd_data2); + fsconfig(fs_fd, FSCONFIG_SET_FD, "workdir", NULL, fd_work); + fsconfig(fs_fd, FSCONFIG_SET_FD, "upperdir", NULL, fd_upper); + + fs-verity support ----------------- diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst index 92bffcc6747a..9ab2a3d6f2b4 100644 --- a/Documentation/filesystems/porting.rst +++ b/Documentation/filesystems/porting.rst @@ -177,7 +177,7 @@ settles down a bit. **mandatory** s_export_op is now required for exporting a filesystem. -isofs, ext2, ext3, reiserfs, fat +isofs, ext2, ext3, fat can be used as examples of very different filesystems. --- diff --git a/Documentation/filesystems/proc.rst b/Documentation/filesystems/proc.rst index e834779d9611..6a882c57a7e7 100644 --- a/Documentation/filesystems/proc.rst +++ b/Documentation/filesystems/proc.rst @@ -579,7 +579,7 @@ encoded manner. The codes are the following: mt arm64 MTE allocation tags are enabled um userfaultfd missing tracking uw userfaultfd wr-protect tracking - ss shadow stack page + ss shadow/guarded control stack page sl sealed == ======================================= diff --git a/Documentation/filesystems/tmpfs.rst b/Documentation/filesystems/tmpfs.rst index 56a26c843dbe..d677e0428c3f 100644 --- a/Documentation/filesystems/tmpfs.rst +++ b/Documentation/filesystems/tmpfs.rst @@ -241,6 +241,28 @@ So 'mount -t tmpfs -o size=10G,nr_inodes=10k,mode=700 tmpfs /mytmpfs' will give you tmpfs instance on /mytmpfs which can allocate 10GB RAM/SWAP in 10240 inodes and it is only accessible by root. +tmpfs has the following mounting options for case-insensitive lookup support: + +================= ============================================================== +casefold Enable casefold support at this mount point using the given + argument as the encoding standard. Currently only UTF-8 + encodings are supported. If no argument is used, it will load + the latest UTF-8 encoding available. +strict_encoding Enable strict encoding at this mount point (disabled by + default). In this mode, the filesystem refuses to create file + and directory with names containing invalid UTF-8 characters. +================= ============================================================== + +This option doesn't render the entire filesystem case-insensitive. One needs to +still set the casefold flag per directory, by flipping the +F attribute in an +empty directory. Nevertheless, new directories will inherit the attribute. The +mountpoint itself cannot be made case-insensitive. + +Example:: + + $ mount -t tmpfs -o casefold=utf8-12.1.0,strict_encoding fs_name /mytmpfs + $ mount -t tmpfs -o casefold fs_name /mytmpfs + :Author: Christoph Rohland <cr@sap.com>, 1.12.01 @@ -250,3 +272,5 @@ RAM/SWAP in 10240 inodes and it is only accessible by root. KOSAKI Motohiro, 16 Mar 2010 :Updated: Chris Down, 13 July 2020 +:Updated: + André Almeida, 23 Aug 2024 |