Design of the libcephfs proxy
=============================

Description of the problem
--------------------------

When an application connects to a Ceph volume through the *libcephfs.so*
library, a cache is created locally inside the process. The *libcephfs.so*
implementation already deals with the memory usage of the cache and adjusts it
so that it doesn't consume all the available memory. However, if multiple
processes connect to CephFS through different instances of the library, each
one of them will keep a private cache. In this case memory management is not
effective because, even when memory limits are configured, the number of
libcephfs instances that can be created is unbounded, and they can't work in a
coordinated way to correctly control resource usage. Because of this, it's
relatively easy to consume all memory when all processes are using the data
cache intensively, causing the OOM killer to terminate those processes.

Proposed solution
-----------------

High level approach
^^^^^^^^^^^^^^^^^^^

The main idea is to create a *libcephfs_proxy.so* library that will provide the
same API as the original *libcephfs.so*, but won't cache any data. This library
can be used transparently (i.e. without code modifications) by any application
currently using *libcephfs.so*, just by linking against *libcephfs_proxy.so*
instead of *libcephfs.so*, or even by using *LD_PRELOAD*.

A new *libcephfsd* daemon will also be created. This daemon will link against
the real *libcephfs.so* library and will listen for incoming connections on a
UNIX socket.

When an application starts and initiates CephFS requests through the
*libcephfs_proxy.so* library, the library will connect to the *libcephfsd*
daemon through the UNIX socket and forward all CephFS requests to it. The
daemon will use the real *libcephfs.so* to execute those requests, and the
answers will be returned to the application, potentially caching data in the
*libcephfsd* process itself. All this will happen transparently, without the
application being aware of it.

The daemon will share low-level *libcephfs.so* mounts between different
applications to avoid creating an instance for each application, which would
have the same effect on memory as linking each application directly against
the *libcephfs.so* library. Mounts will be shared only if the configuration
defined by the applications is identical. Otherwise, new independent instances
will still be created.

Some *libcephfs.so* functions will need to be implemented in a special way
inside the *libcephfsd* daemon to hide the differences caused by sharing the
same mount instance with more than one client (for example, chdir/getcwd
cannot rely directly on the ``ceph_chdir()``/``ceph_getcwd()`` of
*libcephfs.so*).

Initially, only the subset of the low-level interface functions of
*libcephfs.so* that is used by Samba's CephFS VFS module will be provided.

Design of common components
^^^^^^^^^^^^^^^^^^^^^^^^^^^

Network protocol
""""""""""""""""

Since the connection through the UNIX socket is to another process that runs
on the same machine, and the data we need to pass is quite simple, we'll avoid
all the overhead of generic XDR encoding/decoding and RPC transmissions by
using a very simple serialization implemented in the code itself.
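Conceptually, each message is just a small fixed header followed by the raw
function arguments. The sketch below illustrates the idea in C; the structure
and field names are made up for this document and are not the actual wire
format used by the proxy:

.. code-block:: c

    #include <stdint.h>

    /* Illustrative only: every request starts with a fixed header that
     * identifies the libcephfs operation and the size of the serialized
     * arguments that follow it on the socket. */
    typedef struct {
        uint32_t op;    /* operation id, e.g. a hypothetical OP_LL_LOOKUP */
        uint32_t size;  /* size in bytes of the payload that follows */
    } proxy_request_t;

    /* Answers carry the return code of the libcephfs call and the size of
     * any data returned to the client. */
    typedef struct {
        int32_t  result;
        uint32_t size;
    } proxy_answer_t;

Both sides read the fixed-size header first and then exactly ``size`` bytes of
payload, so no self-describing encoding is needed.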
In the future we may consider using Cap'n Proto (https://capnproto.org), which
claims to have zero overhead for encoding and decoding, and would provide an
easy way to support backward compatibility if the network protocol needs to be
modified.

Design of the *libcephfs_proxy.so* library
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This library will basically connect to the UNIX socket where the *libcephfsd*
daemon is listening, intercept the requests coming from the application,
serialize all function arguments and send them to the daemon. Once the daemon
responds, it will deserialize the answer and return the result to the
application.

Local caching
"""""""""""""

While the main purpose of this library is to avoid independent caches in each
process, some preliminary testing has shown a big performance drop for
workloads based on metadata operations and/or small files when all requests go
through the proxy daemon. To minimize this, metadata caching should be
implemented. The metadata cache is much smaller than the data cache and will
provide a good trade-off between memory usage and performance.

To implement caching in a safe way, it's required to correctly invalidate data
before it becomes stale. Currently *libcephfs.so* provides invalidation
notifications that can be used to implement this, but their semantics are not
fully understood yet, so the cache in the *libcephfs_proxy.so* library will be
designed and implemented in a future version.

Design of the *libcephfsd* daemon
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

The daemon will be a regular process that will centralize the libcephfs
requests coming from other processes on the same machine.

Process maintenance
"""""""""""""""""""

Since the process will work as a standalone daemon, a simple systemd unit file
will be provided to manage it as a regular system service. Most probably this
will be integrated into cephadm in the future.

If the *libcephfsd* daemon crashes, we'll rely on systemd to restart it.

Special functions
^^^^^^^^^^^^^^^^^

Some functions will need to be handled in a special way inside the *libcephfsd*
daemon to provide correct functionality, since forwarding them directly to
*libcephfs.so* could return incorrect results due to the sharing of low-level
mounts.

**Sharing of the underlying struct ceph_mount_info**

The main purpose of the proxy is to avoid creating a new mount for each
process when they are accessing the same data. To provide this, we need to
"virtualize" the mount points and let each application believe it's using its
own mount when, in fact, it could be using a shared one.

The daemon will track the Ceph account used to connect to the volume, the
configuration file and any specific configuration changes made before mounting
the volume. Only if all settings are exactly the same as those of an already
mounted instance will the mount be shared. The daemon won't understand CephFS
settings or any potential dependencies between them. For this reason, a very
strict comparison will be done: for two configurations to be considered
identical, the configuration files need to be identical, and any other changes
made afterwards need to set the exact same values in the same order.

The check to determine whether two configurations are identical or not will be
done just before mounting the volume (i.e. in ``ceph_mount()``). This means
that during the configuration phase we may have many simultaneous mounts
allocated but not yet mounted. However, only one of them will become a real
mount. The others will remain unmounted and will eventually be destroyed once
their users unmount and release them.
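As an illustration, the state recorded for each instance and the strict
comparison could look like the following sketch (all structure and function
names here are hypothetical, not the actual daemon code):

.. code-block:: c

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    /* One entry per ceph_conf_get()/ceph_conf_set() call, kept in the
     * exact order in which the calls were made. */
    typedef struct config_op {
        struct config_op *next;
        char *name;   /* setting name */
        char *value;  /* value set (or value returned by a get) */
    } config_op_t;

    typedef struct {
        char *id;              /* id passed to ceph_create() */
        char *fs_name;         /* from ceph_select_filesystem() */
        uint8_t checksum[32];  /* checksum of the configuration file */
        config_op_t *ops;      /* ordered configuration history */
    } proxy_config_t;

    /* Strict comparison: everything must match, including the order of
     * the recorded operations (NULL handling omitted for brevity). */
    static bool config_equal(const proxy_config_t *a, const proxy_config_t *b)
    {
        const config_op_t *x, *y;

        if ((strcmp(a->id, b->id) != 0) ||
            (strcmp(a->fs_name, b->fs_name) != 0) ||
            (memcmp(a->checksum, b->checksum, sizeof(a->checksum)) != 0)) {
            return false;
        }

        for (x = a->ops, y = b->ops; (x != NULL) && (y != NULL);
             x = x->next, y = y->next) {
            if ((strcmp(x->name, y->name) != 0) ||
                (strcmp(x->value, y->value) != 0)) {
                return false;
            }
        }

        return (x == NULL) && (y == NULL);
    }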
The following functions will be affected:

* **ceph_create**

  This one will allocate a new *ceph_mount_info* structure, and the provided
  id will be recorded for future comparison of potentially matching mounts.

* **ceph_release**

  This one will release an unmounted *ceph_mount_info* structure. Unmounted
  structures won't be shared with anyone else.

* **ceph_conf_read_file**

  This one will read the configuration file, compute a checksum and make a
  copy. The copy guarantees that the contents won't change after the checksum
  has been computed, and the checksum will be recorded for future comparison
  of potentially matching mounts.

* **ceph_conf_get**

  This one will obtain the requested setting, recording it for future
  comparison of potentially matching mounts.

  Even though this may seem unnecessary, since the daemon treats the
  configuration as a black box, some dynamic setting could return different
  values depending on external factors, so the daemon also requires that every
  requested setting returned the same value for two configurations to be
  considered identical.

* **ceph_conf_set**

  This one will record the modified value for future comparison of potentially
  matching mounts.

  In normal circumstances, some settings may be modified even after the volume
  has been mounted. The proxy won't allow that, to avoid potential
  interference with other clients sharing the same mount.

* **ceph_init**

  This one will be a no-op. Normally, calling this function triggers the
  allocation of several resources and starts some threads, which is just a
  waste of resources if this *ceph_mount_info* structure doesn't end up being
  mounted because it matches an already existing mount.

  Only if there's no match with an existing mount at the time of mounting
  (i.e. in ``ceph_mount()``) will the mount be initialized and mounted at the
  same time.

* **ceph_select_filesystem**

  This one will record the selected file system for future comparison of
  potentially matching mounts.

* **ceph_mount**

  This one will try to find an active mount that matches all the configuration
  settings defined for this *ceph_mount_info* structure. If none is found, it
  will be mounted. Otherwise, the already existing mount will be shared with
  this client.

  The unmounted *ceph_mount_info* structures will be kept around, associated
  with the mounted one.

  All "real" mounts will be made against the absolute root of the volume
  (i.e. "/") to make sure they can be shared with other clients later,
  regardless of whether those clients use the same mount point or not. This
  means that just after mounting, the daemon will need to resolve and store
  the root inode of the "virtual" mount point, as sketched after this list.

  The CWD (Current Working Directory) will also be initialized to the same
  inode.

* **ceph_unmount**

  This one will detach the client from the mounted *ceph_mount_info* structure
  and reattach it to one of the associated unmounted structures. If this was
  the last user of the mount, it will actually be unmounted.

  After calling this function, the client goes back to using a private
  *ceph_mount_info* structure that is used exclusively by itself, so further
  configuration changes and operations can be done safely.
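The sketch below shows how a per-client "virtual mount" could sit on top of a
shared real mount, and how the root inode could be resolved just after
mounting. The structures and the helper function are hypothetical; only
``ceph_ll_walk()`` belongs to the real *libcephfs.so* API:

.. code-block:: c

    #include <string.h>
    #include <cephfs/libcephfs.h>

    /* Hypothetical: a really mounted instance, shared by many clients. */
    typedef struct {
        struct ceph_mount_info *cmount; /* always mounted on "/" */
        int refs;                       /* number of clients sharing it */
    } real_mount_t;

    /* Hypothetical: what each client sees as "its" mount. */
    typedef struct {
        real_mount_t *real;
        struct Inode *root;  /* inode of the "virtual" mount point */
        struct Inode *cwd;   /* inode of the current working directory */
        char *cwd_path;      /* path reported by ceph_getcwd() */
    } virtual_mount_t;

    /* Resolve and store the root inode of the "virtual" mount point just
     * after attaching to the real mount. */
    static int virtual_mount_init(virtual_mount_t *vm, const char *mount_point,
                                  UserPerm *perms)
    {
        struct ceph_statx stx;
        int err;

        err = ceph_ll_walk(vm->real->cmount, mount_point, &vm->root, &stx,
                           CEPH_STATX_INO, 0, perms);
        if (err < 0) {
            return err;
        }

        vm->cwd = vm->root;         /* the CWD starts at the virtual root */
        vm->cwd_path = strdup("/");

        return 0;
    }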
**Confine accesses to the intended mount point**

Since the effective mount point may not match the real mount point, some
functions could return inodes outside of the effective mount point if not
handled with care. To avoid this and provide the results that the user
application expects, we will need to simulate some of them inside the
*libcephfsd* daemon.

There are three special cases to consider:

1. Handling of paths starting with "/"
2. Handling of paths containing ".." (i.e. the parent directory)
3. Handling of paths containing symbolic links

When these special paths are found, they need to be handled in a special way
to make sure that the returned inodes are what the client expects.

The following functions will be affected:

* **ceph_ll_lookup**

  Lookup accepts ".." as the name to resolve. If the parent directory is the
  root of the "virtual" mount point (which may not be the same as the real
  mount point), we'll need to return the inode corresponding to the "virtual"
  mount point stored at the time of mounting, instead of the real parent.

* **ceph_ll_lookup_root**

  This one needs to return the root inode stored at the time of mounting.

* **ceph_ll_walk**

  This one will be completely reimplemented inside the daemon to be able to
  correctly parse each path component and symbolic link, and handle "/" and
  ".." in the correct way.

* **ceph_chdir**

  This one will resolve the passed path and store it, along with the
  corresponding inode, inside the current "virtual" mount. The real
  ``ceph_chdir()`` won't be called.

* **ceph_getcwd**

  This one will just return the path stored in the "virtual" mount by
  previous ``ceph_chdir()`` calls.

**Handle AT_FDCWD**

Any function that receives a file descriptor could also receive the special
*AT_FDCWD* value. These functions need to check for that value and use the
"virtual" CWD instead.

Testing
-------

The proxy should be transparent to any application already using
*libcephfs.so*. This also applies to testing scripts and applications, so any
existing test against the regular *libcephfs.so* library can also be used to
test the proxy.
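As an example, a trivial smoke test like the one below (a stand-in for any
real test program; it only uses functions from the public *libcephfs.so* API)
should behave identically whether it's linked against *libcephfs.so*, linked
against *libcephfs_proxy.so*, or run with *libcephfs_proxy.so* preloaded via
*LD_PRELOAD*:

.. code-block:: c

    #include <stdio.h>
    #include <cephfs/libcephfs.h>

    int main(void)
    {
        struct ceph_mount_info *cmount;

        /* Short-circuit evaluation stops at the first failing call. */
        if ((ceph_create(&cmount, NULL) < 0) ||
            (ceph_conf_read_file(cmount, NULL) < 0) || /* default ceph.conf */
            (ceph_mount(cmount, "/") < 0)) {
            fprintf(stderr, "failed to mount the volume\n");
            return 1;
        }

        printf("mounted, cwd is '%s'\n", ceph_getcwd(cmount));

        ceph_unmount(cmount);
        ceph_release(cmount);

        return 0;
    }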