>=20.0.0

* RGW: The User Account feature introduced in Squid provides first-class support for IAM APIs and policy. Our preliminary STS support was instead based on tenants, and exposed some IAM APIs to admins only. This tenant-level IAM functionality is now deprecated in favor of accounts. While we'll continue to support the tenant feature itself for namespace isolation, the following features will be removed no sooner than the V release:
  * tenant-level IAM APIs like CreateRole, PutRolePolicy and PutUserPolicy,
  * use of tenant names instead of accounts in IAM policy documents,
  * interpretation of IAM policy without cross-account policy evaluation,
  * S3 API support for cross-tenant names such as `Bucket='tenant:bucketname'`
* RBD: All Python APIs that produce timestamps now return "aware" `datetime` objects instead of "naive" ones (i.e. those including time zone information instead of those not including it). All timestamps remain in UTC, but including `timezone.utc` makes this explicit and avoids the returned timestamp being misinterpreted: in Python 3, many `datetime` methods treat "naive" `datetime` objects as local times.
* RBD: `rbd group info` and `rbd group snap info` commands are introduced to show information about a group and a group snapshot respectively.
* RBD: `rbd group snap ls` output now includes the group snapshot IDs. The header of the column showing the state of a group snapshot in the unformatted CLI output is changed from 'STATUS' to 'STATE'. The state of a group snapshot that was shown as 'ok' is now shown as 'complete', which is more descriptive.
* Based on tests performed at scale on an HDD based Ceph cluster, it was found that scheduling with mClock was not optimal with multiple OSD shards. For example, in the test cluster with multiple OSD node failures, the client throughput was found to be inconsistent across test runs coupled with multiple reported slow requests.
  However, the same test with a single OSD shard and with multiple worker threads yielded significantly better results in terms of consistency of client and recovery throughput across multiple test runs. Therefore, as an interim measure until the issue with multiple OSD shards (or multiple mClock queues per OSD) is investigated and fixed, the following changes to the default option values have been made:
  - osd_op_num_shards_hdd = 1 (was 5)
  - osd_op_num_threads_per_shard_hdd = 5 (was 1)
  For more details see https://tracker.ceph.com/issues/66289.
* MGR: The Ceph Manager's always-on modules/plugins can now be force-disabled. This can be necessary in cases where we wish to prevent the manager from being flooded by module commands when Ceph services are down or degraded.
* CephFS: Modifying the setting "max_mds" when a cluster is unhealthy now requires users to pass the confirmation flag (--yes-i-really-mean-it). This has been added as a precaution to tell the users that modifying "max_mds" may not help with troubleshooting or recovery efforts. Instead, it might further destabilize the cluster.
* RADOS: Added convenience function `librados::AioCompletion::cancel()` with the same behavior as `librados::IoCtx::aio_cancel()`.
* mgr/restful, mgr/zabbix: both modules, already deprecated since 2020, have finally been removed. They have not been actively maintained in recent years, and started suffering from vulnerabilities in their dependency chain (e.g.: CVE-2023-46136). As alternatives, for the `restful` module, the `dashboard` module provides a richer and better maintained RESTful API. Regarding the `zabbix` module, there are alternative monitoring solutions, like `prometheus`, which is the most widely adopted among the Ceph user community.
* CephFS: EOPNOTSUPP (Operation not supported) is now returned by the CephFS fuse client for `fallocate` for the default case (i.e. mode == 0) since CephFS does not support disk space reservation.
  The only flags supported are `FALLOC_FL_KEEP_SIZE` and `FALLOC_FL_PUNCH_HOLE`.
* pybind/rados: Fixed the reversed order of the `offset` and `length` arguments in WriteOp.zero(). Previously, when pybind called WriteOp.zero(), the arguments passed did not match rados_write_op_zero: offset and length were swapped, which resulted in an unexpected response.
* The HeadBucket API now reports the `X-RGW-Bytes-Used` and `X-RGW-Object-Count` headers only when the `read-stats` querystring is explicitly included in the API request.

>=19.2.1

* CephFS: Command `fs subvolume create` now allows tagging subvolumes through option `--earmark` with a unique identifier needed for NFS or SMB services. The earmark string for a subvolume is empty by default. To remove an already present earmark, an empty string can be assigned to it. Additionally, commands `ceph fs subvolume earmark set`, `ceph fs subvolume earmark get` and `ceph fs subvolume earmark rm` have been added to set, get and remove an earmark from a given subvolume.
* RADOS: A performance bottleneck in the balancer mgr module has been fixed. Related Tracker: https://tracker.ceph.com/issues/68657

>=19.0.0

* cephx: key rotation is now possible using `ceph auth rotate`. Previously, this was only possible by deleting and then recreating the key.
* Ceph: a new --daemon-output-file switch is available for `ceph tell` commands to dump output to a file local to the daemon. For commands which produce large amounts of output, this avoids a potential spike in memory usage on the daemon, allows for faster streaming writes to a file local to the daemon, and reduces time holding any locks required to execute the command. For analysis, it is necessary to retrieve the file from the host running the daemon manually. Currently, only --format=json|json-pretty are supported.
* RGW: GetObject and HeadObject requests now return an x-rgw-replicated-at header for replicated objects.
  This timestamp can be compared against the Last-Modified header to determine how long the object took to replicate.
* The cephfs-shell utility is now packaged for RHEL / CentOS / Rocky 9 as required Python dependencies are now available in EPEL9.
* RGW: S3 multipart uploads using Server-Side Encryption now replicate correctly in multi-site deployments. Previously, replicas of such objects were corrupted on decryption. A new tool, ``radosgw-admin bucket resync encrypted multipart``, can be used to identify these original multipart uploads. The ``LastModified`` timestamp of any identified object is incremented by one ns to cause peer zones to replicate it again. For multi-site deployments that make use of Server-Side Encryption, we recommend running this command against every bucket in every zone after all zones have upgraded.
* Tracing: The blkin tracing feature (see https://docs.ceph.com/en/reef/dev/blkin/) is now deprecated in favor of Opentracing (https://docs.ceph.com/en/reef/dev/developer_guide/jaegertracing/) and will be removed in a later release.
* RGW: Introducing a new data layout for the Topic metadata associated with S3 Bucket Notifications, where each Topic is stored as a separate RADOS object and the bucket notification configuration is stored in a bucket attribute. This new representation supports multisite replication via metadata sync and can scale to many topics. This is on by default for new deployments, but is not enabled by default on upgrade. Once all radosgws have upgraded (on all zones in a multisite configuration), the ``notification_v2`` zone feature can be enabled to migrate to the new format. See https://docs.ceph.com/en/squid/radosgw/zone-features for details. The "v1" format is now considered deprecated and may be removed after 2 major releases.
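As a rough illustration of the `x-rgw-replicated-at` note above, the following sketch computes the replication delay from the two response headers. It assumes that both headers carry HTTP-date strings (the format used by `Last-Modified`); the format of `x-rgw-replicated-at` is an assumption here and should be checked against a real response. The `replication_delay` helper and the header values are hypothetical.

```python
from datetime import timedelta
from email.utils import parsedate_to_datetime


def replication_delay(headers: dict) -> timedelta:
    """Return replicated-at minus Last-Modified for one object's response headers."""
    # Assumption: both values are HTTP-date strings such as
    # "Tue, 02 Jan 2024 10:00:00 GMT".
    last_modified = parsedate_to_datetime(headers["Last-Modified"])
    replicated_at = parsedate_to_datetime(headers["x-rgw-replicated-at"])
    return replicated_at - last_modified


# Hypothetical header values, not captured from a real cluster.
example_headers = {
    "Last-Modified": "Tue, 02 Jan 2024 10:00:00 GMT",
    "x-rgw-replicated-at": "Tue, 02 Jan 2024 10:00:42 GMT",
}
delay = replication_delay(example_headers)
print(delay.total_seconds())
```

The headers dictionary could come from any S3 client's GetObject or HeadObject response on a zone that received the object through replication.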
* CephFS: The MDS now evicts clients that are not advancing their request tids. Such clients cause a large buildup of session metadata, which in turn results in the MDS going read-only due to RADOS operations exceeding the size threshold. The `mds_session_metadata_threshold` config option controls the maximum size to which (encoded) session metadata can grow.
* CephFS: A new "mds last-seen" command is available for querying the last time an MDS was in the FSMap, subject to a pruning threshold.
* CephFS: For clusters with multiple CephFS file systems, all snap-schedule commands now expect the '--fs' argument.
* CephFS: The period specifier ``m`` now implies minutes and the period specifier ``M`` now implies months. This is consistent with the rest of the system.
* RGW: New tools have been added to radosgw-admin for identifying and correcting issues with versioned bucket indexes. Historical bugs with the versioned bucket index transaction workflow made it possible for the index to accumulate extraneous "book-keeping" olh entries and plain placeholder entries. In some specific scenarios where clients made concurrent requests referencing the same object key, it was likely that extra index entries would accumulate. When a significant number of these entries are present in a single bucket index shard, they can cause high bucket listing latency and lifecycle processing failures. To check whether a versioned bucket has unnecessary olh entries, users can now run ``radosgw-admin bucket check olh``. If the ``--fix`` flag is used, the extra entries will be safely removed.
  An additional issue is that some versioned buckets may maintain extra unlinked objects that are not listable via the S3/Swift APIs. These extra objects are typically a result of PUT requests that exited abnormally in the middle of a bucket index transaction, and thus the client would not have received a successful response.
  Bugs in prior releases made these unlinked objects easy to reproduce with any PUT request made on a bucket that was actively resharding. In certain scenarios, a client of a bucket that was a victim of this bug may find the object associated with the key to be in an inconsistent state. To check whether a versioned bucket has unlinked entries, users can now run ``radosgw-admin bucket check unlinked``. If the ``--fix`` flag is used, the unlinked objects will be safely removed.
  Finally, a third issue made it possible for versioned bucket index stats to be accounted inaccurately. The tooling for recalculating versioned bucket stats also had a bug, and was not previously capable of fixing these inaccuracies. This release resolves those issues and users can now expect that the existing ``radosgw-admin bucket check`` command will produce correct results.
  We recommend that users with versioned buckets, especially those that existed on prior releases, use these new tools to check whether their buckets are affected and to clean them up accordingly.
* RGW: The "user accounts" feature unlocks several new AWS-compatible IAM APIs for self-service management of users, keys, groups, roles, policy and more. Existing users can be adopted into new accounts. This process is optional but irreversible. See https://docs.ceph.com/en/squid/radosgw/account and https://docs.ceph.com/en/squid/radosgw/iam for details.
* RGW: On startup, radosgw and radosgw-admin now validate the ``rgw_realm`` config option. Previously, they would ignore invalid or missing realms and go on to load a zone/zonegroup in a different realm. If startup fails with a "failed to load realm" error, fix or remove the ``rgw_realm`` option.
* RGW: The radosgw-admin commands ``realm create`` and ``realm pull`` no longer set the default realm without ``--default``.
* CephFS: Running the command "ceph fs authorize" for an existing entity now upgrades the entity's capabilities instead of printing an error.
  It can now also change read/write permissions in a capability that the entity already holds. If the capability passed by the user is the same as one of the capabilities that the entity already holds, idempotency is maintained.
* CephFS: Two FS names can now be swapped, optionally along with their IDs, using the "ceph fs swap" command. The function of this API is to facilitate file system swaps for disaster recovery. In particular, it avoids situations where a named file system is temporarily missing, which would prompt a higher level storage operator (like Rook) to recreate the missing file system. See https://docs.ceph.com/en/latest/cephfs/administration/#file-systems for more information.
* CephFS: Before running the command "ceph fs rename", the filesystem to be renamed must be offline and the config "refuse_client_session" must be set for it. The config "refuse_client_session" can be removed/unset and the filesystem can be brought back online after the rename operation is complete.
* RADOS: A POOL_APP_NOT_ENABLED health warning will now be reported if the application is not enabled for the pool, irrespective of whether the pool is in use or not. Always tag a pool with an application using the ``ceph osd pool application enable`` command to avoid the POOL_APP_NOT_ENABLED health warning being reported for that pool. Users can temporarily mute this warning using ``ceph health mute POOL_APP_NOT_ENABLED``.
* The `mon_cluster_log_file_level` and `mon_cluster_log_to_syslog_level` options have been removed. Henceforth, users should use the new generic option `mon_cluster_log_level` to control the cluster log level verbosity for the cluster log file as well as for all external entities.
* CephFS: Disallow delegating preallocated inode ranges to clients. Config `mds_client_delegate_inos_pct` defaults to 0 which disables async dirops in the kclient.
* S3 Get/HeadObject now support query parameter `partNumber` to read a specific part of a completed multipart upload.
* RGW: Fixed an S3 Object Lock bug with PutObjectRetention requests that specify a RetainUntilDate after the year 2106. This date was truncated to 32 bits when stored, so a much earlier date was used for object lock enforcement. This does not affect PutBucketObjectLockConfiguration where a duration is given in Days. The RetainUntilDate encoding is fixed for new PutObjectRetention requests, but cannot repair the dates of existing object locks. Such objects can be identified with a HeadObject request based on the x-amz-object-lock-retain-until-date response header.
* RADOS: `get_pool_is_selfmanaged_snaps_mode` C++ API has been deprecated due to being prone to false negative results. Its safer replacement is `pool_is_in_selfmanaged_snaps_mode`.
* RADOS: For bug 62338 (https://tracker.ceph.com/issues/62338), in order to simplify backporting, we chose not to condition the fix on a server flag. As a result, in rare cases it may be possible for a PG to flip between two acting sets while an upgrade to a version with the fix is in progress. If you observe this behavior, you should be able to work around it by completing the upgrade or by disabling async recovery by setting osd_async_recovery_min_cost to a very large value on all OSDs until the upgrade is complete: ``ceph config set osd osd_async_recovery_min_cost 1099511627776``
* RADOS: A detailed version of the `balancer status` CLI command in the balancer module is now available. Users may run `ceph balancer status detail` to see more details about which PGs were updated in the balancer's last optimization. See https://docs.ceph.com/en/latest/rados/operations/balancer/ for more information.
* CephFS: Full support for subvolumes and subvolume groups is now available for the snap_schedule Manager module.
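The RetainUntilDate truncation described above can be illustrated with plain arithmetic. The sketch below is not the RGW code; it only models the effect of keeping the low 32 bits of an epoch-seconds value, on the assumption that the stored field behaved like an unsigned 32-bit counter (which is consistent with the year 2106 cutoff). The `truncate_to_32_bits` helper is hypothetical.

```python
from datetime import datetime, timezone

UINT32_MAX = 2**32 - 1


def truncate_to_32_bits(retain_until: datetime) -> datetime:
    """Model storing an epoch-seconds timestamp in 32 bits and reading it back."""
    seconds = int(retain_until.timestamp())
    stored = seconds & UINT32_MAX  # keep only the low 32 bits
    return datetime.fromtimestamp(stored, tz=timezone.utc)


# Largest instant that fits in an unsigned 32-bit seconds counter.
limit = datetime.fromtimestamp(UINT32_MAX, tz=timezone.utc)

requested = datetime(2107, 1, 1, tzinfo=timezone.utc)
enforced = truncate_to_32_bits(requested)

print("32-bit limit:", limit.isoformat())
print("requested:   ", requested.isoformat())
print("enforced:    ", enforced.isoformat())
```

Under this model, any date up to the limit round-trips unchanged, while a later date wraps around to a much earlier one, which matches the symptom described in the note.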
* RGW: The SNS CreateTopic API now enforces the same topic naming requirements as AWS: Topic names must be made up of only uppercase and lowercase ASCII letters, numbers, underscores, and hyphens, and must be between 1 and 256 characters long.
* RBD: When diffing against the beginning of time (`fromsnapname == NULL`) in fast-diff mode (`whole_object == true` with `fast-diff` image feature enabled and valid), diff-iterate is now guaranteed to execute locally if exclusive lock is available. This brings a dramatic performance improvement for QEMU live disk synchronization and backup use cases.
* RBD: The ``try-netlink`` mapping option for rbd-nbd has become the default and is now deprecated. If the NBD netlink interface is not supported by the kernel, then the mapping is retried using the legacy ioctl interface.
* RADOS: Read balancing may now be managed automatically via the balancer manager module. Users may choose between two new modes: ``upmap-read``, which offers upmap and read optimization simultaneously, or ``read``, which may be used to only optimize reads. For more detailed information see https://docs.ceph.com/en/latest/rados/operations/read-balancer/#online-optimization.
* CephFS: MDS log trimming is now driven by a separate thread which tries to trim the log every second (`mds_log_trim_upkeep_interval` config). Also, a couple of configs govern how much time the MDS spends in trimming its logs. These configs are `mds_log_trim_threshold` and `mds_log_trim_decay_rate`.
* RGW: Notification topics are now owned by the user that created them. By default, only the owner can read/write their topics. Topic policy documents are now supported to grant these permissions to other users. Preexisting topics are treated as if they have no owner, and any user can read/write them using the SNS API. If such a topic is recreated with CreateTopic, the issuing user becomes the new owner.
  For backward compatibility, all users still have permission to publish bucket notifications to topics owned by other users. A new configuration parameter, ``rgw_topic_require_publish_policy``, can be enabled to deny ``sns:Publish`` permissions unless explicitly granted by topic policy.
* RGW: Fixed an issue with persistent notifications: changes made to topic parameters while persistent notifications were in the queue are now reflected in those notifications. If a user sets up a topic with an incorrect configuration (password/ssl) that causes delivery to the broker to fail, they can now modify the incorrect topic attribute, and the new configuration will be used on the next delivery retry.
* RBD: The option ``--image-id`` has been added to the `rbd children` CLI command, so it can be run for images in the trash.
* PG dump: The default output of `ceph pg dump --format json` has changed. The default json format produces a rather massive output in large clusters and isn't scalable, so we have removed the 'network_ping_times' section from the output. Details in the tracker: https://tracker.ceph.com/issues/57460
* mgr/REST: The REST manager module will trim requests based on the 'max_requests' option. Without this feature, and in the absence of manual deletion of old requests, the accumulation of requests in the array can lead to Out Of Memory (OOM) issues, resulting in the Manager crashing.
* CephFS: The `subvolume snapshot clone` command now depends on the config option `snapshot_clone_no_wait` which is used to reject the clone operation when all the cloner threads are busy. This config option is enabled by default which means that if no cloner threads are free, the clone request errors out with EAGAIN.
  The value of the config option can be fetched by using: `ceph config get mgr mgr/volumes/snapshot_clone_no_wait` and it can be disabled by using: `ceph config set mgr mgr/volumes/snapshot_clone_no_wait false`
* RBD: `RBD_IMAGE_OPTION_CLONE_FORMAT` option has been exposed in Python bindings via `clone_format` optional parameter to `clone`, `deep_copy` and `migration_prepare` methods.
* RBD: `RBD_IMAGE_OPTION_FLATTEN` option has been exposed in Python bindings via `flatten` optional parameter to `deep_copy` and `migration_prepare` methods.
* CephFS: Commands "ceph mds fail" and "ceph fs fail" now require a confirmation flag when some MDSs exhibit health warning MDS_TRIM or MDS_CACHE_OVERSIZED. This is to prevent accidental MDS failover causing further delays in recovery.
* CephFS: fixes to the implementation of the ``root_squash`` mechanism enabled via cephx ``mds`` caps on a client credential require a new client feature bit, ``client_mds_auth_caps``. Clients using credentials with ``root_squash`` without this feature will trigger the MDS to raise a HEALTH_ERR on the cluster, MDS_CLIENTS_BROKEN_ROOTSQUASH. See the documentation on this warning and the new feature bit for more information.
* CephFS: Expanded removexattr support for cephfs virtual extended attributes. Previously one had to use setxattr to restore the default in order to "remove". You may now properly use removexattr to remove. You can also now remove the layout on the root inode, which restores it to the default layout.
* cls_cxx_gather is marked as deprecated.
* CephFS: cephfs-journal-tool is guarded against running on an online file system. The 'cephfs-journal-tool --rank : journal reset' and 'cephfs-journal-tool --rank : journal reset --force' commands require '--yes-i-really-really-mean-it'.
* Dashboard: Rearranged Navigation Layout: The navigation layout has been reorganized for improved usability and easier access to key features.
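With `snapshot_clone_no_wait` enabled, callers of `subvolume snapshot clone` need to expect EAGAIN and try again later. A minimal sketch of such a retry loop follows. How the error surfaces depends on how the command is invoked, so it is modeled here as an `OSError` carrying `errno.EAGAIN`; that modeling, the `retry_on_eagain` helper, and the `fake_clone` stand-in are all assumptions made for illustration.

```python
import errno
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_eagain(operation: Callable[[], T], attempts: int = 5,
                    initial_delay: float = 1.0) -> T:
    """Call operation(), retrying with exponential backoff while it fails with EAGAIN."""
    delay = initial_delay
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except OSError as exc:
            # Re-raise anything that is not EAGAIN, and give up after the last attempt.
            if exc.errno != errno.EAGAIN or attempt == attempts:
                raise
            time.sleep(delay)
            delay *= 2
    raise ValueError("attempts must be at least 1")


# Hypothetical stand-in for a clone request: busy twice, then accepted.
calls = {"count": 0}


def fake_clone() -> str:
    calls["count"] += 1
    if calls["count"] < 3:
        raise OSError(errno.EAGAIN, "all cloner threads are busy")
    return "clone accepted"


print(retry_on_eagain(fake_clone, initial_delay=0.01))
```

Backing off between attempts gives busy cloner threads time to finish; the alternative is to disable the option so that clone requests queue instead of failing.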
* Dashboard: CephFS Improvements
  * Support for managing CephFS snapshots and clones, as well as snapshot schedule management
  * Manage authorization capabilities for CephFS resources
  * Helpers on mounting a CephFS volume
* Dashboard: RGW Improvements
  * Support for managing bucket policies
  * Add/Remove bucket tags
  * ACL Management
  * Several UI/UX Improvements to the bucket form
* Monitoring: Grafana dashboards are now loaded into the container at runtime rather than building a grafana image with the grafana dashboards. Official Ceph grafana images can be found in quay.io/ceph/grafana
* Monitoring: RGW S3 Analytics: A new Grafana dashboard is now available, enabling you to visualize per bucket and user analytics data, including total GETs, PUTs, Deletes, Copies, and list metrics.
* RBD: `Image::access_timestamp` and `Image::modify_timestamp` Python APIs now return timestamps in UTC.
* RBD: Support for cloning from non-user type snapshots is added. This is intended primarily as a building block for cloning new groups from group snapshots created with the `rbd group snap create` command, but has also been exposed via the new `--snap-id` option for the `rbd clone` command.
* RBD: The output of the `rbd snap ls --all` command now includes the original type for trashed snapshots.
* CephFS: "ceph fs clone status" command will now print statistics about clone progress in terms of how much data has been cloned (in both percentage as well as bytes) and how many files have been cloned.
* CephFS: "ceph status" command will now print a progress bar when cloning is ongoing. If there are more clone jobs than cloner threads, it will print one more progress bar that shows the total amount of progress made by both ongoing as well as pending clones. Both progress bars are accompanied by messages that show the number of clone jobs in the respective categories and the amount of progress made by each of them.
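For the RBD timestamp note above, a timestamp that is in UTC but carries no time zone information is easy to misread, because Python treats naive `datetime` objects as local time in calls such as `timestamp()`. The sketch below shows one way a caller might normalize such a value; the `as_aware_utc` helper and the sample value are hypothetical and no RBD call is made.

```python
from datetime import datetime, timezone


def as_aware_utc(ts: datetime) -> datetime:
    """Attach UTC to a naive timestamp known to be UTC; convert aware ones to UTC."""
    if ts.tzinfo is None:
        return ts.replace(tzinfo=timezone.utc)
    return ts.astimezone(timezone.utc)


# Hypothetical value standing in for what a timestamp API might return.
naive_utc = datetime(2024, 1, 1, 0, 0, 0)
aware_utc = as_aware_utc(naive_utc)

print(aware_utc.isoformat())
print(aware_utc.timestamp())
```

Calling `naive_utc.timestamp()` directly would instead apply the local time zone of the machine running the script, so the result would vary from host to host.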
* RGW: in bucket notifications, the `principalId` inside `ownerIdentity` now contains the complete user id, prefixed with the tenant id
* NFS: The export create/apply of CephFS based exports will now have an additional parameter `cmount_path` under the FSAL block, which specifies the path within the CephFS to mount this export on. If this and the other `EXPORT { FSAL {} }` options are the same between multiple exports, those exports will share a single CephFS client. If not specified, the default is `/`.
* CephFS: MDS emits a warning with estimated replay completion time when replay runs for more than 30 seconds.

>=18.0.0

* The RGW policy parser now rejects unknown principals by default. If you are mirroring policies between RGW and AWS, you may wish to set "rgw policy reject invalid principals" to "false". This affects only newly set policies, not policies that are already in place.
* The CephFS automatic metadata load (sometimes called "default") balancer is now disabled by default. The new file system flag `balance_automate` can be used to toggle it on or off. It can be enabled or disabled via `ceph fs set balance_automate `.
* RGW's default backend for `rgw_enable_ops_log` changed from RADOS to file. The default value of `rgw_ops_log_rados` is now false, and `rgw_ops_log_file_path` defaults to "/var/log/ceph/ops-log-$cluster-$name.log".
* The SPDK backend for BlueStore is now able to connect to an NVMeoF target. Please note that this is not an officially supported feature.
* RGW's pubsub interface now returns boolean fields using bool. Before this change, `/topics/` returned "stored_secret" and "persistent" using a string of "true" or "false" with quotes around them. After this change, these fields are returned without quotes so they can be decoded as boolean values in JSON. The same applies to the `is_truncated` field returned by `/subscriptions/`.
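Scripts that must work against both older and newer RGW versions can accept either encoding of these fields. The following sketch assumes a small hypothetical helper, `parse_flag`, and uses made-up payloads shaped like the fields named above; it is not taken from any RGW client library.

```python
import json


def parse_flag(value) -> bool:
    """Accept a JSON boolean or the legacy quoted "true"/"false" string."""
    if isinstance(value, bool):
        return value
    if isinstance(value, str) and value in ("true", "false"):
        return value == "true"
    raise ValueError(f"not a boolean field: {value!r}")


# Hypothetical payloads shaped like the fields named above.
old_style = json.loads('{"stored_secret": "false", "persistent": "true"}')
new_style = json.loads('{"stored_secret": false, "persistent": true}')

for payload in (old_style, new_style):
    print({key: parse_flag(val) for key, val in payload.items()})
```

An explicit comparison is used because `bool("false")` is `True` in Python, so passing the legacy string straight to `bool()` would silently give the wrong answer.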
* RGW's response of `Action=GetTopicAttributes&TopicArn=` REST API now returns `HasStoredSecret` and `Persistent` as boolean in the JSON string encoded in `Attributes/EndPoint`.
* All boolean fields previously rendered as string by `rgw-admin` command when the JSON format is used are now rendered as boolean. If your scripts/tools rely on this behavior, please update them accordingly. The impacted field names are:
  * absolute
  * add
  * admin
  * appendable
  * bucket_key_enabled
  * delete_marker
  * exists
  * has_bucket_info
  * high_precision_time
  * index
  * is_master
  * is_prefix
  * is_truncated
  * linked
  * log_meta
  * log_op
  * pending_removal
  * read_only
  * retain_head_object
  * rule_exist
  * start_with_full_sync
  * sync_from_all
  * syncstopped
  * system
  * truncated
  * user_stats_sync
* RGW: The beast frontend's HTTP access log line uses a new debug_rgw_access configurable. This has the same defaults as debug_rgw, but can now be controlled independently.
* RBD: The semantics of compare-and-write C++ API (`Image::compare_and_write` and `Image::aio_compare_and_write` methods) now match those of C API. Both compare and write steps operate only on `len` bytes even if the respective buffers are larger. The previous behavior of comparing up to the size of the compare buffer was prone to subtle breakage upon straddling a stripe unit boundary.
* RBD: compare-and-write operation is no longer limited to 512-byte sectors. Assuming proper alignment, it now allows operating on stripe units (4M by default).
* RBD: New `rbd_aio_compare_and_writev` API method to support scatter/gather on both compare and write buffers. This complements existing `rbd_aio_readv` and `rbd_aio_writev` methods.
* The 'AT_NO_ATTR_SYNC' macro is deprecated, please use the standard 'AT_STATX_DONT_SYNC' macro. The 'AT_NO_ATTR_SYNC' macro will be removed in the future.
* Trimming of PGLog dups is now controlled by the size instead of the version.
  This fixes the PGLog inflation issue that was happening when the on-line (in OSD) trimming got jammed after a PG split operation. Also, a new off-line mechanism has been added: `ceph-objectstore-tool` gained a `trim-pg-log-dups` op that targets situations where an OSD is unable to boot due to those inflated dups. If that is the case, the "You can be hit by THE DUPS BUG" warning will be visible in the OSD logs. Relevant tracker: https://tracker.ceph.com/issues/53729
* RBD: `rbd device unmap` command gained `--namespace` option. Support for namespaces was added to RBD in Nautilus 14.2.0 and it has been possible to map and unmap images in namespaces using the `image-spec` syntax since then but the corresponding option available in most other commands was missing.
* RGW: Compression is now supported for objects uploaded with Server-Side Encryption. When both are enabled, compression is applied before encryption. Earlier releases of multisite do not replicate such objects correctly, so all zones must upgrade to Reef before enabling the `compress-encrypted` zonegroup feature: see https://docs.ceph.com/en/reef/radosgw/multisite/#zone-features and note the security considerations.
* RGW: the "pubsub" functionality for storing bucket notifications inside Ceph is removed. Together with it, the "pubsub" zone should not be used anymore. The REST operations and the radosgw-admin commands for manipulating subscriptions, as well as for fetching and acking the notifications, are removed as well. In case the endpoint to which the notifications are sent may be down or disconnected, it is recommended to use persistent notifications to guarantee the delivery of the notifications. In case the system that consumes the notifications needs to pull them (instead of the notifications being pushed to it), an external message bus (e.g. rabbitmq, Kafka) should be used for that purpose.
* RGW: The serialized format of notification and topics has changed, so that new/updated topics will be unreadable by old RGWs. We recommend completing the RGW upgrades before creating or modifying any notification topics.
* RBD: Trailing newline in passphrase files (`` argument in `rbd encryption format` command and `--encryption-passphrase-file` option in other commands) is no longer stripped.
* RBD: Support for layered client-side encryption is added. Cloned images can now be encrypted each with its own encryption format and passphrase, potentially different from that of the parent image. The efficient copy-on-write semantics intrinsic to unformatted (regular) cloned images are retained.
* CEPHFS: Renamed the `mds_max_retries_on_remount_failure` option to `client_max_retries_on_remount_failure` and moved it from mds.yaml.in to mds-client.yaml.in because this option has only ever been used by the MDS client.
* The `perf dump` and `perf schema` commands are deprecated in favor of new `counter dump` and `counter schema` commands. These new commands add support for labeled perf counters and also emit existing unlabeled perf counters. Some unlabeled perf counters became labeled in this release, with more to follow in future releases; such converted perf counters are no longer emitted by the `perf dump` and `perf schema` commands.
* `ceph mgr dump` command now outputs `last_failure_osd_epoch` and `active_clients` fields at the top level. Previously, these fields were output under `always_on_modules` field.
* `ceph mgr dump` command now displays the name of the mgr module that registered a RADOS client in the `name` field added to elements of the `active_clients` array. Previously, only the address of a module's RADOS client was shown in the `active_clients` array.
* RBD: All rbd-mirror daemon perf counters became labeled and as such are now emitted only by the new `counter dump` and `counter schema` commands.
  As part of the conversion, many also got renamed to better disambiguate journal-based and snapshot-based mirroring.
* RBD: list-watchers C++ API (`Image::list_watchers`) now clears the passed `std::list` before potentially appending to it, aligning with the semantics of the corresponding C API (`rbd_watchers_list`).
* The rados python binding is now able to process (opt-in) omap keys as bytes objects. This enables interacting with RADOS omap keys that are not decodeable as UTF-8 strings.
* Telemetry: Users who are opted-in to telemetry can also opt-in to participating in a leaderboard in the telemetry public dashboards (https://telemetry-public.ceph.com/). Users can now also add a description of the cluster to publicly appear in the leaderboard. For more details, see: https://docs.ceph.com/en/latest/mgr/telemetry/#leaderboard See a sample report with `ceph telemetry preview`. Opt-in to telemetry with `ceph telemetry on`. Opt-in to the leaderboard with `ceph config set mgr mgr/telemetry/leaderboard true`. Add leaderboard description with: `ceph config set mgr mgr/telemetry/leaderboard_description 'Cluster description'`.
* CEPHFS: After recovering a Ceph File System by following the disaster recovery procedure, the recovered files under the `lost+found` directory can now be deleted.
* core: cache-tiering is now deprecated.
* mClock Scheduler: The mClock scheduler (default scheduler in Quincy) has undergone significant usability and design improvements to address the slow backfill issue. Some important changes are:
  * The 'balanced' profile is set as the default mClock profile because it represents a compromise between prioritizing client IO or recovery IO. Users can then choose either the 'high_client_ops' profile to prioritize client IO or the 'high_recovery_ops' profile to prioritize recovery IO.
  * QoS parameters like reservation and limit are now specified in terms of a fraction (range: 0.0 to 1.0) of the OSD's IOPS capacity.
  * The cost parameters (osd_mclock_cost_per_io_usec_* and osd_mclock_cost_per_byte_usec_*) have been removed. The cost of an operation is now determined using the random IOPS and maximum sequential bandwidth capability of the OSD's underlying device.
  * Degraded object recovery is given higher priority than misplaced object recovery because degraded objects present a data safety issue not present with objects that are merely misplaced. Therefore, backfilling operations with the 'balanced' and 'high_client_ops' mClock profiles may progress more slowly than what was seen with the 'WeightedPriorityQueue' (WPQ) scheduler.
  * The QoS allocations in all the mClock profiles are optimized based on the above fixes and enhancements.
  * For more detailed information see: https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/
* mgr/snap_schedule: The snap-schedule mgr module now retains one fewer snapshot than the number set in the config tunable `mds_max_snaps_per_dir` so that a new snapshot can be created and retained during the next schedule run.
* The output of `ceph config dump` with the `--format` option now displays localized option names instead of their normalized versions. For example, "mgr/prometheus/x/server_port" will be displayed instead of "mgr/prometheus/server_port". This matches the output of the non pretty-print formatted version of the command.
* CEPHFS: The MDS config option name "mds_kill_skip_replaying_inotable" is easily confused with "mds_inject_skip_replaying_inotable", so it is renamed to "mds_kill_after_journal_logs_flushed".

>=17.2.1

* The "BlueStore zero block detection" feature (first introduced to Quincy in https://github.com/ceph/ceph/pull/43337) has been turned off by default with a new global configuration option called `bluestore_zero_block_detection`. This feature, intended for large-scale synthetic testing, does not interact well with some RBD and CephFS features.
  Any side effects experienced in previous Quincy versions no longer occur, provided that the configuration remains set to false.
  Relevant tracker: https://tracker.ceph.com/issues/55521
* telemetry: Added new Rook metrics to the 'basic' channel to report Rook's version, Kubernetes version, node metrics, etc. See a sample report with `ceph telemetry preview`. Opt in with `ceph telemetry on`.
  For more details, see: https://docs.ceph.com/en/latest/mgr/telemetry/
* OSD: The issue of high CPU utilization during recovery/backfill operations has been fixed. For more details, see: https://tracker.ceph.com/issues/56530.

>=15.2.17

* OSD: Octopus modified the SnapMapper key format from __ to ___. When this change was introduced, commit 94ebe0e also introduced a conversion with a crucial bug which essentially destroyed legacy keys by mapping them to __ without the object-unique suffix. The conversion is fixed in this release.
  Relevant tracker: https://tracker.ceph.com/issues/56147
* Cephadm may now be configured to carry out CephFS MDS upgrades without reducing ``max_mds`` to 1. Previously, Cephadm would reduce ``max_mds`` to 1 to avoid having two active MDS daemons modifying on-disk structures with new versions, communicating cross-version-incompatible messages, or other potential incompatibilities. This could be disruptive for large-scale CephFS deployments because the cluster cannot easily reduce active MDS daemons to 1.
  NOTE: A staggered upgrade of the mons/mgrs may be necessary to take advantage of this feature. For how to perform it, see: https://docs.ceph.com/en/quincy/cephadm/upgrade/#staggered-upgrade
  Relevant tracker: https://tracker.ceph.com/issues/55715
* Introduced a new file system flag `refuse_client_session` that can be set using the `fs set` command. This flag blocks any incoming session request from clients. This can be useful during some recovery situations where it is desirable to bring the MDS up but have no client workload.
  Relevant tracker: https://tracker.ceph.com/issues/57090
* New MDSMap field `max_xattr_size`, which can be set using the `fs set` command. This MDSMap field configures the maximum size allowed for the full key/value set of a file system's extended attributes. It effectively replaces the old per-MDS `max_xattr_pairs_size` setting, which is now dropped.
  Relevant tracker: https://tracker.ceph.com/issues/55725
* Introduced a new file system flag `refuse_standby_for_another_fs` that can be set using the `fs set` command. This flag prevents using a standby for another file system (join_fs = X) when a standby for the current file system is not available.
  Relevant tracker: https://tracker.ceph.com/issues/61599
* mon: Added the NVMe-oF gateway monitor and HA. This adds high availability support for the nvmeof Ceph service. High availability means that even if a given gateway (GW) is down, another path remains available so that the initiator can continue IO through another GW. It also adds two new mon commands to notify the monitor about gateway creation and deletion:
  - nvme-gw create
  - nvme-gw delete
  Relevant tracker: https://tracker.ceph.com/issues/64777
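
As a rough illustration of the `fs set` items above (`refuse_client_session`, `max_xattr_size`, `refuse_standby_for_another_fs`), the following sketch only builds the command lines and does not run them. It assumes the usual `ceph fs set <fs_name> <var> <val>` form. The helper `fs_set_command`, the file system name `cephfs`, and the value 65536 are hypothetical and chosen for illustration only; check the documentation for your release before applying any of them.

```python
import shlex


def fs_set_command(fs_name, var, value):
    """Build the argv for `ceph fs set <fs_name> <var> <val>`.

    Hypothetical helper: booleans are rendered as "true"/"false",
    everything else via str(). Nothing is executed here.
    """
    if isinstance(value, bool):
        rendered = "true" if value else "false"
    else:
        rendered = str(value)
    return ["ceph", "fs", "set", fs_name, var, rendered]


if __name__ == "__main__":
    # "cephfs" and 65536 are illustrative assumptions, not recommendations.
    settings = [
        ("refuse_client_session", True),
        ("refuse_standby_for_another_fs", True),
        ("max_xattr_size", 65536),
    ]
    for var, value in settings:
        # Print a copy-pasteable command line instead of running it.
        print(shlex.join(fs_set_command("cephfs", var, value)))
```

Printing the commands rather than executing them keeps the sketch safe to run on any machine; an operator can review each line before applying it to a cluster.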