1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
|
======================
Troubleshooting OSDs
======================
Before troubleshooting the cluster's OSDs, check the monitors
and the network.
First, determine whether the monitors have a quorum. Run the ``ceph health``
command or the ``ceph -s`` command and if Ceph shows ``HEALTH_OK`` then there
is a monitor quorum.
If the monitors don't have a quorum or if there are errors with the monitor
status, address the monitor issues before proceeding by consulting the material
in `Troubleshooting Monitors <../troubleshooting-mon>`_.
Next, check your networks to make sure that they are running properly. Networks
can have a significant impact on OSD operation and performance. Look for
dropped packets on the host side and CRC errors on the switch side.
Obtaining Data About OSDs
=========================
When troubleshooting OSDs, it is useful to collect different kinds of
information about the OSDs. Some information comes from the practice of
`monitoring OSDs`_ (for example, by running the ``ceph osd tree`` command).
Additional information concerns the topology of your cluster, and is discussed
in the following sections.
Ceph Logs
---------
Ceph log files are stored under ``/var/log/ceph``. Unless the path has been
changed (or you are in a containerized environment that stores logs in a
different location), the log files can be listed by running the following
command:
.. prompt:: bash
ls /var/log/ceph
If there is not enough log detail, change the logging level. To ensure that
Ceph performs adequately under high logging volume, see `Logging and
Debugging`_.
Admin Socket
------------
Use the admin socket tool to retrieve runtime information. First, list the
sockets of Ceph's daemons by running the following command:
.. prompt:: bash
ls /var/run/ceph
Next, run a command of the following form (replacing ``{daemon-name}`` with the
name of a specific daemon: for example, ``osd.0``):
.. prompt:: bash
ceph daemon {daemon-name} help
Alternatively, run the command with a ``{socket-file}`` specified (a "socket
file" is a specific file in ``/var/run/ceph``):
.. prompt:: bash
ceph daemon {socket-file} help
The admin socket makes many tasks possible, including:
- Listing Ceph configuration at runtime
- Dumping historic operations
- Dumping the operation priority queue state
- Dumping operations in flight
- Dumping perfcounters
Display Free Space
------------------
Filesystem issues may arise. To display your filesystems' free space, run the
following command:
.. prompt:: bash
df -h
To see this command's supported syntax and options, run ``df --help``.
I/O Statistics
--------------
The `iostat`_ tool can be used to identify I/O-related issues. Run the
following command:
.. prompt:: bash
iostat -x
Diagnostic Messages
-------------------
To retrieve diagnostic messages from the kernel, run the ``dmesg`` command and
specify the output with ``less``, ``more``, ``grep``, or ``tail``. For
example:
.. prompt:: bash
dmesg | grep scsi
Stopping without Rebalancing
============================
It might be occasionally necessary to perform maintenance on a subset of your
cluster or to resolve a problem that affects a failure domain (for example, a
rack). However, when you stop OSDs for maintenance, you might want to prevent
CRUSH from automatically rebalancing the cluster. To avert this rebalancing
behavior, set the cluster to ``noout`` by running the following command:
.. prompt:: bash
ceph osd set noout
.. warning:: This is more a thought exercise offered for the purpose of giving
the reader a sense of failure domains and CRUSH behavior than a suggestion
that anyone in the post-Luminous world run ``ceph osd set noout``. When the
OSDs return to an ``up`` state, rebalancing will resume and the change
introduced by the ``ceph osd set noout`` command will be reverted.
In Luminous and later releases, however, it is a safer approach to flag only
affected OSDs. To add or remove a ``noout`` flag to a specific OSD, run a
command like the following:
.. prompt:: bash
ceph osd add-noout osd.0
ceph osd rm-noout osd.0
It is also possible to flag an entire CRUSH bucket. For example, if you plan to
take down ``prod-ceph-data1701`` in order to add RAM, you might run the
following command:
.. prompt:: bash
ceph osd set-group noout prod-ceph-data1701
After the flag is set, stop the OSDs and any other colocated
Ceph services within the failure domain that requires maintenance work::
systemctl stop ceph\*.service ceph\*.target
.. note:: When an OSD is stopped, any placement groups within the OSD are
marked as ``degraded``.
After the maintenance is complete, it will be necessary to restart the OSDs
and any other daemons that have stopped. However, if the host was rebooted as
part of the maintenance, they do not need to be restarted and will come back up
automatically. To restart OSDs or other daemons, use a command of the following
form:
.. prompt:: bash
sudo systemctl start ceph.target
Finally, unset the ``noout`` flag as needed by running commands like the
following:
.. prompt:: bash
ceph osd unset noout
ceph osd unset-group noout prod-ceph-data1701
Many contemporary Linux distributions employ ``systemd`` for service
management. However, for certain operating systems (especially older ones) it
might be necessary to issue equivalent ``service`` or ``start``/``stop``
commands.
.. _osd-not-running:
OSD Not Running
===============
Under normal conditions, restarting a ``ceph-osd`` daemon will allow it to
rejoin the cluster and recover.
An OSD Won't Start
------------------
If the cluster has started but an OSD isn't starting, check the following:
- **Configuration File:** If you were not able to get OSDs running from a new
installation, check your configuration file to ensure it conforms to the
standard (for example, make sure that it says ``host`` and not ``hostname``,
etc.).
- **Check Paths:** Ensure that the paths specified in the configuration
correspond to the paths for data and metadata that actually exist (for
example, the paths to the journals, the WAL, and the DB). Separate the OSD
data from the metadata in order to see whether there are errors in the
configuration file and in the actual mounts. If so, these errors might
explain why OSDs are not starting. To store the metadata on a separate block
device, partition or LVM the drive and assign one partition per OSD.
- **Check Max Threadcount:** If the cluster has a node with an especially high
number of OSDs, it might be hitting the default maximum number of threads
(usually 32,000). This is especially likely to happen during recovery.
Increasing the maximum number of threads to the maximum possible number of
threads allowed (4194303) might help with the problem. To increase the number
of threads to the maximum, run the following command:
.. prompt:: bash
sysctl -w kernel.pid_max=4194303
If this increase resolves the issue, you must make the increase permanent by
including a ``kernel.pid_max`` setting either in a file under
``/etc/sysctl.d`` or within the master ``/etc/sysctl.conf`` file. For
example::
kernel.pid_max = 4194303
- **Check ``nf_conntrack``:** This connection-tracking and connection-limiting
system causes problems for many production Ceph clusters. The problems often
emerge slowly and subtly. As cluster topology and client workload grow,
mysterious and intermittent connection failures and performance glitches
occur more and more, especially at certain times of the day. To begin taking
the measure of your problem, check the ``syslog`` history for "table full"
events. One way to address this kind of problem is as follows: First, use the
``sysctl`` utility to assign ``nf_conntrack_max`` a much higher value. Next,
raise the value of ``nf_conntrack_buckets`` so that ``nf_conntrack_buckets``
× 8 = ``nf_conntrack_max``; this action might require running commands
outside of ``sysctl`` (for example, ``"echo 131072 >
/sys/module/nf_conntrack/parameters/hashsize``). Another way to address the
problem is to blacklist the associated kernel modules in order to disable
processing altogether. This approach is powerful, but fragile. The modules
and the order in which the modules must be listed can vary among kernel
versions. Even when blacklisted, ``iptables`` and ``docker`` might sometimes
activate connection tracking anyway, so we advise a "set and forget" strategy
for the tunables. On modern systems, this approach will not consume
appreciable resources.
- **Kernel Version:** Identify the kernel version and distribution that are in
use. By default, Ceph uses third-party tools that might be buggy or come into
conflict with certain distributions or kernel versions (for example, Google's
``gperftools`` and ``TCMalloc``). Check the `OS recommendations`_ and the
release notes for each Ceph version in order to make sure that you have
addressed any issues related to your kernel.
- **Segment Fault:** If there is a segment fault, increase log levels and
restart the problematic daemon(s). If segment faults recur, search the Ceph
bug tracker `https://tracker.ceph/com/projects/ceph
<https://tracker.ceph.com/projects/ceph/>`_ and the ``dev`` and
``ceph-users`` mailing list archives `https://ceph.io/resources
<https://ceph.io/resources>`_ to see if others have experienced and reported
these issues. If this truly is a new and unique failure, post to the ``dev``
email list and provide the following information: the specific Ceph release
being run, ``ceph.conf`` (with secrets XXX'd out), your monitor status
output, and excerpts from your log file(s).
An OSD Failed
-------------
When an OSD fails, this means that a ``ceph-osd`` process is unresponsive or
has died and that the corresponding OSD has been marked ``down``. Surviving
``ceph-osd`` daemons will report to the monitors that the OSD appears to be
down, and a new status will be visible in the output of the ``ceph health``
command, as in the following example:
.. prompt:: bash
ceph health
::
HEALTH_WARN 1/3 in osds are down
This health alert is raised whenever there are one or more OSDs marked ``in``
and ``down``. To see which OSDs are ``down``, add ``detail`` to the command as in
the following example:
.. prompt:: bash
ceph health detail
::
HEALTH_WARN 1/3 in osds are down
osd.0 is down since epoch 23, last address 192.168.106.220:6800/11080
Alternatively, run the following command:
.. prompt:: bash
ceph osd tree down
If there is a drive failure or another fault that is preventing a given
``ceph-osd`` daemon from functioning or restarting, then there should be an
error message present in its log file under ``/var/log/ceph``.
If the ``ceph-osd`` daemon stopped because of a heartbeat failure or a
``suicide timeout`` error, then the underlying drive or filesystem might be
unresponsive. Check ``dmesg`` output and `syslog` output for drive errors or
kernel errors. It might be necessary to specify certain flags (for example,
``dmesg -T`` to see human-readable timestamps) in order to avoid mistaking old
errors for new errors.
If an entire host's OSDs are ``down``, check to see if there is a network
error or a hardware issue with the host.
If the OSD problem is the result of a software error (for example, a failed
assertion or another unexpected error), search for reports of the issue in the
`bug tracker <https://tracker.ceph/com/projects/ceph>`_ , the `dev mailing list
archives <https://lists.ceph.io/hyperkitty/list/dev@ceph.io/>`_, and the
`ceph-users mailing list archives
<https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/>`_. If there is no
clear fix or existing bug, then :ref:`report the problem to the ceph-devel
email list <Get Involved>`.
.. _no-free-drive-space:
No Free Drive Space
-------------------
If an OSD is full, Ceph prevents data loss by ensuring that no new data is
written to the OSD. In an properly running cluster, health checks are raised
when the cluster's OSDs and pools approach certain "fullness" ratios. The
``mon_osd_full_ratio`` threshold defaults to ``0.95`` (or 95% of capacity):
this is the point above which clients are prevented from writing data. The
``mon_osd_backfillfull_ratio`` threshold defaults to ``0.90`` (or 90% of
capacity): this is the point above which backfills will not start. The
``mon_osd_nearfull_ratio`` threshold defaults to ``0.85`` (or 85% of capacity):
this is the point at which it raises the ``OSD_NEARFULL`` health check.
OSDs within a cluster will vary in how much data is allocated to them by Ceph.
To check "fullness" by displaying data utilization for every OSD, run the
following command:
.. prompt:: bash
ceph osd df
To check "fullness" by displaying a cluster’s overall data usage and data
distribution among pools, run the following command:
.. prompt:: bash
ceph df
When examining the output of the ``ceph df`` command, pay special attention to
the **most full** OSDs, as opposed to the percentage of raw space used. If a
single outlier OSD becomes full, all writes to this OSD's pool might fail as a
result. When ``ceph df`` reports the space available to a pool, it considers
the ratio settings relative to the *most full* OSD that is part of the pool. To
flatten the distribution, two approaches are available: (1) Using the
``reweight-by-utilization`` command to progressively move data from excessively
full OSDs or move data to insufficiently full OSDs, and (2) in later revisions
of Luminous and subsequent releases, exploiting the ``ceph-mgr`` ``balancer``
module to perform the same task automatically.
To adjust the "fullness" ratios, run a command or commands of the following
form:
.. prompt:: bash
ceph osd set-nearfull-ratio <float[0.0-1.0]>
ceph osd set-full-ratio <float[0.0-1.0]>
ceph osd set-backfillfull-ratio <float[0.0-1.0]>
Sometimes full cluster issues arise because an OSD has failed. This can happen
either because of a test or because the cluster is small, very full, or
unbalanced. When an OSD or node holds an excessive percentage of the cluster's
data, component failures or natural growth can result in the ``nearfull`` and
``full`` ratios being exceeded. When testing Ceph's resilience to OSD failures
on a small cluster, it is advised to leave ample free disk space and to
consider temporarily lowering the OSD ``full ratio``, OSD ``backfillfull
ratio``, and OSD ``nearfull ratio``.
The "fullness" status of OSDs is visible in the output of the ``ceph health``
command, as in the following example:
.. prompt:: bash
ceph health
::
HEALTH_WARN 1 nearfull osd(s)
For details, add the ``detail`` command as in the following example:
.. prompt:: bash
ceph health detail
::
HEALTH_ERR 1 full osd(s); 1 backfillfull osd(s); 1 nearfull osd(s)
osd.3 is full at 97%
osd.4 is backfill full at 91%
osd.2 is near full at 87%
To address full cluster issues, it is recommended to add capacity by adding
OSDs. Adding new OSDs allows the cluster to redistribute data to newly
available storage. Search for ``rados bench`` orphans that are wasting space.
If a legacy Filestore OSD cannot be started because it is full, it is possible
to reclaim space by deleting a small number of placement group directories in
the full OSD.
.. important:: If you choose to delete a placement group directory on a full
OSD, **DO NOT** delete the same placement group directory on another full
OSD. **OTHERWISE YOU WILL LOSE DATA**. You **MUST** maintain at least one
copy of your data on at least one OSD. Deleting placement group directories
is a rare and extreme intervention. It is not to be undertaken lightly.
See `Monitor Config Reference`_ for more information.
OSDs are Slow/Unresponsive
==========================
OSDs are sometimes slow or unresponsive. When troubleshooting this common
problem, it is advised to eliminate other possibilities before investigating
OSD performance issues. For example, be sure to confirm that your network(s)
are working properly, to verify that your OSDs are running, and to check
whether OSDs are throttling recovery traffic.
.. tip:: In pre-Luminous releases of Ceph, ``up`` and ``in`` OSDs were
sometimes not available or were otherwise slow because recovering OSDs were
consuming system resources. Newer releases provide better recovery handling
by preventing this phenomenon.
Networking Issues
-----------------
As a distributed storage system, Ceph relies upon networks for OSD peering and
replication, recovery from faults, and periodic heartbeats. Networking issues
can cause OSD latency and flapping OSDs. For more information, see `Flapping
OSDs`_.
To make sure that Ceph processes and Ceph-dependent processes are connected and
listening, run the following commands:
.. prompt:: bash
netstat -a | grep ceph
netstat -l | grep ceph
sudo netstat -p | grep ceph
To check network statistics, run the following command:
.. prompt:: bash
netstat -s
Drive Configuration
-------------------
An SAS or SATA storage drive should house only one OSD, but a NVMe drive can
easily house two or more. However, it is possible for read and write throughput
to bottleneck if other processes share the drive. Such processes include:
journals / metadata, operating systems, Ceph monitors, ``syslog`` logs, other
OSDs, and non-Ceph processes.
Because Ceph acknowledges writes *after* journaling, fast SSDs are an
attractive option for accelerating response time -- particularly when using the
``XFS`` or ``ext4`` filesystems for legacy FileStore OSDs. By contrast, the
``Btrfs`` file system can write and journal simultaneously. (However, use of
``Btrfs`` is not recommended for production deployments.)
.. note:: Partitioning a drive does not change its total throughput or
sequential read/write limits. Throughput might be improved somewhat by
running a journal in a separate partition, but it is better still to run
such a journal in a separate physical drive.
.. warning:: Reef does not support FileStore. Releases after Reef do not
support FileStore. Any information that mentions FileStore is pertinent only
to the Quincy release of Ceph and to releases prior to Quincy.
Bad Sectors / Fragmented Disk
-----------------------------
Check your drives for bad blocks, fragmentation, and other errors that can
cause significantly degraded performance. Tools that are useful in checking for
drive errors include ``dmesg``, ``syslog`` logs, and ``smartctl`` (found in the
``smartmontools`` package).
.. note:: ``smartmontools`` 7.0 and late provides NVMe stat passthrough and
JSON output.
Co-resident Monitors/OSDs
-------------------------
Although monitors are relatively lightweight processes, performance issues can
result when monitors are run on the same host machine as an OSD. Monitors issue
many ``fsync()`` calls and this can interfere with other workloads. The danger
of performance issues is especially acute when the monitors are co-resident on
the same storage drive as an OSD. In addition, if the monitors are running an
older kernel (pre-3.0) or a kernel with no ``syncfs(2)`` syscall, then multiple
OSDs running on the same host might make so many commits as to undermine each
other's performance. This problem sometimes results in what is called "the
bursty writes".
Co-resident Processes
---------------------
Significant OSD latency can result from processes that write data to Ceph (for
example, cloud-based solutions and virtual machines) while operating on the
same hardware as OSDs. For this reason, making such processes co-resident with
OSDs is not generally recommended. Instead, the recommended practice is to
optimize certain hosts for use with Ceph and use other hosts for other
processes. This practice of separating Ceph operations from other applications
might help improve performance and might also streamline troubleshooting and
maintenance.
Running co-resident processes on the same hardware is sometimes called
"convergence". When using Ceph, engage in convergence only with expertise and
after consideration.
Logging Levels
--------------
Performance issues can result from high logging levels. Operators sometimes
raise logging levels in order to track an issue and then forget to lower them
afterwards. In such a situation, OSDs might consume valuable system resources to
write needlessly verbose logs onto the disk. Anyone who does want to use high logging
levels is advised to consider mounting a drive to the default path for logging
(for example, ``/var/log/ceph/$cluster-$name.log``).
Recovery Throttling
-------------------
Depending upon your configuration, Ceph may reduce recovery rates to maintain
client or OSD performance, or it may increase recovery rates to the point that
recovery impacts client or OSD performance. Check to see if the client or OSD
is recovering.
Kernel Version
--------------
Check the kernel version that you are running. Older kernels may lack updates
that improve Ceph performance.
Kernel Issues with SyncFS
-------------------------
If you have kernel issues with SyncFS, try running one OSD per host to see if
performance improves. Old kernels might not have a recent enough version of
``glibc`` to support ``syncfs(2)``.
Filesystem Issues
-----------------
In post-Luminous releases, we recommend deploying clusters with the BlueStore
back end. When running a pre-Luminous release, or if you have a specific
reason to deploy OSDs with the previous Filestore backend, we recommend
``XFS``.
We recommend against using ``Btrfs`` or ``ext4``. The ``Btrfs`` filesystem has
many attractive features, but bugs may lead to performance issues and spurious
ENOSPC errors. We do not recommend ``ext4`` for Filestore OSDs because
``xattr`` limitations break support for long object names, which are needed for
RGW.
For more information, see `Filesystem Recommendations`_.
.. _Filesystem Recommendations: ../configuration/filesystem-recommendations
Insufficient RAM
----------------
We recommend a *minimum* of 4GB of RAM per OSD daemon and we suggest rounding
up from 6GB to 8GB. During normal operations, you may notice that ``ceph-osd``
processes use only a fraction of that amount. You might be tempted to use the
excess RAM for co-resident applications or to skimp on each node's memory
capacity. However, when OSDs experience recovery their memory utilization
spikes. If there is insufficient RAM available during recovery, OSD performance
will slow considerably and the daemons may even crash or be killed by the Linux
``OOM Killer``.
Blocked Requests or Slow Requests
---------------------------------
When a ``ceph-osd`` daemon is slow to respond to a request, the cluster log
receives messages reporting ops that are taking too long. The warning threshold
defaults to 30 seconds and is configurable via the ``osd_op_complaint_time``
setting.
Legacy versions of Ceph complain about ``old requests``::
osd.0 192.168.106.220:6800/18813 312 : [WRN] old request osd_op(client.5099.0:790 fatty_26485_object789 [write 0~4096] 2.5e54f643) v4 received at 2012-03-06 15:42:56.054801 currently waiting for sub ops
Newer versions of Ceph complain about ``slow requests``::
{date} {osd.num} [WRN] 1 slow requests, 1 included below; oldest blocked for > 30.005692 secs
{date} {osd.num} [WRN] slow request 30.005692 seconds old, received at {date-time}: osd_op(client.4240.0:8 benchmark_data_ceph-1_39426_object7 [write 0~4194304] 0.69848840) v4 currently waiting for subops from [610]
Possible causes include:
- A failing drive (check ``dmesg`` output)
- A bug in the kernel file system (check ``dmesg`` output)
- An overloaded cluster (check system load, iostat, etc.)
- A bug in the ``ceph-osd`` daemon.
- Suboptimal OSD shard configuration (on HDD based cluster with mClock scheduler)
Possible solutions:
- Remove VMs from Ceph hosts
- Upgrade kernel
- Upgrade Ceph
- Restart OSDs
- Replace failed or failing components
- Override OSD shard configuration (on HDD based cluster with mClock scheduler)
- See :ref:`mclock-tblshoot-hdd-shard-config` for resolution
Debugging Slow Requests
-----------------------
If you run ``ceph daemon osd.<id> dump_historic_ops`` or ``ceph daemon osd.<id>
dump_ops_in_flight``, you will see a set of operations and a list of events
each operation went through. These are briefly described below.
Events from the Messenger layer:
- ``header_read``: The time that the messenger first started reading the message off the wire.
- ``throttled``: The time that the messenger tried to acquire memory throttle space to read
the message into memory.
- ``all_read``: The time that the messenger finished reading the message off the wire.
- ``dispatched``: The time that the messenger gave the message to the OSD.
- ``initiated``: This is identical to ``header_read``. The existence of both is a
historical oddity.
Events from the OSD as it processes ops:
- ``queued_for_pg``: The op has been put into the queue for processing by its PG.
- ``reached_pg``: The PG has started performing the op.
- ``waiting for \*``: The op is waiting for some other work to complete before
it can proceed (for example, a new OSDMap; the scrubbing of its object
target; the completion of a PG's peering; all as specified in the message).
- ``started``: The op has been accepted as something the OSD should do and
is now being performed.
- ``waiting for subops from``: The op has been sent to replica OSDs.
Events from ``Filestore``:
- ``commit_queued_for_journal_write``: The op has been given to the FileStore.
- ``write_thread_in_journal_buffer``: The op is in the journal's buffer and is waiting
to be persisted (as the next disk write).
- ``journaled_completion_queued``: The op was journaled to disk and its callback
has been queued for invocation.
Events from the OSD after data has been given to underlying storage:
- ``op_commit``: The op has been committed (that is, written to journal) by the
primary OSD.
- ``op_applied``: The op has been `written with write()
<https://www.freebsd.org/cgi/man.cgi?write(2)>`_ to the backing FS (that is,
applied in memory but not flushed out to disk) on the primary.
- ``sub_op_applied``: ``op_applied``, but for a replica's "subop".
- ``sub_op_committed``: ``op_commit``, but for a replica's subop (only for EC pools).
- ``sub_op_commit_rec/sub_op_apply_rec from <X>``: The primary marks this when it
hears about the above, but for a particular replica (i.e. ``<X>``).
- ``commit_sent``: We sent a reply back to the client (or primary OSD, for sub ops).
Although some of these events may appear redundant, they cross important
boundaries in the internal code (such as passing data across locks into new
threads).
.. _mclock-tblshoot-hdd-shard-config:
Slow Requests or Slow Recovery With mClock Scheduler
----------------------------------------------------
.. note:: This troubleshooting is applicable only for HDD based clusters running
mClock scheduler and with the following OSD shard configuration:
``osd_op_num_shards_hdd`` = 5 and ``osd_op_num_threads_per_shard_hdd`` = 1.
Also, see :ref:`mclock-hdd-cfg` for details around the reason for the change
made to the default OSD HDD shard configuration for mClock.
On scaled HDD based clusters with mClock scheduler enabled and under multiple
OSD node failure condition, the following could be reported or observed:
- slow requests: This also manifests into degraded client I/O performance.
- slow background recoveries: Lower than expected recovery throughput.
**Troubleshooting Steps:**
#. Verify from OSD events that the slow requests are predominantly of type
``queued_for_pg``.
#. Verify if the reported recovery rate is significantly lower than the expected
rate considering the QoS allocations for background recovery service.
If either of the above steps are true, then the following resolution may be
applied. Note that this is disruptive as it involves OSD restarts. Run the
following commands to change the default OSD shard configuration for HDDs:
.. prompt:: bash
ceph config set osd osd_op_num_shards_hdd 1
ceph config set osd osd_op_num_threads_per_shard_hdd 5
The above configuration won't take effect immediately and would require a
restart of the OSDs in the environment. For this process to be least disruptive,
the OSDs may be restarted in a carefully staggered manner.
.. _rados_tshooting_flapping_osd:
Flapping OSDs
=============
"Flapping" is the term for the phenomenon of an OSD being repeatedly marked
``up`` and then ``down`` in rapid succession. This section explains how to
recognize flapping, and how to mitigate it.
When OSDs peer and check heartbeats, they use the cluster (back-end) network
when it is available. See `Monitor/OSD Interaction`_ for details.
The upstream Ceph community has traditionally recommended separate *public*
(front-end) and *private* (cluster / back-end / replication) networks. This
provides the following benefits:
#. Segregation of (1) heartbeat traffic and replication/recovery traffic
(private) from (2) traffic from clients and between OSDs and monitors
(public). This helps keep one stream of traffic from DoS-ing the other,
which could in turn result in a cascading failure.
#. Additional throughput for both public and private traffic.
In the past, when common networking technologies were measured in a range
encompassing 100Mb/s and 1Gb/s, this separation was often critical. But with
today's 10Gb/s, 40Gb/s, and 25/50/100Gb/s networks, the above capacity concerns
are often diminished or even obviated. For example, if your OSD nodes have two
network ports, dedicating one to the public and the other to the private
network means that you have no path redundancy. This degrades your ability to
endure network maintenance and network failures without significant cluster or
client impact. In situations like this, consider instead using both links for
only a public network: with bonding (LACP) or equal-cost routing (for example,
FRR) you reap the benefits of increased throughput headroom, fault tolerance,
and reduced OSD flapping.
When a private network (or even a single host link) fails or degrades while the
public network continues operating normally, OSDs may not handle this situation
well. In such situations, OSDs use the public network to report each other
``down`` to the monitors, while marking themselves ``up``. The monitors then
send out-- again on the public network--an updated cluster map with the
affected OSDs marked `down`. These OSDs reply to the monitors "I'm not dead
yet!", and the cycle repeats. We call this scenario 'flapping`, and it can be
difficult to isolate and remediate. Without a private network, this irksome
dynamic is avoided: OSDs are generally either ``up`` or ``down`` without
flapping.
If something does cause OSDs to 'flap' (repeatedly being marked ``down`` and
then ``up`` again), you can force the monitors to halt the flapping by
temporarily freezing their states:
.. prompt:: bash
ceph osd set noup # prevent OSDs from getting marked up
ceph osd set nodown # prevent OSDs from getting marked down
These flags are recorded in the osdmap:
.. prompt:: bash
ceph osd dump | grep flags
::
flags no-up,no-down
You can clear these flags with:
.. prompt:: bash
ceph osd unset noup
ceph osd unset nodown
Two other flags are available, ``noin`` and ``noout``, which prevent booting
OSDs from being marked ``in`` (allocated data) or protect OSDs from eventually
being marked ``out`` (regardless of the current value of
``mon_osd_down_out_interval``).
.. note:: ``noup``, ``noout``, and ``nodown`` are temporary in the sense that
after the flags are cleared, the action that they were blocking should be
possible shortly thereafter. But the ``noin`` flag prevents OSDs from being
marked ``in`` on boot, and any daemons that started while the flag was set
will remain that way.
.. note:: The causes and effects of flapping can be mitigated somewhat by
making careful adjustments to ``mon_osd_down_out_subtree_limit``,
``mon_osd_reporter_subtree_level``, and ``mon_osd_min_down_reporters``.
Derivation of optimal settings depends on cluster size, topology, and the
Ceph release in use. The interaction of all of these factors is subtle and
is beyond the scope of this document.
.. _iostat: https://en.wikipedia.org/wiki/Iostat
.. _Ceph Logging and Debugging: ../../configuration/ceph-conf#ceph-logging-and-debugging
.. _Logging and Debugging: ../log-and-debug
.. _Debugging and Logging: ../debug
.. _Monitor/OSD Interaction: ../../configuration/mon-osd-interaction
.. _Monitor Config Reference: ../../configuration/mon-config-ref
.. _monitoring your OSDs: ../../operations/monitoring-osd-pg
.. _monitoring OSDs: ../../operations/monitoring-osd-pg/#monitoring-osds
.. _subscribe to the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=subscribe+ceph-devel
.. _unsubscribe from the ceph-devel email list: mailto:majordomo@vger.kernel.org?body=unsubscribe+ceph-devel
.. _subscribe to the ceph-users email list: mailto:ceph-users-join@lists.ceph.com
.. _unsubscribe from the ceph-users email list: mailto:ceph-users-leave@lists.ceph.com
.. _OS recommendations: ../../../start/os-recommendations
.. _ceph-devel: ceph-devel@vger.kernel.org
|