====================
 Troubleshooting PGs
====================

Placement Groups Never Get Clean
================================

If, after you have created your cluster, any Placement Groups (PGs) remain in
the ``active`` status, the ``active+remapped`` status, or the
``active+degraded`` status and never achieve an ``active+clean`` status, you
likely have a problem with your configuration.

In such a situation, it may be necessary to review the settings in the `Pool,
PG and CRUSH Config Reference`_ and make appropriate adjustments.

As a general rule, run your cluster with more than one OSD and a pool size
greater than two object replicas.

.. _one-node-cluster:

One Node Cluster
----------------

Ceph no longer provides documentation for operating on a single node.  Systems
designed for distributed computing by definition do not run on a single node.
The mounting of client kernel modules on a single node that contains a Ceph
daemon may cause a deadlock due to issues with the Linux kernel itself (unless
VMs are used as clients). You can experiment with Ceph in a one-node
configuration, in spite of the limitations as described herein.

To create a cluster on a single node, you must change the
``osd_crush_chooseleaf_type`` setting from the default of ``1`` (meaning
``host`` or ``node``) to ``0`` (meaning ``osd``) in your Ceph configuration
file before you create your monitors and OSDs. This tells Ceph that the
replicas of a PG are permitted to reside on different OSDs on the same host.
If you are trying to set up a single-node cluster and
``osd_crush_chooseleaf_type`` is greater than ``0``, Ceph will attempt to place
each replica of a PG on an OSD in a different node, chassis, rack, row, or
datacenter, depending on the setting.

.. tip:: DO NOT mount kernel clients directly on the same node as your Ceph
   Storage Cluster. Kernel conflicts can arise. However, you can mount kernel
   clients within virtual machines (VMs) on a single node.

If you are creating OSDs using a single disk, you must manually create
directories for the data first.


Fewer OSDs than Replicas
------------------------

If two OSDs are in an ``up`` and ``in`` state, but the placement groups are not
in an ``active + clean`` state, you may have an ``osd_pool_default_size`` set
to greater than ``2``.

There are a few ways to address this situation. If you want to operate your
cluster in an ``active + degraded`` state with two replicas, you can set the
``osd_pool_default_min_size`` to ``2`` so that you can write objects in an
``active + degraded`` state. You may also set the ``osd_pool_default_size``
setting to ``2`` so that you have only two stored replicas (the original and
one replica). In such a case, the cluster should achieve an ``active + clean``
state.

.. note:: You can make the changes while the cluster is running. If you make
   the changes in your Ceph configuration file, you might need to restart your
   cluster.
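
The interplay between the replica count, the minimum replica count, and the
number of available OSDs can be sketched as follows. This is a deliberately
simplified model for illustration only: the function name ``pg_outlook`` is
hypothetical, and the model assumes one replica per OSD and ignores CRUSH rules
entirely.

.. code-block:: python

   def pg_outlook(size, min_size, osds_up_in):
       """Simplified model: one replica per OSD, CRUSH rules ignored."""
       # A PG cannot hold more replicas than there are usable OSDs.
       replicas = min(size, osds_up_in)
       return {
           "replicas": replicas,
           # active+clean requires every requested replica to exist.
           "can_be_clean": replicas >= size,
           # I/O is served only while at least min_size replicas exist.
           "accepts_io": replicas >= min_size,
       }

   # Two OSDs with size=3: writable when min_size=2, but never clean.
   print(pg_outlook(size=3, min_size=2, osds_up_in=2))
   # Two OSDs with size=2: the PG can reach active+clean.
   print(pg_outlook(size=2, min_size=1, osds_up_in=2))

Under these assumptions, the first call corresponds to running in an ``active +
degraded`` state and the second to reducing the pool size to ``2``.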


Pool Size = 1
-------------

If you have ``osd_pool_default_size`` set to ``1``, you will have only one copy
of the object. OSDs rely on other OSDs to tell them which objects they should
have. If one OSD has a copy of an object and there is no second copy, then
there is no second OSD to tell the first OSD that it should have that copy. For
each placement group mapped to the first OSD (see ``ceph pg dump``), you can
force the first OSD to notice the placement groups it needs by running a
command of the following form:

.. prompt:: bash

   ceph osd force-create-pg <pgid>


CRUSH Map Errors
----------------

If any placement groups in your cluster are unclean, then there might be errors
in your CRUSH map.


Stuck Placement Groups
======================

It is normal for placement groups to enter "degraded" or "peering" states after
a component failure. Normally, these states reflect the expected progression
through the failure recovery process. However, a placement group that stays in
one of these states for a long time might be an indication of a larger problem.
For this reason, the Ceph Monitors will warn when placement groups get "stuck"
in a non-optimal state. Specifically, we check for:

* ``inactive`` - The placement group has not been ``active`` for too long (that
  is, it hasn't been able to service read/write requests).

* ``unclean`` - The placement group has not been ``clean`` for too long (that
  is, it hasn't been able to completely recover from a previous failure).

* ``stale`` - The placement group status has not been updated by a
  ``ceph-osd``.  This indicates that all nodes storing this placement group may
  be ``down``.

List stuck placement groups by running one of the following commands:

.. prompt:: bash

   ceph pg dump_stuck stale
   ceph pg dump_stuck inactive
   ceph pg dump_stuck unclean

- Stuck ``stale`` placement groups usually indicate that key ``ceph-osd``
  daemons are not running.
- Stuck ``inactive`` placement groups usually indicate a peering problem (see
  :ref:`failures-osd-peering`).
- Stuck ``unclean`` placement groups usually indicate that something is
  preventing recovery from completing, possibly unfound objects (see
  :ref:`failures-osd-unfound`).



.. _failures-osd-peering:

Placement Group Down - Peering Failure
======================================

In certain cases, the ``ceph-osd`` `peering` process can run into problems,
which can prevent a PG from becoming active and usable. In such a case, running
the command ``ceph health detail`` will report something similar to the following:

.. prompt:: bash

   ceph health detail

::

    HEALTH_ERR 7 pgs degraded; 12 pgs down; 12 pgs peering; 1 pgs recovering; 6 pgs stuck unclean; 114/3300 degraded (3.455%); 1/3 in osds are down
    ...
    pg 0.5 is down+peering
    pg 1.4 is down+peering
    ...
    osd.1 is down since epoch 69, last address 192.168.106.220:6801/8651

Query the cluster to determine exactly why the PG is marked ``down`` by running a command of the following form:

.. prompt:: bash

   ceph pg 0.5 query

.. code-block:: javascript

 { "state": "down+peering",
   ...
   "recovery_state": [
        { "name": "Started\/Primary\/Peering\/GetInfo",
          "enter_time": "2012-03-06 14:40:16.169679",
          "requested_info_from": []},
        { "name": "Started\/Primary\/Peering",
          "enter_time": "2012-03-06 14:40:16.169659",
          "probing_osds": [
                0,
                1],
          "blocked": "peering is blocked due to down osds",
          "down_osds_we_would_probe": [
                1],
          "peering_blocked_by": [
                { "osd": 1,
                  "current_lost_at": 0,
                  "comment": "starting or marking this osd lost may let us proceed"}]},
        { "name": "Started",
          "enter_time": "2012-03-06 14:40:16.169513"}
    ]
 }

The ``recovery_state`` section tells us that peering is blocked due to down
``ceph-osd`` daemons, specifically ``osd.1``. In this case, we can start that
particular ``ceph-osd`` and recovery will proceed.
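
When many PGs are affected, it can be convenient to extract the blocking OSDs
from the query output with a script. The following is a minimal sketch that
assumes the JSON has the shape shown above. The helper name
``peering_blockers`` is hypothetical, and the sample is an abbreviated copy of
the output above rather than output from a live cluster.

.. code-block:: python

   import json

   def peering_blockers(pg_query):
       """Return the OSD ids listed under peering_blocked_by."""
       blockers = []
       for state in pg_query.get("recovery_state", []):
           for entry in state.get("peering_blocked_by", []):
               blockers.append(entry["osd"])
       return blockers

   # Abbreviated copy of the ``ceph pg 0.5 query`` output shown above.
   sample = json.loads("""
   { "state": "down+peering",
     "recovery_state": [
       { "name": "Started/Primary/Peering/GetInfo",
         "requested_info_from": []},
       { "name": "Started/Primary/Peering",
         "probing_osds": [0, 1],
         "blocked": "peering is blocked due to down osds",
         "down_osds_we_would_probe": [1],
         "peering_blocked_by": [
           { "osd": 1,
             "current_lost_at": 0,
             "comment": "starting or marking this osd lost may let us proceed"}]},
       { "name": "Started"}
     ]
   }
   """)

   print(peering_blockers(sample))

For the sample above, the function returns ``[1]``, which matches the
``osd.1`` identified in the ``recovery_state`` section.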

Alternatively, if there is a catastrophic failure of ``osd.1`` (for example, if
there has been a disk failure), the cluster can be informed that the OSD is
``lost`` and the cluster can be instructed that it must cope as best it can.

.. important:: Informing the cluster that an OSD has been lost is dangerous
   because the cluster cannot guarantee that the other copies of the data are
   consistent and up to date.

To report an OSD ``lost`` and to instruct Ceph to continue to attempt recovery
anyway, run a command of the following form:

.. prompt:: bash

   ceph osd lost 1

Recovery will proceed.


.. _failures-osd-unfound:

Unfound Objects
===============

Under certain combinations of failures, Ceph may complain about ``unfound``
objects, as in this example:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 1 pgs degraded; 78/3778 unfound (2.065%)
   pg 2.4 is active+degraded, 78 unfound

This means that the storage cluster knows that some objects (or newer copies of
existing objects) exist, but it hasn't found copies of them.  Here is an
example of how this might come about for a PG whose data is on two OSDs, which
we will call "1" and "2":

* 1 goes down
* 2 handles some writes, alone
* 1 comes up
* 1 and 2 re-peer, and the objects missing on 1 are queued for recovery.
* Before the new objects are copied, 2 goes down.

At this point, 1 knows that these objects exist, but there is no live
``ceph-osd`` that has a copy of the objects. In this case, IO to those objects
will block, and the cluster will hope that the failed node comes back soon.
This is assumed to be preferable to returning an IO error to the user.

.. note:: The situation described immediately above is one reason that setting
   ``size=2`` on a replicated pool and ``m=1`` on an erasure coded pool risks
   data loss.

Identify which objects are unfound by running a command of the following form:

.. prompt:: bash

   ceph pg 2.4 list_unfound [starting offset, in json]

.. code-block:: javascript

  {
    "num_missing": 1,
    "num_unfound": 1,
    "objects": [
        {
            "oid": {
                "oid": "object",
                "key": "",
                "snapid": -2,
                "hash": 2249616407,
                "max": 0,
                "pool": 2,
                "namespace": ""
            },
            "need": "43'251",
            "have": "0'0",
            "flags": "none",
            "clean_regions": "clean_offsets: [], clean_omap: 0, new_object: 1",
            "locations": [
                "0(3)",
                "4(2)"
            ]
        }
    ],
    "state": "NotRecovering",
    "available_might_have_unfound": true,
    "might_have_unfound": [
        {
            "osd": "2(4)",
            "status": "osd is down"
        }
    ],
    "more": false
  }

If there are too many objects to list in a single result, the ``more`` field
will be true and you can query for more.  (Eventually the command line tool
will hide this from you, but not yet.)

Now you can identify which OSDs have been probed or might contain data.

At the end of the listing (before ``more: false``), ``might_have_unfound`` is
provided when ``available_might_have_unfound`` is true.  This is equivalent to
the output of ``ceph pg #.# query``.  This eliminates the need to use ``query``
directly.  The ``might_have_unfound`` information given behaves the same way as
that ``query`` does, which is described below.  The only difference is that
OSDs that have the status of ``already probed`` are ignored.
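
As an illustration, the ``might_have_unfound`` section of the listing can be
summarized with a few lines of Python. This is only a sketch: the helper name
``osds_worth_checking`` is hypothetical, and the sample dictionary is an
abbreviated copy of the listing shown above.

.. code-block:: python

   def osds_worth_checking(list_unfound):
       """Map each candidate OSD to the status reported for it."""
       # The section is meaningful only when this flag is true.
       if not list_unfound.get("available_might_have_unfound", False):
           return {}
       return {
           entry["osd"]: entry["status"]
           for entry in list_unfound.get("might_have_unfound", [])
       }

   # Abbreviated copy of the ``list_unfound`` output shown above.
   sample = {
       "num_missing": 1,
       "num_unfound": 1,
       "state": "NotRecovering",
       "available_might_have_unfound": True,
       "might_have_unfound": [{"osd": "2(4)", "status": "osd is down"}],
       "more": False,
   }

   print(osds_worth_checking(sample))

For the sample above, the result is a single entry that maps ``2(4)`` to ``osd
is down``.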

Use of ``query``:

.. prompt:: bash

   ceph pg 2.4 query

.. code-block:: javascript

   "recovery_state": [
        { "name": "Started\/Primary\/Active",
          "enter_time": "2012-03-06 15:15:46.713212",
          "might_have_unfound": [
                { "osd": 1,
                  "status": "osd is down"}]},

In this case, the cluster knows that ``osd.1`` might have data, but it is
``down``. Here is the full range of possible states:

* already probed
* querying
* OSD is down
* not queried (yet)

Sometimes it simply takes some time for the cluster to query possible
locations.

It is possible that there are other locations where the object might exist that
are not listed. For example: if an OSD is stopped and taken out of the cluster
and then the cluster fully recovers, and then through a subsequent set of
failures the cluster ends up with an unfound object, the cluster will ignore
the removed OSD. (This scenario, however, is unlikely.)

If all possible locations have been queried and objects are still lost, you may
have to give up on the lost objects. This, again, is possible only when unusual
combinations of failures have occurred that allow the cluster to learn about
writes that were performed before the writes themselves have been recovered. To
mark the "unfound" objects as "lost", run a command of the following form:

.. prompt:: bash

   ceph pg 2.5 mark_unfound_lost revert|delete

Here the final argument (``revert|delete``) specifies how the cluster should
deal with lost objects.

The ``delete`` option will cause the cluster to forget about them entirely.

The ``revert`` option (which is not available for erasure coded pools) will
either roll back to a previous version of the object or (if it was a new
object) forget about the object entirely. Use ``revert`` with caution, as it
may confuse applications that expect the object to exist.

Homeless Placement Groups
=========================

It is possible that every OSD that has copies of a given placement group fails.
If this happens, then the subset of the object store that contains those
placement groups becomes unavailable and the monitor will receive no status
updates for those placement groups. The monitor marks as ``stale`` any
placement group whose primary OSD has failed. For example:

.. prompt:: bash

   ceph health

::

    HEALTH_WARN 24 pgs stale; 3/300 in osds are down

Identify which placement groups are ``stale`` and which were the last OSDs to
store the ``stale`` placement groups by running the following command:

.. prompt:: bash

   ceph health detail

::

   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   ...
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   ...
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   osd.11 is down since epoch 13, last address 192.168.106.220:6803/11539
   osd.12 is down since epoch 24, last address 192.168.106.220:6806/11861

This output indicates that placement group 2.5 (``pg 2.5``) was last managed by
``osd.0`` and ``osd.2``. Restart those OSDs to allow the cluster to recover
that placement group.
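
If many placement groups are stale, the OSDs to restart can be collected from
the ``ceph health detail`` text. The sketch below assumes that the lines have
exactly the form shown above; the format of this output can differ between
releases, and the names ``STALE_PG`` and ``stale_pgs`` are hypothetical.

.. code-block:: python

   import re

   # Matches lines such as:
   #   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   STALE_PG = re.compile(
       r"^pg (\S+) is stuck (\S*stale\S*), last acting \[([\d,]*)\]"
   )

   def stale_pgs(health_detail_text):
       """Return {pgid: [last acting OSD ids]} for stuck stale PGs."""
       result = {}
       for line in health_detail_text.splitlines():
           match = STALE_PG.match(line.strip())
           if match:
               pgid, _state, acting = match.groups()
               result[pgid] = [int(osd) for osd in acting.split(",") if osd]
       return result

   # Abbreviated copy of the output shown above.
   sample = """\
   HEALTH_WARN 24 pgs stale; 3/300 in osds are down
   pg 2.5 is stuck stale+active+remapped, last acting [2,0]
   osd.10 is down since epoch 23, last address 192.168.106.220:6800/11080
   """

   print(stale_pgs(sample))

For the sample above, the function reports that PG ``2.5`` was last served by
OSDs ``2`` and ``0``.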


Only a Few OSDs Receive Data
============================

If only a few of the nodes in the cluster are receiving data, check the number
of placement groups in the pool as instructed in the :ref:`Placement Groups
<rados_ops_pgs_get_pg_num>` documentation. Since placement groups get mapped to
OSDs in an operation involving dividing the number of placement groups in the
cluster by the number of OSDs in the cluster, a small number of placement
groups (the remainder, in this operation) are sometimes not distributed across
the cluster. In situations like this, create a pool with a placement group
count that is a multiple of the number of OSDs. See `Placement Groups`_ for
details. See the :ref:`Pool, PG, and CRUSH Config Reference
<rados_config_pool_pg_crush_ref>` for instructions on changing the default
values used to determine how many placement groups are assigned to each pool.


Can't Write Data
================

If the cluster is up, but some OSDs are down and you cannot write data, make
sure that you have the minimum number of OSDs running in the pool. If you don't
have the minimum number of OSDs running in the pool, Ceph will not allow you to
write data to it because there is no guarantee that Ceph can replicate your
data. See ``osd_pool_default_min_size`` in the :ref:`Pool, PG, and CRUSH
Config Reference <rados_config_pool_pg_crush_ref>` for details.


PGs Inconsistent
================

If the command ``ceph health detail`` returns an ``active + clean +
inconsistent`` state, this might indicate an error during scrubbing. Identify
the inconsistent placement group or placement groups by running the following
command:

.. prompt:: bash

    ceph health detail

::

    HEALTH_ERR 1 pgs inconsistent; 2 scrub errors
    pg 0.6 is active+clean+inconsistent, acting [0,1,2]
    2 scrub errors

Alternatively, run this command if you prefer to inspect the output in a
programmatic way:

.. prompt:: bash

   rados list-inconsistent-pg rbd

::

    ["0.6"]

A PG has only one consistent state, but in the worst case inconsistencies of
several different kinds might be found in more than one object. If an object
named ``foo`` in PG ``0.6`` is truncated, the output of ``rados
list-inconsistent-obj 0.6 --format=json-pretty`` will look something like this:

.. prompt:: bash

   rados list-inconsistent-obj 0.6 --format=json-pretty

.. code-block:: javascript

    {
        "epoch": 14,
        "inconsistents": [
            {
                "object": {
                    "name": "foo",
                    "nspace": "",
                    "locator": "",
                    "snap": "head",
                    "version": 1
                },
                "errors": [
                    "data_digest_mismatch",
                    "size_mismatch"
                ],
                "union_shard_errors": [
                    "data_digest_mismatch_info",
                    "size_mismatch_info"
                ],
                "selected_object_info": "0:602f83fe:::foo:head(16'1 client.4110.0:1 dirty|data_digest|omap_digest s 968 uv 1 dd e978e67f od ffffffff alloc_hint [0 0 0])",
                "shards": [
                    {
                        "osd": 0,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 1,
                        "errors": [],
                        "size": 968,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xe978e67f"
                    },
                    {
                        "osd": 2,
                        "errors": [
                            "data_digest_mismatch_info",
                            "size_mismatch_info"
                        ],
                        "size": 0,
                        "omap_digest": "0xffffffff",
                        "data_digest": "0xffffffff"
                    }
                ]
            }
        ]
    }

In this case, the output indicates the following:

* The only inconsistent object is named ``foo``, and its head has
  inconsistencies.
* The inconsistencies fall into two categories:

  #. ``errors``: these errors indicate inconsistencies between shards, without
     an indication of which shard(s) are bad. Check for the ``errors`` in the
     ``shards`` array, if available, to pinpoint the problem.

     * ``data_digest_mismatch``: the digest of the replica read from ``OSD.2``
       is different from the digests of the replica reads of ``OSD.0`` and
       ``OSD.1``
     * ``size_mismatch``: the size of the replica read from ``OSD.2`` is ``0``,
       but the size reported by ``OSD.0`` and ``OSD.1`` is ``968``.

  #. ``union_shard_errors``: the union of all shard-specific ``errors`` in the
     ``shards`` array. The ``errors`` are set for the shard with the problem.
     These errors include ``read_error`` and other similar errors. The
     ``errors`` ending in ``info`` indicate a comparison with
     ``selected_object_info``. Examine the ``shards`` array to determine
     which shard has which error or errors.

     * ``data_digest_mismatch_info``: the digest stored in the ``object-info``
       is not ``0xffffffff``, which is calculated from the shard read from
       ``OSD.2``
     * ``size_mismatch_info``: the size stored in the ``object-info`` is
       different from the size read from ``OSD.2``. The latter is ``0``.
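
The interpretation above can be automated by scanning the ``shards`` array for
non-empty ``errors`` lists. The following is a minimal sketch that assumes the
JSON layout shown above; the helper name ``shards_with_errors`` is
hypothetical, and the sample is an abbreviated copy of that output.

.. code-block:: python

   def shards_with_errors(report):
       """Return {object name: {osd id: [errors]}} for shards with errors."""
       findings = {}
       for item in report.get("inconsistents", []):
           bad = {
               shard["osd"]: shard["errors"]
               for shard in item.get("shards", [])
               if shard["errors"]
           }
           if bad:
               findings[item["object"]["name"]] = bad
       return findings

   # Abbreviated copy of the ``rados list-inconsistent-obj`` output above.
   sample = {
       "epoch": 14,
       "inconsistents": [
           {
               "object": {"name": "foo", "snap": "head", "version": 1},
               "errors": ["data_digest_mismatch", "size_mismatch"],
               "union_shard_errors": [
                   "data_digest_mismatch_info",
                   "size_mismatch_info",
               ],
               "shards": [
                   {"osd": 0, "errors": [], "size": 968},
                   {"osd": 1, "errors": [], "size": 968},
                   {
                       "osd": 2,
                       "errors": [
                           "data_digest_mismatch_info",
                           "size_mismatch_info",
                       ],
                       "size": 0,
                   },
               ],
           }
       ],
   }

   print(shards_with_errors(sample))

For the sample above, only the shard on ``OSD.2`` is reported, together with
its two shard-specific errors.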

.. warning:: If ``read_error`` is listed in a shard's ``errors`` attribute, the
   inconsistency is likely due to physical storage errors. In cases like this,
   check the storage used by that OSD. 
   
   Examine the output of ``dmesg`` and ``smartctl`` before attempting a drive
   repair.

To repair the inconsistent placement group, run a command of the following
form:

.. prompt:: bash

   ceph pg repair {placement-group-ID}

For example:

.. prompt:: bash #

   ceph pg repair 1.4
    
.. warning:: This command overwrites the "bad" copies with "authoritative"
   copies. In most cases, Ceph is able to choose authoritative copies from all
   the available replicas by using some predefined criteria. This, however,
   does not work in every case. For example, it might be the case that the
   stored data digest is missing, which means that the calculated digest is
   ignored when Ceph chooses the authoritative copies. Be aware of this, and
   use the above command with caution.

.. note:: PG IDs have the form ``N.xxxxx``, where ``N`` is the number of the
   pool that contains the PG. The command ``ceph osd lspools`` and the
   command ``ceph osd dump | grep pool`` return a list of pool numbers.


If you receive ``active + clean + inconsistent`` states periodically due to
clock skew, consider configuring the `NTP
<https://en.wikipedia.org/wiki/Network_Time_Protocol>`_ daemons on your monitor
hosts to act as peers. See `The Network Time Protocol <http://www.ntp.org>`_
and Ceph :ref:`Clock Settings <mon-config-ref-clock>` for more information.

More Information on PG Repair
-----------------------------
Ceph stores and updates the checksums of objects stored in the cluster. When a
scrub is performed on a PG, the lead OSD attempts to choose an authoritative
copy from among its replicas. Only one of the possible cases is consistent.
After performing a deep scrub, Ceph calculates the checksum of each object that
is read from disk and compares it to the checksum that was previously recorded.
If the current checksum and the previously recorded checksum do not match, that
mismatch is considered to be an inconsistency. In the case of replicated pools,
any mismatch between the checksum of any replica of an object and the checksum
of the authoritative copy means that there is an inconsistency. The discovery
of these inconsistencies causes a PG's state to be set to ``inconsistent``.

The ``pg repair`` command attempts to fix inconsistencies of various kinds. When 
``pg repair`` finds an inconsistent PG, it attempts to overwrite the digest of
the inconsistent copy with the digest of the authoritative copy. When ``pg
repair`` finds an inconsistent copy in a replicated pool, it marks the
inconsistent copy as missing. In the case of replicated pools, recovery is
beyond the scope of ``pg repair``.

In the case of erasure-coded and BlueStore pools, Ceph will automatically
perform repairs if ``osd_scrub_auto_repair`` (default ``false``) is set to
``true`` and if no more than ``osd_scrub_auto_repair_num_errors`` (default
``5``) errors are found.

The ``pg repair`` command will not solve every problem. Ceph does not
automatically repair PGs when they are found to contain inconsistencies.

The checksum of a RADOS object or an omap is not always available. Checksums
are calculated incrementally. If a replicated object is updated
non-sequentially, the write operation involved in the update changes the object
and invalidates its checksum. The whole object is not read while the checksum
is recalculated. The ``pg repair`` command is able to make repairs even when
checksums are not available to it, as in the case of Filestore. Users working
with replicated Filestore pools might prefer manual repair to ``ceph pg
repair``.

This material is relevant for Filestore, but not for BlueStore, which has its
own internal checksums. The matched-record checksum and the calculated checksum
cannot prove that any specific copy is in fact authoritative. If there is no
checksum available, ``pg repair`` favors the data on the primary, but this
might not be the uncorrupted replica. Because of this uncertainty, human
intervention is necessary when an inconsistency is discovered. This
intervention sometimes involves use of ``ceph-objectstore-tool``.

PG Repair Walkthrough
---------------------
https://ceph.io/geen-categorie/ceph-manually-repair-object/ - This page
contains a walkthrough of the repair of a PG. It is recommended reading if you
want to repair a PG but have never done so.

Erasure Coded PGs are not active+clean
======================================

If CRUSH fails to find enough OSDs to map to a PG, each missing OSD is shown as
``2147483647``, which is the value of ``ITEM_NONE`` and means ``no OSD found``.
For example::

     [2,1,6,0,5,8,2147483647,7,4]
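
A short script can locate the unmapped positions in such a set. This is a
sketch for illustration: the constant is named after ``ITEM_NONE`` as described
above, and the helper name ``unmapped_slots`` is hypothetical.

.. code-block:: python

   # 2147483647 equals 2**31 - 1, the value displayed for "no OSD found".
   ITEM_NONE = 2147483647

   def unmapped_slots(osd_set):
       """Return the positions for which CRUSH found no OSD."""
       return [i for i, osd in enumerate(osd_set) if osd == ITEM_NONE]

   print(unmapped_slots([2, 1, 6, 0, 5, 8, 2147483647, 7, 4]))

For the set shown above, the only unmapped position is index ``6`` (counting
from zero).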

Not enough OSDs
---------------

If the Ceph cluster has only eight OSDs and an erasure coded pool needs nine
OSDs, the cluster will show "Not enough OSDs". In this case, you can either
create another erasure coded pool that requires fewer OSDs, by running commands
of the following form:

.. prompt:: bash

     ceph osd erasure-code-profile set myprofile k=5 m=3
     ceph osd pool create erasurepool erasure myprofile

or add new OSDs, and the PG will automatically use them.

CRUSH constraints cannot be satisfied
-------------------------------------

If the cluster has enough OSDs, it is possible that the CRUSH rule is imposing
constraints that cannot be satisfied. If there are ten OSDs on two hosts and
the CRUSH rule requires that no two OSDs from the same host are used in the
same PG, the mapping may fail because only two OSDs will be found. Check the
constraint by displaying ("dumping") the rule, as shown here:

.. prompt:: bash

   ceph osd crush rule ls

::

    [
        "replicated_rule",
        "erasurepool"]
    $ ceph osd crush rule dump erasurepool
    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}


Resolve this problem by creating a new pool in which PGs are allowed to have
OSDs residing on the same host by running the following commands:

.. prompt:: bash

   ceph osd erasure-code-profile set myprofile crush-failure-domain=osd
   ceph osd pool create erasurepool erasure myprofile

CRUSH gives up too soon
-----------------------

If the Ceph cluster has just enough OSDs to map the PG (for instance a cluster
with a total of nine OSDs and an erasure coded pool that requires nine OSDs per
PG), it is possible that CRUSH gives up before finding a mapping. This problem
can be resolved by:

* lowering the erasure coded pool requirements to use fewer OSDs per PG (this
  requires the creation of another pool, because erasure code profiles cannot
  be modified dynamically).

* adding more OSDs to the cluster (this does not require the erasure coded pool
  to be modified, because it will become clean automatically)

* using a handmade CRUSH rule that tries more times to find a good mapping.
  This can be modified for an existing CRUSH rule by setting
  ``set_choose_tries`` to a value greater than the default.

First, verify the problem by using ``crushtool`` after extracting the crushmap
from the cluster. This ensures that your experiments do not modify the Ceph
cluster and that they operate only on local files:

.. prompt:: bash

   ceph osd crush rule dump erasurepool

::

    { "rule_id": 1,
      "rule_name": "erasurepool",
      "type": 3,
      "steps": [
            { "op": "take",
              "item": -1,
              "item_name": "default"},
            { "op": "chooseleaf_indep",
              "num": 0,
              "type": "host"},
            { "op": "emit"}]}
    $ ceph osd getcrushmap > crush.map
    got crush map from osdmap epoch 13
    $ crushtool -i crush.map --test --show-bad-mappings \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))
    bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
    bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
    bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]

Here, ``--num-rep`` is the number of OSDs that the erasure code CRUSH rule
needs, ``--rule`` is the value of the ``rule_id`` field that was displayed by
``ceph osd crush rule dump``. This test will attempt to map one million values
(in this example, the range defined by ``[--min-x,--max-x]``) and must display
at least one bad mapping. If this test outputs nothing, all mappings have been
successful and you can be assured that the problem with your cluster is not
caused by bad mappings.
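
When the test reports many bad mappings, the output can be summarized with a
script. The sketch below assumes that each line has exactly the form shown in
the example output above; the names ``BAD_MAPPING`` and ``parse_bad_mappings``
are hypothetical.

.. code-block:: python

   import re

   ITEM_NONE = 2147483647  # value displayed when no OSD was found

   # Matches lines such as:
   #   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
   BAD_MAPPING = re.compile(
       r"bad mapping rule (\d+) x (\d+) num_rep (\d+) result \[([\d,]+)\]"
   )

   def parse_bad_mappings(output):
       """Return a list of (x, result) pairs, one per bad mapping line."""
       found = []
       for line in output.splitlines():
           match = BAD_MAPPING.search(line)
           if match:
               x = int(match.group(2))
               result = [int(v) for v in match.group(4).split(",")]
               found.append((x, result))
       return found

   # Copy of the bad mapping lines shown above.
   sample = """\
   bad mapping rule 8 x 43 num_rep 9 result [3,2,7,1,2147483647,8,5,6,0]
   bad mapping rule 8 x 79 num_rep 9 result [6,0,2,1,4,7,2147483647,5,8]
   bad mapping rule 8 x 173 num_rep 9 result [0,4,6,8,2,1,3,7,2147483647]
   """

   for x, result in parse_bad_mappings(sample):
       print(x, result.index(ITEM_NONE))

Each printed line shows an input value ``x`` that could not be fully mapped,
followed by the position of the missing OSD in the result.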

Changing the value of set_choose_tries
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#. Decompile the CRUSH map to edit the CRUSH rule by running the following
   command:

   .. prompt:: bash

      crushtool --decompile crush.map > crush.txt

#. Add the following line to the rule::

      step set_choose_tries 100

   The relevant part of the ``crush.txt`` file will resemble this::

      rule erasurepool {
              id 1
              type erasure
              step set_chooseleaf_tries 5
              step set_choose_tries 100
              step take default
              step chooseleaf indep 0 type host
              step emit
      }

#. Recompile and retest the CRUSH rule:

   .. prompt:: bash

      crushtool --compile crush.txt -o better-crush.map

#. When all mappings succeed, display a histogram of the number of tries that
   were necessary to find all of the mappings by using the
   ``--show-choose-tries`` option of the ``crushtool`` command, as in the
   following example:

   .. prompt:: bash

      crushtool -i better-crush.map --test --show-bad-mappings \
       --show-choose-tries \
       --rule 1 \
       --num-rep 9 \
       --min-x 1 --max-x $((1024 * 1024))

   ::

    ...
    11:        42
    12:        44
    13:        54
    14:        45
    15:        35
    16:        34
    17:        30
    18:        25
    19:        19
    20:        22
    21:        20
    22:        17
    23:        13
    24:        16
    25:        13
    26:        11
    27:        11
    28:        13
    29:        11
    30:        10
    31:         6
    32:         5
    33:        10
    34:         3
    35:         7
    36:         5
    37:         2
    38:         5
    39:         5
    40:         2
    41:         5
    42:         4
    43:         1
    44:         2
    45:         2
    46:         3
    47:         1
    48:         0
    ...
    102:         0
    103:         1
    104:         0
    ...

   This output indicates that it took eleven tries to map forty-two PGs, twelve
   tries to map forty-four PGs, and so on. The highest number of tries is the minimum
   value of ``set_choose_tries`` that prevents bad mappings (for example,
   ``103`` in the above output, because it did not take more than 103 tries for
   any PG to be mapped).
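
The rule for reading the histogram can be expressed as a short script. This is
a sketch that assumes the ``tries: count`` layout shown above; the helper name
``min_choose_tries`` is hypothetical, and the sample is an excerpt of that
output.

.. code-block:: python

   def min_choose_tries(histogram_text):
       """Return the highest number of tries that has a non-zero count."""
       highest = 0
       for line in histogram_text.splitlines():
           tries, sep, count = line.partition(":")
           if not sep:
               continue  # skip lines such as "..." that are not histogram rows
           try:
               tries_n, count_n = int(tries), int(count)
           except ValueError:
               continue  # skip rows that do not hold two integers
           if count_n > 0:
               highest = max(highest, tries_n)
       return highest

   # Excerpt of the histogram shown above.
   sample = """\
   ...
   11:        42
   12:        44
   47:         1
   48:         0
   ...
   102:         0
   103:         1
   104:         0
   ...
   """

   print(min_choose_tries(sample))

For the excerpt above, the function returns ``103``, the value discussed in the
last step.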

.. _check: ../../operations/placement-groups#get-the-number-of-placement-groups
.. _Placement Groups: ../../operations/placement-groups
.. _Pool, PG and CRUSH Config Reference: ../../configuration/pool-pg-config-ref