| Commit message | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
We don't use min/max_size for anything any more. Switch to encoding them
universally as 1 and 100, and ignore them when we decode. We keep them
around for backward compatibility when (re)encoding without the
SERVER_QUINCY feature.
Drop the crush_rule_mask struct, since the only variable piece now is the
type.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
|
|
|
|
|
|
| |
Remove last traces of rulesets from crush. Mostly this means removing the
checks that rule masks don't overlap, since no such rules should exist
at this point, and the compiler no longer allows them.
Scrub all mention of 'ruleset' from other code.
Signed-off-by: Sage Weil <sage@newdream.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
1. The behavior of malloc(0) is implementation-defined:
https://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html
2. The following call path can invoke malloc(0):
# build_simple_crush_map
# |--> crush.add_bucket(0, 0, CRUSH_HASH_DEFAULT, root_type, 0, NULL, NULL, &rootid);
# |--> crush_bucket *b = crush_make_bucket(crush, alg, hash, type, size, items, weights);
# |--> crush_make_straw2_bucket(map, hash, type, size, items, weights);
# |--> bucket->h.items = malloc(sizeof(__s32)*size);
# |--> ceph_assert(b);
If the library returns NULL for malloc(0), ceph_assert(b) will trigger an abort.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
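A minimal sketch of one way to sidestep the ambiguity (not the actual Ceph fix); the helper name alloc_items is made up for illustration:

#include <stdlib.h>

/* Hypothetical guard: never pass 0 to malloc(), so the result does not
 * depend on whether the library returns NULL or a unique pointer for
 * zero-sized requests. */
static void *alloc_items(size_t count, size_t elem_size)
{
        size_t bytes = count * elem_size;

        if (bytes == 0)
                bytes = elem_size;      /* allocate at least one element */
        return malloc(bytes);
}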
|
|
|
|
|
|
|
| |
memset(bucket, 0, sizeof(*bucket)) already sets every bucket field to
zero.
Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
|
|
|
|
| |
Signed-off-by: Kefu Chai <kchai@redhat.com>
|
|
|
|
| |
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
|
|
|
|
|
|
|
| |
This naming was confusing! This is the number of positions we
have weight_sets for.
Signed-off-by: Sage Weil <sage@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We were decrementing size and then breaking out of the loop, which defeated
the ENOENT condition check. Fix by decrementing size only after we break out
of the loop and verify we found the item.
Fix a follow-on bug by avoiding realloc when we have 0 items left. This case
was never exercised before due to the ENOENT issue; now we return explicitly.
It's probably not necessary to realloc at all, since these are very small
arrays, but in any case leaving a single-item allocation there in place of
a 0-length allocation is fine. (And the 0-length allocation behavior of
realloc is undefined: it may return either a pointer or NULL.)
Signed-off-by: Sage Weil <sage@redhat.com>
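As a rough illustration of the corrected ordering (simplified stand-in types, not the verbatim builder.c code):

#include <errno.h>

/* Illustrative stand-in for the real crush bucket types. */
struct demo_bucket {
        int      *items;
        double   *weights;
        unsigned  size;
};

static int demo_remove_item(struct demo_bucket *b, int item)
{
        unsigned i;

        for (i = 0; i < b->size; i++)
                if (b->items[i] == item)
                        break;
        if (i == b->size)
                return -ENOENT;         /* verify we found the item first */

        for (; i + 1 < b->size; i++) {  /* close the gap */
                b->items[i] = b->items[i + 1];
                b->weights[i] = b->weights[i + 1];
        }
        b->size--;                      /* decrement only after the check */

        if (b->size == 0)
                return 0;               /* skip realloc: realloc(p, 0) is murky */
        /* shrinking the arrays here is optional; a checked realloc would do */
        return 0;
}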
|
|
|
|
|
|
|
|
|
|
| |
crush_choose_arg::ids array is encoded on the wire. int is not fixed
size -- use __s32 instead (crush_bucket::id is __s32).
This was introduced in commit dbe36e08be00 ("crush: compile/decompile
crush_choose_arg_map") under SERVER_LUMINOUS bit.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
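Illustrative-only sketch of the point about wire encoding, using stand-in typedefs rather than the real headers:

#include <stdint.h>

typedef int32_t  __s32;   /* stand-ins for the kernel-style fixed-width types */
typedef uint32_t __u32;

/* Anything encoded on the wire should use a fixed-width type so every
 * architecture agrees on its size; a plain 'int' carries no such guarantee. */
struct demo_choose_arg {
        __s32 *ids;        /* same width as crush_bucket::id */
        __u32  ids_size;
};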
|
|
|
|
|
|
| |
Fixes: http://tracker.ceph.com/issues/19883 (Partially)
Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
|
|
|
|
|
|
|
| |
The legacy behavior has chooseleaf_stable = 0 (the chooseleaf_stable
tunable was introduced in 2016).
Signed-off-by: Loic Dachary <loic@dachary.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
bucket_straw2_choose needs to use weights that may be different from
item_weights. For instance, to compensate for an uneven distribution
caused by a low number of values, or to fix the probability bias
introduced by conditional probabilities (see
http://tracker.ceph.com/issues/15653 for more information).
We introduce a weight_set for each straw2 bucket to set the desired
weight for a given item at a given position. The weight of a given item
when picking the first replica (first position) may be different from
its weight when picking the second replica (second position). For
instance, the weight matrix for a bucket containing items 3, 7 and 13
could be as follows:
           position 0   position 1
  item 3   0x10000      0x100000
  item 7   0x40000      0x10000
  item 13  0x40000      0x10000
When crush_do_rule picks the first of two replicas (position 0), items 7
and 13 are four times more likely to be chosen by bucket_straw2_choose
than item 3. When choosing the second replica (position 1), item 3 is
sixteen times more likely to be chosen than items 7 and 13.
By default the weight_set of each bucket exactly matches the content of
item_weights for each position, to ensure backward compatibility.
bucket_straw2_choose compares items by using their ids. The same ids are
also used to index buckets, and they must be unique. For each item in a
bucket, an array of ids can be provided for placement purposes and used
instead of the real ids. If no replacement ids are provided, the legacy
behavior is preserved.
Signed-off-by: Loic Dachary <loic@dachary.org>
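A hedged sketch of the position-dependent weight lookup described above; the demo_* types are stand-ins for the real crush structures, and the clamping of out-of-range positions is an assumption, not something stated in the commit:

#include <stdint.h>

typedef int32_t  __s32;
typedef uint32_t __u32;

struct demo_weight_set {
        __u32 *weights;         /* one weight per item in the bucket */
        __u32  size;
};

struct demo_choose_arg {
        struct demo_weight_set *weight_set;   /* one entry per position */
        __u32 weight_set_positions;
};

/* Pick the effective weight for item index 'i' when choosing the replica
 * at 'position'.  With no weight_set for that position, fall back to the
 * bucket's ordinary item_weights, preserving the legacy behavior. */
static __u32 effective_weight(const struct demo_choose_arg *arg,
                              const __u32 *item_weights,
                              unsigned i, unsigned position)
{
        if (arg && arg->weight_set && arg->weight_set_positions > 0) {
                if (position >= arg->weight_set_positions)
                        position = arg->weight_set_positions - 1;  /* clamp */
                return arg->weight_set[position].weights[i];
        }
        return item_weights[i];
}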
|
|\
| |
| |
| |
| | |
crush: bucket: crush_add_uniform_bucket_item should check for uniformity
Reviewed-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
When defining a bucket using alg CRUSH_BUCKET_UNIFORM, the item weight
set when the bucket is created should be used to ensure that, for every
new item added, the weight is the same.
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
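A minimal sketch of such a uniformity check, with made-up demo_* names rather than the real crush_add_uniform_bucket_item signature:

#include <errno.h>
#include <stdint.h>

typedef uint32_t __u32;

/* A uniform bucket has a single per-item weight, so adding an item with a
 * different weight should be rejected rather than silently skewing it. */
struct demo_uniform_bucket {
        __u32 item_weight;      /* weight shared by every item */
        __u32 size;
};

static int demo_add_uniform_item(struct demo_uniform_bucket *b, __u32 weight)
{
        if (b->size > 0 && weight != b->item_weight)
                return -EINVAL;         /* refuse non-uniform weights */
        b->item_weight = weight;        /* the first item fixes the weight */
        b->size++;                      /* (item id bookkeeping omitted) */
        return 0;
}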
|
|/
|
|
|
|
|
|
|
| |
In this commit we update the behavior of crush_create(...) to return a
crush_map initialized with the optimal configuration. We also provide a
function, set_legacy_crush_map(...), to allow users to go back to the
legacy configuration.
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
|
|
|
|
|
|
|
| |
This commit is just a cleanup to make the arguments of the methods
around crush_reweight coherent.
Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
|
|
|
|
|
|
|
|
| |
When a uniform bucket is created with zero size, creation fails with an
error because of the crush_multiplication_is_unsafe check. Since the goal
of this safeguard is to avoid overflow, a zero size should not make it fail.
Signed-off-by: Loic Dachary <ldachary@redhat.com>
|
|
|
|
|
|
|
|
|
| |
Then add it to the working state. It would be very nice if we didn't
have to take a lock to calculate a crush placement. By moving the
permutation array into the working data, we can treat the CRUSH map as
immutable.
Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
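Roughly, the shape of the idea (stand-in types; the real structures live in crush/crush.h and may differ):

#include <stdint.h>

typedef uint32_t __u32;

/* Scratch state such as the per-bucket permutation array lives in a
 * caller-owned work area, so concurrent mappings can share one read-only
 * crush_map without taking a lock. */
struct demo_work_bucket {
        __u32  perm_x;        /* input value for which 'perm' is valid */
        __u32  perm_n;        /* how much of the permutation is computed */
        __u32 *perm;          /* scratch permutation of the bucket's items */
};

struct demo_work {
        struct demo_work_bucket **work;   /* one entry per bucket in the map */
};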
|
|
|
|
|
|
|
|
|
|
| |
* crush: make sure we don't see the reference after the removal; this keeps
check_item_loc() happy, and move_bucket() uses check_item_loc() to verify
that the removed bucket disappears after the removal.
* test: also add unittest_crush_wrapper::CrushWrapper.insert_item
Fixes: http://tracker.ceph.com/issues/16525
Signed-off-by: Kefu Chai <kchai@redhat.com>
|
|
|
|
|
| |
Fixes: #14496
Signed-off-by: song baisen <song.baisen@zte.com.cn>
|
|
|
|
|
|
|
|
| |
Given that crush_{addition,multiplication}_is_unsafe() are only used
for compiling maps, they have no business in crush.c which is shared
with the kernel.
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This tunable is a bitmask indicating which bucket algorithms are allowed.
For now, the only purpose is to affect get_default_bucket_alg(), which
will try to pick a type that is supported, with a preference for straw2 or
straw buckets. In the future, we likely want something a bit more
sophisticated that reflects whether the bucket is expected to be fixed-size
or not.
We could have added a default_bucket_type field to accomplish this, but I
think that would prove to be too limiting in the future, and this
accomplishes the same thing.
Note that if the admin selects the 'hammer' tunables, that means clients
and servers will support straw2 and that will be the preferred choice. The
default is still bobtail tunables, which means new clusters created on
hammer will not use straw2 just yet.
Signed-off-by: Sage Weil <sage@redhat.com>
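A sketch of the preference order under such a bitmask; the exact fallback order in get_default_bucket_alg() may differ, and the (1 << alg) bit layout is my assumption here:

#include <stdint.h>

typedef uint32_t __u32;

/* Bucket algorithm ids as defined in crush.h. */
#define CRUSH_BUCKET_UNIFORM 1
#define CRUSH_BUCKET_LIST    2
#define CRUSH_BUCKET_TREE    3
#define CRUSH_BUCKET_STRAW   4
#define CRUSH_BUCKET_STRAW2  5

/* Prefer straw2, then straw, then whatever else the bitmask allows. */
static int demo_default_bucket_alg(__u32 allowed_bucket_algs)
{
        if (allowed_bucket_algs & (1u << CRUSH_BUCKET_STRAW2))
                return CRUSH_BUCKET_STRAW2;
        if (allowed_bucket_algs & (1u << CRUSH_BUCKET_STRAW))
                return CRUSH_BUCKET_STRAW;
        if (allowed_bucket_algs & (1u << CRUSH_BUCKET_TREE))
                return CRUSH_BUCKET_TREE;
        if (allowed_bucket_algs & (1u << CRUSH_BUCKET_LIST))
                return CRUSH_BUCKET_LIST;
        return CRUSH_BUCKET_UNIFORM;
}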
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is an improved straw bucket that correctly avoids any data movement
between items A and B when neither A nor B's weights are changed. Said
differently, if we adjust the weight of item C (including adding it anew
or removing it completely), we will only see inputs move to or from C,
never between other items in the bucket.
Notably, there is no intermediate scaling factor that needs to be
calculated. The mapping function is a simple function of the item weights.
Unfortunately, we need a natural log, which is expensive. We use a lookup
table here.
Signed-off-by: Sage Weil <sage@redhat.com>
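For intuition, a floating-point sketch of the straw2 draw (the real code uses a fixed-point crush_ln() lookup table and a deterministic hash of the input, item id and bucket id instead of rand()):

#include <math.h>
#include <stdlib.h>

/* Each item gets an independent "straw" ln(u)/w; the largest draw wins,
 * which selects item i with probability proportional to its weight. */
static int straw2_choose_demo(const double *weights, int n)
{
        int best = -1;
        double best_draw = -INFINITY;

        for (int i = 0; i < n; i++) {
                if (weights[i] <= 0)
                        continue;       /* zero weight never wins */
                /* u in (0,1); the real code derives this from a hash */
                double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);
                /* ln(u) <= 0, so a bigger weight pulls the draw toward 0 */
                double draw = log(u) / weights[i];
                if (draw > best_draw) {
                        best_draw = draw;
                        best = i;
                }
        }
        return best;
}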
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
crush/builder.c: In function 'crush_remove_list_bucket_item':
crush/builder.c:977:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (weight < bucket->h.weight)
^
crush/builder.c: In function 'crush_remove_tree_bucket_item':
crush/builder.c:1031:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare]
if (weight < bucket->h.weight)
^
Signed-off-by: Sage Weil <sage@redhat.com>
|
|\
| |
| |
| |
| | |
crush: fix straw bucket scaler bugs
Reviewed-by: Greg Farnum <gfarnum@redhat.com>
|
| |
| |
| |
| | |
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The straw bucket was originally tested with uniform weights and with a
few more complicated patterns, like a stair step (1,2,3,4,5,6,7,8,9). And
it worked!
However, it does not behave correctly with a pattern like
1, 2, 2, 3, 3, 4, 4
Strangely, it does behave correctly with
1, 1, 2, 2, 3, 3, 4, 4
and, more usefully, it does behave correctly with
1, 2, 2.001, 3, 3.001, 4, 4.001
That is, the logic that explicitly copes with weights that are duplicates
is broken.
The fix is to simply remove the special handling for duplicate weights --
it isn't necessary and doesn't work correctly anyway.
Add a test that compares the mapping result of [1, 2, 2, 3, 3, ...] with
[1, 2, 2.001, 3, 3.001, ...] and verifies that the difference is small.
With the fix, we get .00012, whereas the original implementation gets
.015.
Note that this changes the straw bucket scalar *precalculated* values that
are encoded with the map, and only when the admin opts into the new behavior.
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The presence of a 0-weight item in a straw bucket should have no effect
on the placement of other items. Add a test validating that, and fix
crush_calc_straw() to remove the distortion.
Note that this affects the *precalculation* of the straw bucket inputs and
does not affect the actual mapping process given a compiled or encoded
CRUSH map, and only when straw_calc_version == 1 (i.e., the admin opted in
to the new behavior).
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| | |
No change, yet.
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| | |
In particular, we will need it for crush_calc_straw().
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| | |
It doesn't do anything, yet.
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The crushtool --reweight function triggers a fresh calculation of bucket
weights so that they are always the sum of the item weights. In the
straw bucket case, the weights were updated but the corresponding straw
scalers were not being recalculated. The result is that there was no
effect on placement in adjusted buckets until the next time a bucket item's
weight was adjusted.
Backport: giant, firefly
Signed-off-by: Sage Weil <sage@redhat.com>
|
| |
| |
| |
| |
| |
| | |
These are compiled out by default.
Signed-off-by: Sage Weil <sage@redhat.com>
|
|\ \
| |/
|/|
| |
| |
| | |
crush: fix weight underflow issue
Reviewed-by: Loic Dachary <ldachary@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
It is possible to set a bucket weight that is not the sum of the item
weights if you manually modify/build the CRUSH map. Protect against any
underflow on the bucket weight when removing items.
Signed-off-by: Sage Weil <sage@redhat.com>
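The guard amounts to a clamped subtraction; a minimal sketch:

#include <stdint.h>

typedef uint32_t __u32;

/* Bucket weights are unsigned, so subtracting an item weight larger than
 * the (possibly hand-edited) bucket weight would wrap around; clamp to 0. */
static __u32 demo_subtract_weight(__u32 bucket_weight, __u32 item_weight)
{
        return (item_weight < bucket_weight) ? bucket_weight - item_weight : 0;
}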
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The nodes' weights in a tree bucket were incorrect when constructing the
bucket, and the tree bucket didn't store item ids in its items array, so
tree buckets would not work correctly. This patch fixes both bugs and adds
a simple test for the tree bucket.
Signed-off-by: Rongze Zhu <zrzhit@gmail.com>
|
|/
|
|
|
|
| |
This mirrors mapper.c.
Signed-off-by: Sage Weil <sage@redhat.com>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Originally, in the add_simple_ruleset function, the ruleset_id was not
reused but the rule_id was reused. So after some adds and removes of
rules, a newly created rule was likely to have ruleset != rule_id.
We don't want this to happen because we are trying to maintain the
constraint that ruleset == rule_id.
Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
|
|
|
|
| |
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current crush_choose_firstn code will re-use the same 'r' value for
the recursive call. That means that if we are hitting a collision or
rejection for some reason (say, an OSD that is marked out) and need to
retry, we will keep making the same (bad) choice in that recursive
selection.
Introduce a tunable that fixes that behavior by incorporating the parent
'r' value into the recursive starting point, so that a different path
will be taken in subsequent placement attempts.
Note that this was done from the get-go for the new crush_choose_indep
algorithm.
This was exposed by a user who was seeing PGs stuck in active+remapped
after reweight-by-utilization because the up set mapped to a single OSD.
Signed-off-by: Sage Weil <sage@inktank.com>
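A hedged sketch of how the recursive starting point can incorporate the parent's attempt number; the shift-by-(vary_r - 1) form reflects my reading of mapper.c rather than anything spelled out in this message:

/* Starting 'r' for the inner (leaf) descent, derived from the outer
 * descent's current attempt number so a retry explores a different path. */
static int demo_sub_r(int r, unsigned vary_r_tunable)
{
        if (vary_r_tunable)
                return r >> (vary_r_tunable - 1);
        return 0;       /* legacy behavior: always restart the inner descent at 0 */
}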
|
|
|
|
|
|
|
|
| |
Working around missing integer types is pretty easy. For example, the
__u32 family are Linux-specific types, and using these in Ceph
internally is fine because we can typedef them.
Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
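One possible portability shim along those lines (a sketch, not necessarily the header Ceph uses):

#include <stdint.h>

#ifndef __linux__
/* Provide the kernel-style fixed-width names on platforms without
 * <linux/types.h>. */
typedef int8_t   __s8;
typedef uint8_t  __u8;
typedef int16_t  __s16;
typedef uint16_t __u16;
typedef int32_t  __s32;
typedef uint32_t __u32;
typedef int64_t  __s64;
typedef uint64_t __u64;
#else
#include <linux/types.h>
#endif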
|
|
|
|
|
|
|
|
|
|
|
| |
Return -EEXIST on duplicate ID
BUG FIX: crush_add_bucket() mixes error returns and IDs
Add optional argument to return generated ID
Fixes: #6246
Signed-off-by: David Zafman <david.zafman@inktank.com>
Reviewed-by: Sage Weil <sage@inktank.com>
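A sketch of the resulting interface; parameter names are illustrative:

struct crush_map;
struct crush_bucket;

/* The return value carries only an error code (e.g. -EEXIST for a
 * duplicate id); the assigned bucket id comes back via 'idout'. */
int crush_add_bucket(struct crush_map *map,
                     int bucketno,               /* requested id, or 0 to auto-assign */
                     struct crush_bucket *bucket,
                     int *idout);                /* may be NULL if the caller doesn't care */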
|
|
|
|
| |
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|
|
|
|
|
|
|
|
|
|
|
| |
Fix handling of realloc(). If realloc() fails it returns NULL; assigning
the return value of realloc() directly to the pointer without checking
the result leads to a memory leak in the error case.
Use a temporary pointer to hold the result of realloc(). In the error case
return -ENOMEM; otherwise assign it to the pointer we wanted to realloc.
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
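The pattern in isolation, as a minimal sketch:

#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef int32_t __s32;

/* Keep the old pointer alive until realloc() is known to have succeeded.
 * new_size is assumed to be > 0 here. */
static int grow_items(__s32 **items, unsigned new_size)
{
        void *_realloc = realloc(*items, sizeof(__s32) * new_size);

        if (_realloc == NULL)
                return -ENOMEM;   /* *items is still valid and still owned */
        *items = _realloc;
        return 0;
}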
|
|
|
|
|
|
|
|
| |
Fix sizeof handling for realloc/malloc of bucket->h.items. items are
of type __s32 and not __u32 (sizeof gives the same size, but fix it
to represent the correct type).
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|
|
|
|
| |
Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider the CRUSH rule
step chooseleaf firstn 0 type <node_type>
This rule means that <n> replicas will be chosen in a manner such that
each chosen leaf's branch will contain a unique instance of <node_type>.
When an object is re-replicated after a leaf failure, if the CRUSH map uses
a chooseleaf rule the remapped replica ends up under the <node_type> bucket
that held the failed leaf. This causes uneven data distribution across the
storage cluster, to the point that when all the leaves but one fail under a
particular <node_type> bucket, that remaining leaf holds all the data from
its failed peers.
This behavior also limits the number of peers that can participate in the
re-replication of the data held by the failed leaf, which increases the
time required to re-replicate after a failure.
For a chooseleaf CRUSH rule, the tree descent has two steps: call them the
inner and outer descents.
If the tree descent down to <node_type> is the outer descent, and the descent
from <node_type> down to a leaf is the inner descent, the issue is that a
down leaf is detected on the inner descent, so only the inner descent is
retried.
In order to disperse re-replicated data as widely as possible across a
storage cluster after a failure, we want to retry the outer descent. So,
fix up crush_choose() to allow the inner descent to return immediately on
choosing a failed leaf. Wire this up as a new CRUSH tunable.
Note that after this change, for a chooseleaf rule, if the primary OSD
in a placement group has failed, choosing a replacement may result in
one of the other OSDs in the PG colliding with the new primary. That
OSD's data for the PG then needs to move as well. This seems unavoidable
but should be relatively rare.
Signed-off-by: Jim Schutt <jaschut@sandia.gov>
|
|
|
|
|
|
|
| |
Disallow setting OSD weights to a value over 10,000 and cap bucket weight
at 10,000,000 in a CRUSH map. Addresses issue #2101.
Signed-off-by: caleb miles <caleb.miles@inktank.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have three magic numbers in crush_choose that are now tunable. The
first two control the local retry behavior, including fallback to a
permutation. The last is the total map descent attempts.
We can avoid a drastic incompatibility by making these tunable and encoded
in the map. That means users can enable/disable local retry, for example,
without changing the code. As long as the clients understand the tunables,
they can be adjusted.
This patch doesn't address the compatibility and feature bit issue. We may
want to roll that into a larger revision with more drastic changes, once
we know what those changes will look like. However, a careful user can
use the new code and modify the behavior.
Signed-off-by: Sage Weil <sage@inktank.com>
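For reference, a sketch of the three knobs as map fields; the names follow the CRUSH tunables, and the 2/5/19 defaults are the historical legacy values as I understand them, not something stated in this commit:

#include <stdint.h>

typedef uint32_t __u32;

struct demo_tunables {
        __u32 choose_local_tries;          /* local retries before giving up       (legacy: 2)  */
        __u32 choose_local_fallback_tries; /* retries via the permutation fallback (legacy: 5)  */
        __u32 choose_total_tries;          /* total map descent attempts           (legacy: 19) */
};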
|
|
|
|
| |
Signed-off-by: Sage Weil <sage@inktank.com>
|