summaryrefslogtreecommitdiffstats
path: root/src/crush/builder.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* crush: eliminate min_size and max_sizeSage Weil2021-06-291-5/+4
| | | | | | | | | | | We don't use min/max_size for anything any more. Switch to encoding it universally as 1 and 100, and ignoring when we decode. We keep it around for backward compat when (re)encoding without the SERVER_QUINCY feature. Drop the crush_rule_mask struct since the only variable piece now is the type. Signed-off-by: Sage Weil <sage@newdream.net>
* crush: remove last traces of 'ruleset'Sage Weil2021-06-291-2/+2
| | | | | | | | | | Remove last traces of rulesets from crush. Mostly this means removing the checks that rule masks don't overlap, since no such rules should exist at this point, and the compiler no longer allows them. Scrub all mention of 'ruleset' from other code. Signed-off-by: Sage Weil <sage@newdream.net>
* crush: no need to call malloc is request is zeroChangcheng Liu2020-08-201-0/+12
| | | | | | | | | | | | | | | 1. malloc(0) depends on libary's behavior: https://pubs.opengroup.org/onlinepubs/009695399/functions/malloc.html 2. There's below frame path which call malloc(0) # build_simple_crush_map # |--> crush.add_bucket(0, 0, CRUSH_HASH_DEFAULT, root_type, 0, NULL, NULL, &rootid); # |--> crush_bucket *b = crush_make_bucket(crush, alg, hash, type, size, items, weights); # |--> crush_make_straw2_bucket(map, hash, type, size, items, weights); # |--> bucket->h.items = malloc(sizeof(__s32)*size); # |--> ceph_assert(b); If the library return NULL(malloc(0)), ceph_assert(b) will trigger abort Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
* crush: avoid redundant value assignementChangcheng Liu2020-08-201-6/+0
| | | | | | | memset(bucket, 0, sizeof(*bucket)) has set all bucket field to be zero Signed-off-by: Changcheng Liu <changcheng.liu@aliyun.com>
* common: fix typosKefu Chai2018-09-211-1/+1
| | | | Signed-off-by: Kefu Chai <kchai@redhat.com>
* crush: Quell unused variable warningsAdam C. Emerson2018-09-121-3/+3
| | | | Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
* crush: weight_set_size -> weight_set_positionsSage Weil2018-05-181-1/+1
| | | | | | | This naming was confusing! This is the number of positions we have weight_sets for. Signed-off-by: Sage Weil <sage@redhat.com>
* crush/builder: fix ENOENT when removing last bucket itemSage Weil2017-08-121-5/+13
| | | | | | | | | | | | | | | We were decrementing size and then breaking out ENOENT condition check. Fix by decrementing size only after we break out of the loop and verify we found the item. Fix a follow-on bug by avoiding realloc when we have 0 items left. This case was never exercised before due to the ENOENT issue; now we return explicitly. It's really not necessary to realloc at all, probably, since these are very small arrays, but in any case leaving a single item allocation there in place of a 0-length allocation is fine. (And the 0-length allocation behvaior on realloc is undefined.. may either return a pointer or NULL.) Signed-off-by: Sage Weil <sage@redhat.com>
* crush/CrushWrapper: crush_choose_arg::ids should be __s32Ilya Dryomov2017-06-301-3/+3
| | | | | | | | | | crush_choose_arg::ids array is encoded on the wire. int is not fixed size -- use __s32 instead (crush_bucket::id is __s32). This was introduced in commit dbe36e08be00 ("crush: compile/decompile crush_choose_arg_map") under SERVER_LUMINOUS bit. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* common: Remove redundant includesBrad Hubbard2017-05-131-4/+1
| | | | | | Fixes: http://tracker.ceph.com/issues/19883 (Partially) Signed-off-by: Brad Hubbard <bhubbard@redhat.com>
* crush: builder: legacy has chooseleaf_stable = 0Loic Dachary2017-04-201-0/+1
| | | | | | | The legacy behavior has chooseleaf_stable = 0 (it was introduced in 2016). Signed-off-by: Loic Dachary <loic@dachary.org>
* crush: implement weight and id overrides for straw2Loic Dachary2017-04-181-0/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | bucket_straw2_choose needs to use weights that may be different from weight_items. For instance to compensate for an uneven distribution caused by a low number of values. Or to fix the probability biais introduced by conditional probabilities (see http://tracker.ceph.com/issues/15653 for more information). We introduce a weight_set for each straw2 bucket to set the desired weight for a given item at a given position. The weight of a given item when picking the first replica (first position) may be different from the weight the second replica (second position). For instance the weight matrix for a given bucket containing items 3, 7 and 13 could be as follows: position 0 position 1 item 3 0x10000 0x100000 item 7 0x40000 0x10000 item 13 0x40000 0x10000 When crush_do_rule picks the first of two replicas (position 0), item 7, 3 are four times more likely to be choosen by bucket_straw2_choose than item 13. When choosing the second replica (position 1), item 3 is ten times more likely to be choosen than item 7, 13. By default the weight_set of each bucket exactly matches the content of item_weights for each position to ensure backward compatibility. bucket_straw2_choose compares items by using their id. The same ids are also used to index buckets and they must be unique. For each item in a bucket an array of ids can be provided for placement purposes and they are used instead of the ids. If no replacement ids are provided, the legacy behavior is preserved. Signed-off-by: Loic Dachary <loic@dachary.org>
* Merge pull request #14208 from dachary/wip-crush-uniform-weightsYuri Weinstein2017-04-101-1/+10
|\ | | | | | | | | crush: bucket: crush_add_uniform_bucket_item should check for uniformity Reviewed-by: Sage Weil <sage@redhat.com>
| * crush: bucket: crush_add_uniform_bucket_item should check for uniformitySahid Orentino Ferdjaoui2017-03-281-1/+10
| | | | | | | | | | | | | | | | When defining a bucket using alg CRUSH_BUCKET_UNIFORM, the item weight set during creation of bucket should be used to ensure that for every new items added, the weights are the same. Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
* | crush: builder: creating crush map with optimal configurationsSahid Orentino Ferdjaoui2017-03-281-11/+33
|/ | | | | | | | | In this commit we update the behavior of crush_create(...) to return a crush_map initialized with optimal configurations. We also provide a function set_legacy_crush_map(...) to allow users comming back to the legacy configurations. Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
* crush: builder: clean the arguments of crush_reweight* methodsSahid Orentino Ferdjaoui2017-03-231-22/+22
| | | | | | | This commit is just a cleanup to make the arguments of the method around crush_reweight all coherent. Signed-off-by: Sahid Orentino Ferdjaoui <sahid.ferdjaoui@redhat.com>
* crush: allow uniform buckets with no itemsLoic Dachary2017-02-191-0/+2
| | | | | | | | When a uniform bucket is created with zero size, it returns on error because crush_multiplication_is_unsafe returns false. Since the goal of this safegard is to avoid overflow, it should not fail. Signed-off-by: Loic Dachary <ldachary@redhat.com>
* crush: Remove mutable part of CRUSH mapAdam C. Emerson2016-11-091-82/+19
| | | | | | | | | Then add it to the working state. It would be very nice if we didn't have to take a lock to calculate a crush placement. By moving the permutation array into the working data, we can treat the CRUSH map as immutable. Signed-off-by: Adam C. Emerson <aemerson@redhat.com>
* crush: reset bucket->h.items[i] when removing tree itemKefu Chai2016-07-011-1/+2
| | | | | | | | | | * crush: so we don't see the reference after the removing, this keeps check_item_loc() happy, and move_bucket() use check_item_loc() to see if the removed bucket disappears after the removal. * test: also add unittest_crush_wrapper::CrushWrapper.insert_item Fixes: http://tracker.ceph.com/issues/16525 Signed-off-by: Kefu Chai <kchai@redhat.com>
* Crush: add some safe judgmentsongbaisen2016-01-261-1/+1
| | | | | Fixes: #14496 Signed-off-by: song baisen song.baisen@zte.com.cn
* crush: move safe arithmetic functions to buider.cIlya Dryomov2015-06-151-0/+20
| | | | | | | | Given that crush_{addition,multiplication}_is_unsafe() are only used for compiling maps, they have no business in crush.c which is shared with the kernel. Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* crush: add allowed_bucket_algs tunableSage Weil2015-01-221-0/+4
| | | | | | | | | | | | | | | | | | | | This tunable is a bitmask indicating which bucket algorithms are allowed. For now, the only purpose is to affect get_default_bucket_alg(), which will try to pick a type that is supported, with a preference for straw2 or straw buckets. In the future, we likely want something a bit more sophisticated that reflects whether the bucket is expected to be fixed-size or not. We could have added a default_bucket_type field to accomplish this, but I think that would prove to be too limiting in the future, and this accomplishes the same thing. Note that if the admin selects the 'hammer' tunables, that means clients and servers will support straw2 and that will be the preferred choice. The default is still bobtail tunables, which means new clusters created on hammer will not use straw2 just yet. Signed-off-by: Sage Weil <sage@redhat.com>
* crush: add a straw2 bucket typeSage Weil2015-01-221-0/+180
| | | | | | | | | | | | | | | | This is an improved straw bucket that correctly avoids any data movement between items A and B when neither A nor B's weights are changed. Said differently, if we adjust the weight of item C (including adding it anew or removing it completely), we will only see inputs move to or from C, never between other items in the bucket. Notably, there is not intermediate scaling factor that needs to be calculated. The mapping function is a simple function of the item weights. Unfortunately, we need a natural log, which is expensive. We use a lookup table here. Signed-off-by: Sage Weil <sage@redhat.com>
* crush/builder: fix warningsSage Weil2015-01-221-2/+2
| | | | | | | | | | | | | crush/builder.c: In function 'crush_remove_list_bucket_item': crush/builder.c:977:13: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (weight < bucket->h.weight) ^ crush/builder.c: In function 'crush_remove_tree_bucket_item': crush/builder.c:1031:14: warning: comparison between signed and unsigned integer expressions [-Wsign-compare] if (weight < bucket->h.weight) ^ Signed-off-by: Sage Weil <sage@redhat.com>
* Merge pull request #3057 from ceph/wip-crush-strawSage Weil2014-12-181-47/+128
|\ | | | | | | | | crush: fix straw bucket scaler bugs Reviewed-by: Greg Farnum <gfarnum@redhat.com>
| * crush/builder: a note about the original crush_calc_straw()Sage Weil2014-12-101-0/+27
| | | | | | | | Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: fix crush_calc_straw() scalers when there are duplicate weightsSage Weil2014-12-041-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The straw bucket was originally tested with uniform weights and with a few more complicated patterns, like a stair step (1,2,3,4,5,6,7,8,9). And it worked! However, it does not behave with a pattern like 1, 2, 2, 3, 3, 4, 4 Strangely, it does behave with 1, 1, 2, 2, 3, 3, 4, 4 and more usefully it does behave with 1, 2, 2.001, 3, 3.001, 4, 4.001 That is, the logic that explicitly copes with weights that are duplicates is broken. The fix is to simply remove the special handling for duplicate weights -- it isn't necessary and doesn't work correctly anyway. Add a test that compares the mapping result of [1, 2, 2, 3, 3, ...] with [1, 2, 2.001, 3, 3.001, ...] and verifies that the difference is small. With the fix, we get .00012, whereas the original implementation gets .015. Note that this changes the straw bucket scalar *precalculated* values that are encoded with the map, and only when the admin opts into the new behavior. Backport: giant, firefly Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: fix distortion of straw scalers by 0-weight itemsSage Weil2014-12-041-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The presence of a 0-weight item in a straw bucket should have no effect on the placement of other items. Add a test validating that and fix crush_calc_straw() to fix the distortion. Note that this effects the *precalculation* of the straw bucket inputs and does not effect the actually mapping process given a compiled or encoded CRUSH map, and only when straw_calc_version == 1 (i.e., the admin opted in to the new behavior). Backport: giant, firefly Signed-off-by: Sage Weil <sage@redhat.com>
| * crush/builder: break out new version 1 of crush_calc_strawSage Weil2014-12-031-31/+79
| | | | | | | | | | | | No change, yet. Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: pass crush_map * to various builder methodsSage Weil2014-12-031-20/+31
| | | | | | | | | | | | In particular, we will need it for crush_calc_straw(). Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: add straw_calc_version tunableSage Weil2014-12-031-0/+1
| | | | | | | | | | | | It doesn't do anything, yet. Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: recalculate straw scalers during a reweightSage Weil2014-12-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | The crushtool --reweight function triggers a fresh calculation of bucket weights so that they are always the sum of the item weights. In the straw bucket case, the weights were updated but the corresponding straw scalers were not being recalculated. The result is that there was not effect on placement in adjusted buckets until the next time a bucket item's weight was adjusted. Backport: giant, firefly Signed-off-by: Sage Weil <sage@redhat.com>
| * crush: add dprintk's for crush_calc_strawSage Weil2014-12-021-4/+6
| | | | | | | | | | | | These are compiled out by default. Signed-off-by: Sage Weil <sage@redhat.com>
* | Merge pull request #2986 from ceph/wip-9998Loic Dachary2014-12-141-4/+16
|\ \ | |/ |/| | | | | | | crush: fix weight underfloat issue Reviewed-by: Loic Dachary <ldachary@redhat.com>
| * crush/builder: prevent bucket weight underflow on item removalSage Weil2014-12-091-4/+16
| | | | | | | | | | | | | | | | It is possible to set a bucket weight that is not the sum of the item weights if you manually modify/build the CRUSH map. Protect against any underflow on the bucket weight when removing items. Signed-off-by: Sage Weil <sage@redhat.com>
* | crush: fix tree bucket functionsRongze Zhu2014-11-111-1/+25
| | | | | | | | | | | | | | | | | | There are incorrect nodes' weight in tree bucket when construct tree bucket. The tree bucket don't store item id in items array, so the tree bucket will not work correctly. The patch fix above bugs and add a simple test for tree bucket. Signed-off-by: Rongze Zhu <zrzhit@gmail.com>
* | crush/builder: replace printf with an empty dprintk macroSage Weil2014-11-111-9/+11
|/ | | | | | This mirrors mapper.c. Signed-off-by: Sage Weil <sage@redhat.com>
* CrushWrapper: pick a ruleset same as rule_idXiaoxi Chen2014-08-221-1/+3
| | | | | | | | | | | | Originally in the add_simple_ruleset funtion, the ruleset_id is not reused but rule_id is reused. So after some add/remove against rules, the newly created rule likely to have ruleset!=rule_id. We dont want this happen because we are trying to hold the constraint that ruleset == rule_id. Signed-off-by: Xiaoxi Chen <xiaoxi.chen@intel.com>
* crush/builder.c: remove some unreachable return statementsDanny Al-Gaaf2014-05-081-4/+0
| | | | Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* crush: add chooseleaf_vary_r tunableSage Weil2014-02-081-0/+1
| | | | | | | | | | | | | | | | | | | | The current crush_choose_firstn code will re-use the same 'r' value for the recursive call. That means that if we are hitting a collision or rejection for some reason (say, an OSD that is marked out) and need to retry, we will keep making the same (bad) choice in that recursive selection. Introduce a tunable that fixes that behavior by incorporating the parent 'r' value into the recursive starting point, so that a different path will be taken in subsequent placement attempts. Note that this was done from the get-go for the new crush_choose_indep algorithm. This was exposed by a user who was seeing PGs stuck in active+remapped after reweight-by-utilization because the up set mapped to a single OSD. Signed-off-by: Sage Weil <sage@inktank.com>
* inttypes: detect and define missing integer typesNoah Watkins2014-01-021-0/+2
| | | | | | | | Working around missing integer types is pretty easy. For example, the __u32 family are Linux-specific types, and using these in Ceph internally is fine because we can typedef them. Signed-off-by: Noah Watkins <noahwatkins@gmail.com>
* crushtool: do not dump core with non-unique bucket IDsDavid Zafman2013-09-101-3/+7
| | | | | | | | | | | Return -EEXIST on duplicate ID BUG FIX: crush_add_bucket() mixes error returns and IDs Add optional argument to return generated ID Fixes: #6246 Signed-off-by: David Zafman <david.zafman@inktank.com> Reviewed-by: Sage Weil <sage@inktank.com>
* crush/builder.c: reduce scope of oldsize in crush_add_bucket()Danny Al-Gaaf2013-05-101-2/+1
| | | | Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* crush/builder.c: fix realloc handlingDanny Al-Gaaf2013-02-281-31/+158
| | | | | | | | | | | Fix handling of realloc. If realloc() fails it returns NULL, assigning the return value of realloc() directly to the pointer without checking for the result will lead to a memory leak in error case. Use a temporary pointer to hold the result of realloc(). In error case return -ENOMEM, otherwise assign it to the pointer we want to realloc. Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* crush/builder.c: fix sizeof handling of bucket->h.itemsDanny Al-Gaaf2013-02-281-13/+13
| | | | | | | | Fix sizeof handling for realloc/malloc of bucket->h.items. items are of type __s32 and not __u32 (sizeof gives the same size, but fix it to represent the correct type). Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* crush/builder.c: reduce scope of oldsize in crush_add_rule()Danny Al-Gaaf2013-02-271-1/+1
| | | | Signed-off-by: Danny Al-Gaaf <danny.al-gaaf@bisect.de>
* crush: for chooseleaf rules, retry CRUSH map descent from root if leaf is failedJim Schutt2012-11-271-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Consider the CRUSH rule step chooseleaf firstn 0 type <node_type> This rule means that <n> replicas will be chosen in a manner such that each chosen leaf's branch will contain a unique instance of <node_type>. When an object is re-replicated after a leaf failure, if the CRUSH map uses a chooseleaf rule the remapped replica ends up under the <node_type> bucket that held the failed leaf. This causes uneven data distribution across the storage cluster, to the point that when all the leaves but one fail under a particular <node_type> bucket, that remaining leaf holds all the data from its failed peers. This behavior also limits the number of peers that can participate in the re-replication of the data held by the failed leaf, which increases the time required to re-replicate after a failure. For a chooseleaf CRUSH rule, the tree descent has two steps: call them the inner and outer descents. If the tree descent down to <node_type> is the outer descent, and the descent from <node_type> down to a leaf is the inner descent, the issue is that a down leaf is detected on the inner descent, so only the inner descent is retried. In order to disperse re-replicated data as widely as possible across a storage cluster after a failure, we want to retry the outer descent. So, fix up crush_choose() to allow the inner descent to return immediately on choosing a failed leaf. Wire this up as a new CRUSH tunable. Note that after this change, for a chooseleaf rule, if the primary OSD in a placement group has failed, choosing a replacement may result in one of the other OSDs in the PG colliding with the new primary. This requires that OSD's data for that PG to need moving as well. This seems unavoidable but should be relatively rare. Signed-off-by: Jim Schutt <jaschut@sandia.gov>
* crush: prevent integer overflow on reweightcaleb miles2012-07-121-60/+124
| | | | | | | Disallow setting OSD weights to a value over 10,000 and cap bucket weight at 10,000,000 in a CRUSH map. Addresses issue #2101. Signed-off-by: caleb miles <caleb.miles@inktank.com>
* crush: make magic numbers tunableSage Weil2012-06-081-0/+5
| | | | | | | | | | | | | | | | | | We have three magic numbers in crush_choose that are now tunable. The first two control the local retry behavior, including fallback to a permutation. The last is the total map descent attempts. We can avoid a drastic incompatibility by making these tunable and encoded in the map. That means users can enable/disable local retry, for example, without changing the code. As long as the clients understand the tunables, they can be adjusted. This patch doesn't address the compatibility and feature bit issue. We may want to roll that into a larger revision with more drastic changes, once we know what those changes will look like. However, a careful user can use the new code and modify the behavior. Signed-off-by: Sage Weil <sage@inktank.com>
* builder: make reweight helpers static, voidSage Weil2012-05-211-13/+12
| | | | Signed-off-by: Sage Weil <sage@inktank.com>