summaryrefslogtreecommitdiffstats
path: root/object.h (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'tb/path-filter-fix'Junio C Hamano2024-07-081-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The Bloom filter used for path limited history traversal was broken on systems whose "char" is unsigned; update the implementation and bump the format version to 2. * tb/path-filter-fix: bloom: introduce `deinit_bloom_filters()` commit-graph: reuse existing Bloom filters where possible object.h: fix mis-aligned flag bits table commit-graph: new Bloom filter version that fixes murmur3 commit-graph: unconditionally load Bloom filters bloom: prepare to discard incompatible Bloom filters bloom: annotate filters with hash version repo-settings: introduce commitgraph.changedPathsVersion t4216: test changed path filters with high bit paths t/helper/test-read-graph: implement `bloom-filters` mode bloom.h: make `load_bloom_filter_from_graph()` public t/helper/test-read-graph.c: extract `dump_graph_info()` gitformat-commit-graph: describe version 2 of BDAT commit-graph: ensure Bloom filters are read with consistent settings revision.c: consult Bloom filters for root commits t/t4216-log-bloom.sh: harden `test_bloom_filters_not_used()`
| * commit-graph: reuse existing Bloom filters where possibleTaylor Blau2024-06-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In an earlier commit, a bug was described where it's possible for Git to produce non-murmur3 hashes when the platform's "char" type is signed, and there are paths with characters whose highest bit is set (i.e. all characters >= 0x80). That patch allows the caller to control which version of Bloom filters are read and written. However, even on platforms with a signed "char" type, it is possible to reuse existing Bloom filters if and only if there are no changed paths in any commit's first parent tree-diff whose characters have their highest bit set. When this is the case, we can reuse the existing filter without having to compute a new one. This is done by marking trees which are known to have (or not have) any such paths. When a commit's root tree is verified to not have any such paths, we mark it as such and declare that the commit's Bloom filter is reusable. Note that this heuristic only goes in one direction. If neither a commit nor its first parent have any paths in their trees with non-ASCII characters, then we know for certain that a path with non-ASCII characters will not appear in a tree-diff against that commit's first parent. The reverse isn't necessarily true: just because the tree-diff doesn't contain any such paths does not imply that no such paths exist in either tree. So we end up recomputing some Bloom filters that we don't strictly have to (i.e. their bits are the same no matter which version of murmur3 we use). But culling these out is impossible, since we'd have to perform the full tree-diff, which is the same effort as computing the Bloom filter from scratch. But because we can cache our results in each tree's flag bits, we can often avoid recomputing many filters, thereby reducing the time it takes to run $ git commit-graph write --changed-paths --reachable when upgrading from v1 to v2 Bloom filters. To benchmark this, let's generate a commit-graph in linux.git with v1 changed-paths in generation order[^1]: $ git clone git@github.com:torvalds/linux.git $ cd linux $ git commit-graph write --reachable --changed-paths $ graph=".git/objects/info/commit-graph" $ mv $graph{,.bak} Then let's time how long it takes to go from v1 to v2 filters (with and without the upgrade path enabled), resetting the state of the commit-graph each time: $ git config commitGraph.changedPathsVersion 2 $ hyperfine -p 'cp -f $graph.bak $graph' -L v 0,1 \ 'GIT_TEST_UPGRADE_BLOOM_FILTERS={v} git.compile commit-graph write --reachable --changed-paths' On linux.git (where there aren't any non-ASCII paths), the timings indicate that this patch represents a speed-up over recomputing all Bloom filters from scratch: Benchmark 1: GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 124.873 s ± 0.316 s [User: 124.081 s, System: 0.643 s] Range (min … max): 124.621 s … 125.227 s 3 runs Benchmark 2: GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths Time (mean ± σ): 79.271 s ± 0.163 s [User: 74.611 s, System: 4.521 s] Range (min … max): 79.112 s … 79.437 s 3 runs Summary 'GIT_TEST_UPGRADE_BLOOM_FILTERS=1 git.compile commit-graph write --reachable --changed-paths' ran 1.58 ± 0.01 times faster than 'GIT_TEST_UPGRADE_BLOOM_FILTERS=0 git.compile commit-graph write --reachable --changed-paths' On git.git, we do have some non-ASCII paths, giving us a more modest improvement from 4.163 seconds to 3.348 seconds, for a 1.24x speed-up. On my machine, the stats for git.git are: - 8,285 Bloom filters computed from scratch - 10 Bloom filters generated as empty - 4 Bloom filters generated as truncated due to too many changed paths - 65,114 Bloom filters were reused when transitioning from v1 to v2. [^1]: Note that this is is important, since `--stdin-packs` or `--stdin-commits` orders commits in the commit-graph by their pack position (with `--stdin-packs`) or in the raw input (with `--stdin-commits`). Since we compute Bloom filters in the same order that commits appear in the graph, we must see a commit's (first) parent before we process the commit itself. This is only guaranteed to happen when sorting commits by their generation number. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * object.h: fix mis-aligned flag bits tableTaylor Blau2024-06-251-1/+1
| | | | | | | | | | | | | | Bit position 23 is one column too far to the left. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'ps/use-the-repository'Junio C Hamano2024-07-021-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A CPP macro USE_THE_REPOSITORY_VARIABLE is introduced to help transition the codebase to rely less on the availability of the singleton the_repository instance. * ps/use-the-repository: hex: guard declarations with `USE_THE_REPOSITORY_VARIABLE` t/helper: remove dependency on `the_repository` in "proc-receive" t/helper: fix segfault in "oid-array" command without repository t/helper: use correct object hash in partial-clone helper compat/fsmonitor: fix socket path in networked SHA256 repos replace-object: use hash algorithm from passed-in repository protocol-caps: use hash algorithm from passed-in repository oidset: pass hash algorithm when parsing file http-fetch: don't crash when parsing packfile without a repo hash-ll: merge with "hash.h" refs: avoid include cycle with "repository.h" global: introduce `USE_THE_REPOSITORY_VARIABLE` macro hash: require hash algorithm in `empty_tree_oid_hex()` hash: require hash algorithm in `is_empty_{blob,tree}_oid()` hash: make `is_null_oid()` independent of `the_repository` hash: convert `oidcmp()` and `oideq()` to compare whole hash global: ensure that object IDs are always padded hash: require hash algorithm in `oidread()` and `oidclr()` hash: require hash algorithm in `hasheq()`, `hashcmp()` and `hashclr()` hash: drop (mostly) unused `is_empty_{blob,tree}_sha1()` functions
| * | hash-ll: merge with "hash.h"Patrick Steinhardt2024-06-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "hash-ll.h" header was introduced via d1cbe1e6d8 (hash-ll.h: split out of hash.h to remove dependency on repository.h, 2023-04-22) to make explicit the split between hash-related functions that rely on the global `the_repository`, and those that don't. This split is no longer necessary now that we we have removed the reliance on `the_repository`. Merge "hash-ll.h" back into "hash.h". This causes some code units to not include "repository.h" anymore, which requires us to add some forward declarations. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | | Merge branch 'tb/pseudo-merge-reachability-bitmap'Junio C Hamano2024-06-251-1/+1
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pseudo-merge reachability bitmap to help more efficient storage of the reachability bitmap in a repository with too many refs has been added. * tb/pseudo-merge-reachability-bitmap: (26 commits) pack-bitmap.c: ensure pseudo-merge offset reads are bounded Documentation/technical/bitmap-format.txt: add missing position table t/perf: implement performance tests for pseudo-merge bitmaps pseudo-merge: implement support for finding existing merges ewah: `bitmap_equals_ewah()` pack-bitmap: extra trace2 information pack-bitmap.c: use pseudo-merges during traversal t/test-lib-functions.sh: support `--notick` in `test_commit_bulk()` pack-bitmap: implement test helpers for pseudo-merge ewah: implement `ewah_bitmap_popcount()` pseudo-merge: implement support for reading pseudo-merge commits pack-bitmap.c: read pseudo-merge extension pseudo-merge: scaffolding for reads pack-bitmap: extract `read_bitmap()` function pack-bitmap-write.c: write pseudo-merge table pseudo-merge: implement support for selecting pseudo-merge commits config: introduce `git_config_double()` pack-bitmap: make `bitmap_writer_push_bitmapped_commit()` public pack-bitmap: implement `bitmap_writer_has_bitmapped_object_id()` pack-bitmap-write: support storing pseudo-merge commits ...
| * | pack-bitmap-write: support storing pseudo-merge commitsTaylor Blau2024-05-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare to write pseudo-merge bitmaps by annotating individual bitmapped commits (which are represented by the `bitmapped_commit` structure) with an extra bit indicating whether or not they are a pseudo-merge. In subsequent commits, pseudo-merge bitmaps will be generated by allocating a fake commit node with parents covering the full set of commits represented by the pseudo-merge bitmap. These commits will be added to the set of "selected" commits as usual, but will be written specially instead of being included with the rest of the selected commits. Mechanically speaking, there are two parts of this change: - The bitmapped_commit struct gets a new bit indicating whether it is a pseudo-merge, or an ordinary commit selected for bitmaps. - A handful of changes to only write out the non-pseudo-merge commits when enumerating through the selected array (see the new `bitmap_writer_selected_nr()` function). Pseudo-merge commits appear after all non-pseudo-merge commits, so it is safe to enumerate through the selected array like so: for (i = 0; i < bitmap_writer_selected_nr(); i++) if (writer.selected[i].pseudo_merge) BUG("unexpected pseudo-merge"); without encountering the BUG(). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | | Merge branch 'ps/refs-without-the-repository-updates'Junio C Hamano2024-05-301-0/+35
|\ \ \ | |/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Further clean-up the refs subsystem to stop relying on the_repository, and instead use the repository associated to the ref_store object. * ps/refs-without-the-repository-updates: refs/packed: remove references to `the_hash_algo` refs/files: remove references to `the_hash_algo` refs/files: use correct repository refs: remove `dwim_log()` refs: drop `git_default_branch_name()` refs: pass repo when peeling objects refs: move object peeling into "object.c" refs: pass ref store when detecting dangling symrefs refs: convert iteration over replace refs to accept ref store refs: retrieve worktree ref stores via associated repository refs: refactor `resolve_gitlink_ref()` to accept a repository refs: pass repo when retrieving submodule ref store refs: track ref stores via strmap refs: implement releasing ref storages refs: rename `init_db` callback to avoid confusion refs: adjust names for `init` and `init_db` callbacks
| * | refs: pass repo when peeling objectsPatrick Steinhardt2024-05-171-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both `peel_object()` and `peel_iterated_oid()` implicitly rely on `the_repository` to look up objects. Despite the fact that we want to get rid of `the_repository`, it also leads to some restrictions in our ref iterators when trying to retrieve the peeled value for a repository other than `the_repository`. Refactor these functions such that both take a repository as argument and remove the now-unnecessary restrictions. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * | refs: move object peeling into "object.c"Patrick Steinhardt2024-05-171-0/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Peeling an object has nothing to do with refs, but we still have the code in "refs.c". Move it over into "object.c", which is a more natural place to put it. Ideally, we'd also move `peel_iterated_oid()` over into "object.c". But this function is tied to the refs interfaces because it uses a global ref iterator variable to optimize peeling when the iterator already has the peeled object ID readily available. Signed-off-by: Patrick Steinhardt <ps@pks.im> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | | object.h: add flags allocated by pack-bitmap.hTaylor Blau2024-05-151-0/+1
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 7cc8f971085 (pack-objects: implement bitmap writing, 2013-12-21) the NEEDS_BITMAP flag was introduced into pack-bitmap.h, but no object flags allocation table existed at the time. In 208acbfb82f (object.h: centralize object flag allocation, 2014-03-25) when that table was first introduced, we never added the flags from 7cc8f971085, which has remained the case since. Rectify this by including the flag bit used by pack-bitmap.h into the centralized table in object.h. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'eb/hash-transition'Junio C Hamano2024-03-281-0/+18
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Work to support a repository that work with both SHA-1 and SHA-256 hash algorithms has started. * eb/hash-transition: (30 commits) t1016-compatObjectFormat: add tests to verify the conversion between objects t1006: test oid compatibility with cat-file t1006: rename sha1 to oid test-lib: compute the compatibility hash so tests may use it builtin/ls-tree: let the oid determine the output algorithm object-file: handle compat objects in check_object_signature tree-walk: init_tree_desc take an oid to get the hash algorithm builtin/cat-file: let the oid determine the output algorithm rev-parse: add an --output-object-format parameter repository: implement extensions.compatObjectFormat object-file: update object_info_extended to reencode objects object-file-convert: convert commits that embed signed tags object-file-convert: convert commit objects when writing object-file-convert: don't leak when converting tag objects object-file-convert: convert tag objects when writing object-file-convert: add a function to convert trees between algorithms object: factor out parse_mode out of fast-import and tree-walk into in object.h cache: add a function to read an OID of a specific algorithm tag: sign both hashes commit: export add_header_signature to support handling signatures on tags ...
| * | object: factor out parse_mode out of fast-import and tree-walk into in object.hEric W. Biederman2023-10-021-0/+18
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | builtin/fast-import.c and tree-walk.c have almost identical version of get_mode. The two functions started out the same but have diverged slightly. The version in fast-import changed mode to a uint16_t to save memory. The version in tree-walk started erroring if no mode was present. As far as I can tell both of these changes are valid for both of the callers, so add the both changes and place the common parsing helper in object.h Rename the helper from get_mode to parse_mode so it does not conflict with another helper named get_mode in diff-no-index.c This will be used shortly in a new helper decode_tree_entry_raw which is used to compute cmpatibility objects as part of the sha256 transition. Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* / upload-pack: free tree buffers after parsingJeff King2024-02-281-0/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a client sends us a "want" or "have" line, we call parse_object() to get an object struct. If the object is a tree, then the parsed state means that tree->buffer points to the uncompressed contents of the tree. But we don't really care about it. We only really need to parse commits and tags; for trees and blobs, the important output is just a "struct object" with the correct type. But much worse, we do not ever free that tree buffer. It's not leaked in the traditional sense, in that we still have a pointer to it from the global object hash. But if the client requests many trees, we'll hold all of their contents in memory at the same time. Nobody really noticed because it's rare for clients to directly request a tree. It might happen for a lightweight tag pointing straight at a tree, or it might happen for a "tree:depth" partial clone filling in missing trees. But it's also possible for a malicious client to request a lot of trees, causing upload-pack's memory to balloon. For example, without this patch, requesting every tree in git.git like: pktline() { local msg="$*" printf "%04x%s\n" $((1+4+${#msg})) "$msg" } want_trees() { pktline command=fetch printf 0001 git cat-file --batch-all-objects --batch-check='%(objectname) %(objecttype)' | while read oid type; do test "$type" = "tree" || continue pktline want $oid done pktline done printf 0000 } want_trees | GIT_PROTOCOL=version=2 valgrind --tool=massif ./git upload-pack . >/dev/null shows a peak heap usage of ~3.7GB. Which is just about the sum of the sizes of all of the uncompressed trees. For linux.git, it's closer to 17GB. So the obvious thing to do is to call free_tree_buffer() after we realize that we've parsed a tree. We know that upload-pack won't need it later. But let's push the logic into parse_object_with_flags(), telling it to discard the tree buffer immediately. There are two reasons for this. One, all of the relevant call-sites already call the with_options variant to pass the SKIP_HASH flag. So it actually ends up as less code than manually free-ing in each spot. And two, it enables an extra optimization that I'll discuss below. I've touched all of the sites that currently use SKIP_HASH in upload-pack. That drops the peak heap of the upload-pack invocation above from 3.7GB to ~24MB. I've also modified the caller in get_reference(); a partial clone benefits from its use in pack-objects for the reasons given in 0bc2557951 (upload-pack: skip parse-object re-hashing of "want" objects, 2022-09-06), where we were measuring blob requests. But note that the results of get_reference() are used for traversing, as well; so we really would _eventually_ use the tree contents. That makes this at first glance a space/time tradeoff: we won't hold all of the trees in memory at once, but we'll have to reload them each when it comes time to traverse. And here's where our extra optimization comes in. If the caller is not going to immediately look at the tree contents, and it doesn't care about checking the hash, then parse_object() can simply skip loading the tree entirely, just like we do for blobs! And now it's not a space/time tradeoff in get_reference() anymore. It's just a lazy-load: we're delaying reading the tree contents until it's time to actually traverse them one by one. And of course for upload-pack, this optimization means we never load the trees at all, saving lots of CPU time. Timing the "every tree from git.git" request above shows upload-pack dropping from 32 seconds of CPU to 19 (the remainder is mostly due to pack-objects actually sending the pack; timing just the upload-pack portion shows we go from 13s to ~0.28s). These are all highly gamed numbers, of course. For real-world partial-clone requests we're saving only a small bit of time in practice. But it does help harden upload-pack against malicious denial-of-service attacks. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'tb/pack-bitmap-traversal-with-boundary'Junio C Hamano2023-06-231-0/+2
|\ | | | | | | | | | | | | | | | | | | | | | | The object traversal using reachability bitmap done by "pack-object" has been tweaked to take advantage of the fact that using "boundary" commits as representative of all the uninteresting ones can save quite a lot of object enumeration. * tb/pack-bitmap-traversal-with-boundary: pack-bitmap.c: use commit boundary during bitmap traversal pack-bitmap.c: extract `fill_in_bitmap()` object: add object_array initializer helper function
| * object: add object_array initializer helper functionTaylor Blau2023-05-081-0/+2
| | | | | | | | | | | | | | | | | | | | The object_array API has an OBJECT_ARRAY_INIT macro, but lacks a function to initialize an object_array at a given location in memory. Introduce `object_array_init()` to implement such a function. Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | hash-ll.h: split out of hash.h to remove dependency on repository.hElijah Newren2023-04-241-1/+2
|/ | | | | | | | | | | | | | | | | | | hash.h depends upon and includes repository.h, due to the definition and use of the_hash_algo (defined as the_repository->hash_algo). However, most headers trying to include hash.h are only interested in the layout of the structs like object_id. Move the parts of hash.h that do not depend upon repository.h into a new file hash-ll.h (the "low level" parts of hash.h), and adjust other files to use this new header where the convenience inline functions aren't needed. This allows hash.h and object.h to be fairly small, minimal headers. It also exposes a lot of hidden dependencies on both path.h (which was brought in by repository.h) and repository.h (which was previously implicitly brought in by object.h), so also adjust other files to be more explicit about what they depend upon. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object.h: move some inline functions and defines from cache.hElijah Newren2023-04-111-0/+44
| | | | | | | | | | | | | The object_type() inline function is very tied to the enum object_type declaration within object.h, and just seemed to make more sense to live there. That makes S_ISGITLINK and some other defines make sense to go with it, as well as the create_ce_mode() and canon_mode() inline functions. Move all these inline functions and defines from cache.h to object.h. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Calvin Wan <calvinwan@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object.h: stop depending on cache.h; make cache.h depend on object.hElijah Newren2023-02-241-1/+21
| | | | | | | | | | | | | | | | | Things should be able to depend on object.h without pulling in all of cache.h. Move an enum to allow this. Note that a couple files previously depended on things brought in through cache.h indirectly (revision.h -> commit.h -> object.h -> cache.h). As such, this change requires making existing dependencies more explicit in half a dozen files. The inclusion of strbuf.h in some headers if of particular note: these headers directly embedded a strbuf in some new structs, meaning they should have been including strbuf.h all along but were indirectly getting the necessary definitions. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* parse_object(): allow skipping hash checkJeff King2022-09-071-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The parse_object() function checks the object hash of any object it parses. This is a nice feature, as it means we may catch bit corruption during normal use, rather than waiting for specific fsck operations. But it also can be slow. It's particularly noticeable for blobs, where except for the hash check, we could return without loading the object contents at all. Now one may wonder what is the point of calling parse_object() on a blob in the first place then, but usually it's not intentional: we were fed an oid from somewhere, don't know the type, and want an object struct. For commits and trees, the parsing is usually helpful; we're about to look at the contents anyway. But this is less true for blobs, where we may be collecting them as part of a reachability traversal, etc, and don't actually care what's in them. And blobs, of course, tend to be larger. We don't want to just throw out the hash-checks for blobs, though. We do depend on them in some circumstances (e.g., rev-list --verify-objects uses parse_object() to check them). It's only the callers that know how they're going to use the result. And so we can help them by providing a special flag to skip the hash check. We could just apply this to blobs, as they're going to be the main source of performance improvement. But if a caller doesn't care about checking the hash, we might as well skip it for other object types, too. Even though we can't avoid reading the object contents, we can still skip the actual hash computation. If this seems like it is making Git a little bit less safe against corruption, it may be. But it's part of a series of tradeoffs we're already making. For instance, "rev-list --objects" does not open the contents of blobs it prints. And when a commit graph is present, we skip opening most commits entirely. The important thing will be to use this flag in cases where it's safe to skip the check. For instance, when serving a pack for a fetch, we know the client will fully index the objects and do a connectivity check itself. There's little to be gained from the server side re-hashing a blob itself. And indeed, most of the time we don't! The revision machinery won't open up a blob reached by traversal, but only one requested directly with a "want" line. So applied properly, this new feature shouldn't make anything less safe in practice. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* revision: allow --ancestry-path to take an argumentElijah Newren2022-08-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have long allowed users to run e.g. git log --ancestry-path master..seen which shows all commits which satisfy all three of these criteria: * are an ancestor of seen * are not an ancestor of master * have master as an ancestor This commit allows another variant: git log --ancestry-path=$TOPIC master..seen which shows all commits which satisfy all of these criteria: * are an ancestor of seen * are not an ancestor of master * have $TOPIC in their ancestry-path that last bullet can be defined as commits meeting any of these criteria: * are an ancestor of $TOPIC * have $TOPIC as an ancestor * are $TOPIC This also allows multiple --ancestry-path arguments, which can be used to find commits with any of the given topics in their ancestry path. Signed-off-by: Elijah Newren <newren@gmail.com> Acked-by: Derrick Stolee <derrickstolee@github.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* reflog: libify delete reflog function and helpersJohn Cai2022-03-031-1/+1
| | | | | | | | | | | | | | | | Currently stash shells out to reflog in order to delete refs. In an effort to reduce how much we shell out to a subprocess, libify the functionality that stash needs into reflog.c. Add a reflog_delete function that is pretty much the logic in the while loop in builtin/reflog.c cmd_reflog_delete(). This is a function that builtin/reflog.c and builtin/stash.c can both call. Also move functions needed by reflog_delete and export them. Helped-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: John Cai <johncai86@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* *.[ch] *_INIT macros: use { 0 } for a "zero out" idiomÆvar Arnfjörð Bjarmason2021-09-271-1/+1
| | | | | | | | | | | | | | | | In C it isn't required to specify that all members of a struct are zero'd out to 0, NULL or '\0', just providing a "{ 0 }" will accomplish that. Let's also change code that provided N zero'd fields to just provide one, and change e.g. "{ NULL }" to "{ 0 }" for consistency. I.e. even if the first member is a pointer let's use "0" instead of "NULL". The point of using "0" consistently is to pick one, and to not have the reader wonder why we're not using the same pattern everywhere. Signed-off-by: Ævar Arnfjörð Bjarmason <avarab@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* builtin/pack-objects.c: remove duplicate hash lookupTaylor Blau2021-08-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the original code from 08cdfb1337 (pack-objects --keep-unreachable, 2007-09-16), we add each object to the packing list with type `obj->type`, where `obj` comes from `lookup_unknown_object()`. Unless we had already looked up and parsed the object, this will be `OBJ_NONE`. That's fine, since oe_set_type() sets the type_valid bit to '0', and we determine the real type later on. So the only thing we need from the object lookup is access to the `flags` field so that we can mark that we've added the object with `OBJECT_ADDED` to avoid adding it again (we can just pass `OBJ_NONE` directly instead of grabbing it from the object). But add_object_entry() already rejects duplicates! This has been the behavior since 7a979d99ba (Thin pack - create packfile with missing delta base., 2006-02-19), but 08cdfb1337 didn't take advantage of it. Moreover, to do the OBJECT_ADDED check, we have to do a hash lookup in `obj_hash`. So we can drop the lookup_unknown_object() call completely, *and* the OBJECT_ADDED flag, too, since the spot we're touching here is the only location that checks it. In the end, we perform the same number of hash lookups, but with the added bonus that we don't waste memory allocating an OBJ_NONE object (if we were traversing, we'd need it eventually, but the whole point of this code path is not to traverse). Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object.h: add lookup_object_by_type() functionJeff King2021-06-291-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | In some cases it's useful for efficiency reasons to get the type of an object before deciding whether to parse it, but we still want an object struct. E.g., in reachable.c, bitmaps give us the type, but we just want to mark flags on each object. Likewise, we may loop over every object and only parse tags in order to peel them; checking the type first lets us avoid parsing the non-tags. But our lookup_blob(), etc, functions make getting an object struct annoying: we have to call the right function for every type. And we cannot just use the generic lookup_object(), because it only returns an already-seen object; it won't allocate a new object struct. Let's provide a function that dispatches to the correct lookup_* function based on a run-time type. In fact, reachable.c already has such a helper, so we'll just make that public. I did change the return type from "void *" to "struct object *". While the former is a clever way to avoid casting inside the function, it's less safe and less informative to people reading the function declaration. The next commit will add a new caller. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object.h: expand docstring for lookup_unknown_object()Jeff King2021-06-291-1/+12
| | | | | | | | | | The lookup_unknown_object() system is not often used and is somewhat confusing. Let's try to explain it a bit more (which is especially important as I'm adding a related but slightly different function in the next commit). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'jt/push-negotiation'Junio C Hamano2021-05-161-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | "git push" learns to discover common ancestor with the receiving end over protocol v2. * jt/push-negotiation: send-pack: support push negotiation fetch: teach independent negotiation (no packfile) fetch-pack: refactor command and capability write fetch-pack: refactor add_haves() fetch-pack: refactor process_acks()
| * fetch: teach independent negotiation (no packfile)Jonathan Tan2021-05-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the packfile negotiation step within a Git fetch cannot be done independent of sending the packfile, even though there is at least one application wherein this is useful. Therefore, make it possible for this negotiation step to be done independently. A subsequent commit will use this for one such application - push negotiation. This feature is for protocol v2 only. (An implementation for protocol v0 would require a separate implementation in the fetch, transport, and transport helper code.) In the protocol, the main hindrance towards independent negotiation is that the server can unilaterally decide to send the packfile. This is solved by a "wait-for-done" argument: the server will then wait for the client to say "done". In practice, the client will never say it; instead it will cease requests once it is satisfied. In the client, the main change lies in the transport and transport helper code. fetch_refs_via_pack() performs everything needed - protocol version and capability checks, and the negotiation itself. There are 2 code paths that do not go through fetch_refs_via_pack() that needed to be individually excluded: the bundle transport (excluded through requiring smart_options, which the bundle transport doesn't support) and transport helpers that do not support takeover. If or when we support independent negotiation for protocol v0, we will need to modify these 2 code paths to support it. But for now, report failure if independent negotiation is requested in these cases. Signed-off-by: Jonathan Tan <jonathantanmy@google.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | lookup_unknown_object(): take a repository argumentJeff King2021-04-131-1/+1
|/ | | | | | | | | | | | | | | | All of the other lookup_foo() functions take a repository argument, but lookup_unknown_object() was never converted, and it uses the_repository internally. Let's fix that. We could leave a wrapper that uses the_repository, but there aren't that many calls, so we'll just convert them all. I looked briefly at each site to see if we had a repository struct (besides the_repository) we could pass, but none of them do (so this conversion to pass the_repository is a pure noop in each case, though it does take us one step closer to eventually getting rid of the_repository). Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object: allow clear_commit_marks_all to handle any repoRené Scharfe2020-10-311-2/+3
| | | | | | | | | Allow callers to specify the repository to use. Rename the function to repo_clear_commit_marks to document its new scope. No functional change intended. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* maintenance: add auto condition for commit-graph taskDerrick Stolee2020-09-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | Instead of writing a new commit-graph in every 'git maintenance run --auto' process (when maintenance.commit-graph.enalbed is configured to be true), only write when there are "enough" commits not in a commit-graph file. This count is controlled by the maintenance.commit-graph.auto config option. To compute the count, use a depth-first search starting at each ref, and leaving markers using the SEEN flag. If this count reaches the limit, then terminate early and start the task. Otherwise, this operation will peel every ref and parse the commit it points to. If these are all in the commit-graph, then this is typically a very fast operation. Users with many refs might feel a slow-down, and hence could consider updating their limit to be very small. A negative value will force the step to run every time. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'tb/fix-persistent-shallow' into masterJunio C Hamano2020-07-091-0/+1
|\ | | | | | | | | | | | | | | | | | | When "fetch.writeCommitGraph" configuration is set in a shallow repository and a fetch moves the shallow boundary, we wrote out broken commit-graph files that do not match the reality, which has been corrected. * tb/fix-persistent-shallow: commit.c: don't persist substituted parents when unshallowing
| * commit.c: don't persist substituted parents when unshallowingTaylor Blau2020-07-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 37b9dcabfc (shallow.c: use '{commit,rollback}_shallow_file', 2020-04-22), Git knows how to reset stat-validity checks for the $GIT_DIR/shallow file, allowing it to change between a shallow and non-shallow state in the same process (e.g., in the case of 'git fetch --unshallow'). However, when $GIT_DIR/shallow changes, Git does not alter or remove any grafts (nor substituted parents) in memory. This comes up in a "git fetch --unshallow" with fetch.writeCommitGraph set to true. Ordinarily in a shallow repository (and before 37b9dcabfc, even in this case), commit_graph_compatible() would return false, indicating that the repository should not be used to write a commit-graphs (since commit-graph files cannot represent a shallow history). But since 37b9dcabfc, in an --unshallow operation that check succeeds. Thus even though the repository isn't shallow any longer (that is, we have all of the objects), the in-core representation of those objects still has munged parents at the shallow boundaries. When the commit-graph write proceeds, we use the incorrect parentage, producing wrong results. There are two ways for a user to work around this: either (1) set 'fetch.writeCommitGraph' to 'false', or (2) drop the commit-graph after unshallowing. One way to fix this would be to reset the parsed object pool entirely (flushing the cache and thus preventing subsequent reads from modifying their parents) after unshallowing. That would produce a problem when callers have a now-stale reference to the old pool, and so this patch implements a different approach. Instead, attach a new bit to the pool, 'substituted_parent', which indicates if the repository *ever* stored a commit which had its parents modified (i.e., the shallow boundary prior to unshallowing). This bit needs to be sticky because all reads subsequent to modifying a commit's parents are unreliable when unshallowing. Modify the check in 'commit_graph_compatible' to take this bit into account, and correctly avoid generating commit-graphs in this case, thus solving the bug. Helped-by: Derrick Stolee <dstolee@microsoft.com> Helped-by: Jonathan Nieder <jrnieder@gmail.com> Reported-by: Jay Conrod <jayconrod@google.com> Reviewed-by: Jonathan Nieder <jrnieder@gmail.com> Signed-off-by: Taylor Blau <me@ttaylorr.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Merge branch 'rs/pack-bits-in-object-better'Junio C Hamano2020-07-071-2/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | By renumbering object flag bits, "struct object" managed to lose bloated inter-field padding. * rs/pack-bits-in-object-better: revision: reallocate TOPO_WALK object flags
| * | revision: reallocate TOPO_WALK object flagsRené Scharfe2020-06-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The bit fields in struct object have an unfortunate layout. Here's what pahole reports on x86_64 GNU/Linux: struct object { unsigned int parsed:1; /* 0: 0 4 */ unsigned int type:3; /* 0: 1 4 */ /* XXX 28 bits hole, try to pack */ /* Force alignment to the next boundary: */ unsigned int :0; unsigned int flags:29; /* 4: 0 4 */ /* XXX 3 bits hole, try to pack */ struct object_id oid; /* 8 32 */ /* size: 40, cachelines: 1, members: 4 */ /* sum members: 32 */ /* sum bitfield members: 33 bits, bit holes: 2, sum bit holes: 31 bits */ /* last cacheline: 40 bytes */ }; Notice the 1+3+29=33 bits in bit fields and 28+3=31 bits in holes. There are holes inside the flags bit field as well -- while some object flags are used for more than one purpose, 22, 23 and 24 are still free. Use 23 and 24 instead of 27 and 28 for TOPO_WALK_EXPLORED and TOPO_WALK_INDEGREE. This allows us to reduce FLAG_BITS by one so that all bitfields combined fit into a single 32-bit slot: struct object { unsigned int parsed:1; /* 0: 0 4 */ unsigned int type:3; /* 0: 1 4 */ unsigned int flags:28; /* 0: 4 4 */ struct object_id oid; /* 4 32 */ /* size: 36, cachelines: 1, members: 4 */ /* last cacheline: 36 bytes */ }; With this tight packing the size of struct object is reduced by 10%. Other architectures probably benefit as well. Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | | Merge branch 'bc/http-push-flagsfix'Junio C Hamano2020-07-071-1/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code to push changes over "dumb" HTTP had a bad interaction with the commit reachability code due to incorrect allocation of object flag bits, which has been corrected. * bc/http-push-flagsfix: http-push: ensure unforced pushes fail when data would be lost
| * | | http-push: ensure unforced pushes fail when data would be lostbrian m. carlson2020-06-241-1/+1
| | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we push using the DAV-based protocol, the client is the one that performs the ref updates and therefore makes the checks to see whether an unforced push should be allowed. We make this check by determining if either (a) we lack the object file for the old value of the ref or (b) the new value of the ref is not newer than the old value, and in either case, reject the push. However, the ref_newer function, which performs this latter check, has an odd behavior due to the reuse of certain object flags. Specifically, it will incorrectly return false in its first invocation and then correctly return true on a subsequent invocation. This occurs because the object flags used by http-push.c are the same as those used by commit-reach.c, which implements ref_newer, and one piece of code misinterprets the flags set by the other. Note that this does not occur in all cases. For example, if the example used in the tests is changed to use one repository instead of two and rewind the head to add a commit, the test passes and we correctly reject the push. However, the example provided does trigger this behavior, and the code has been broken in this way since at least Git 2.0.0. To solve this problem, let's move the two sets of object flags so that they don't overlap, since we're clearly using them at the same time. The new set should not conflict with other usage because other users are either builtin code (which is not compiled into git http-push) or upload-pack (which we similarly do not use here). Reported-by: Michael Ward <mward@smartsoftwareinc.com> Signed-off-by: René Scharfe <l.s.r@web.de> Signed-off-by: brian m. carlson <sandals@crustytoothpaste.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | | object: drop parsed_object_pool->commit_countAbhishek Kumar2020-06-171-2/+1
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 14ba97f8 (alloc: allow arbitrary repositories for alloc functions, 2018-05-15) introduced parsed_object_pool->commit_count to keep count of commits per repository and was used to assign commit->index. However, commit-slab code requires commit->index values to be unique and a global count would be correct, rather than a per-repo count. Let's introduce a static counter variable, `parsed_commits_count` to keep track of parsed commits so far. As commit_count has no use anymore, let's also drop it from the struct. Signed-off-by: Abhishek Kumar <abhishekkumar8222@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | revision: --show-pulls adds helpful mergesDerrick Stolee2020-04-101-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The default file history simplification of "git log -- <path>" or "git rev-list -- <path>" focuses on providing the smallest set of commits that first contributed a change. The revision walk greatly restricts the set of walked commits by visiting only the first TREESAME parent of a merge commit, when one exists. This means that portions of the commit-graph are not walked, which can be a performance benefit, but can also "hide" commits that added changes but were ignored by a merge resolution. The --full-history option modifies this by walking all commits and reporting a merge commit as "interesting" if it has _any_ parent that is not TREESAME. This tends to be an over-representation of important commits, especially in an environment where most merge commits are created by pull request completion. Suppose we have a commit A and we create a commit B on top that changes our file. When we merge the pull request, we create a merge commit M. If no one else changed the file in the first-parent history between M and A, then M will not be TREESAME to its first parent, but will be TREESAME to B. Thus, the simplified history will be "B". However, M will appear in the --full-history mode. However, suppose that a number of topics T1, T2, ..., Tn were created based on commits C1, C2, ..., Cn between A and M as follows: A----C1----C2--- ... ---Cn----M------P1---P2--- ... ---Pn \ \ \ \ / / / / \ \__.. \ \/ ..__T1 / Tn \ \__.. /\ ..__T2 / \_____________________B \____________________/ If the commits T1, T2, ... Tn did not change the file, then all of P1 through Pn will be TREESAME to their first parent, but not TREESAME to their second. This means that all of those merge commits appear in the --full-history view, with edges that immediately collapse into the lower history without introducing interesting single-parent commits. The --simplify-merges option was introduced to remove these extra merge commits. By noticing that the rewritten parents are reachable from their first parents, those edges can be simplified away. Finally, the commits now look like single-parent commits that are TREESAME to their "only" parent. Thus, they are removed and this issue does not cause issues anymore. However, this also ends up removing the commit M from the history view! Even worse, the --simplify-merges option requires walking the entire history before returning a single result. Many Git users are using Git alongside a Git service that provides code storage alongside a code review tool commonly called "Pull Requests" or "Merge Requests" against a target branch. When these requests are accepted and merged, they typically create a merge commit whose first parent is the previous branch tip and the second parent is the tip of the topic branch used for the request. This presents a valuable order to the parents, but also makes that merge commit slightly special. Users may want to see not only which commits changed a file, but which pull requests merged those commits into their branch. In the previous example, this would mean the users want to see the merge commit "M" in addition to the single- parent commit "C". Users are even more likely to want these merge commits when they use pull requests to merge into a feature branch before merging that feature branch into their trunk. In some sense, users are asking for the "first" merge commit to bring in the change to their branch. As long as the parent order is consistent, this can be handled with the following rule: Include a merge commit if it is not TREESAME to its first parent, but is TREESAME to a later parent. These merges look like the merge commits that would result from running "git pull <topic>" on a main branch. Thus, the option to show these commits is called "--show-pulls". This has the added benefit of showing the commits created by closing a pull request or merge request on any of the Git hosting and code review platforms. To test these options, extend the standard test example to include a merge commit that is not TREESAME to its first parent. It is surprising that that option was not already in the example, as it is instructive. In particular, this extension demonstrates a common issue with file history simplification. When a user resolves a merge conflict using "-Xours" or otherwise ignoring one side of the conflict, they create a TREESAME edge that probably should not be TREESAME. This leads users to become frustrated and complain that "my change disappeared!" In my experience, showing them history with --full-history and --simplify-merges quickly reveals the problematic merge. As mentioned, this option is expensive to compute. The --show-pulls option _might_ show the merge commit (usually titled "resolving conflicts") more quickly. Of course, this depends on the user having the correct parent order, which is backwards when using "git pull master" from a topic branch. There are some special considerations when combining the --show-pulls option with --simplify-merges. This requires adding a new PULL_MERGE object flag to store the information from the initial TREESAME comparisons. This helps avoid dropping those commits in later filters. This is covered by a test, including how the parents can be simplified. Since "struct object" has already ruined its 32-bit alignment by using 33 bits across parsed, type, and flags member, let's not make it worse. PULL_MERGE is used in revision.c with the same value (1u<<15) as REACHABLE in commit-graph.c. The REACHABLE flag is only used when writing a commit-graph file, and a revision walk using --show-pulls does not happen in the same process. Care must be taken in the future to ensure this remains the case. Update Documentation/rev-list-options.txt with significant details around this option. This requires updating the example in the History Simplification section to demonstrate some of the problems with TREESAME second parents. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* pack-bitmap: fix leak of haves/wants object listsJeff King2020-02-131-0/+2
| | | | | | | | | | When we do a bitmap-aware revision traversal, we create an object_list for each of the "haves" and "wants" tips. After creating the result bitmaps these are no longer needed or used, but we never free the list memory. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* commit-graph: fix writing first commit-graph during fetchDerrick Stolee2019-10-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit includes a failing test for an issue around fetch.writeCommitGraph and fetching in a repo with a submodule. Here, we fix that bug and set the test to "test_expect_success". The problem arises with this set of commands when the remote repo at <url> has a submodule. Note that --recurse-submodules is not needed to demonstrate the bug. $ git clone <url> test $ cd test $ git -c fetch.writeCommitGraph=true fetch origin Computing commit graph generation numbers: 100% (12/12), done. BUG: commit-graph.c:886: missing parent <hash1> for commit <hash2> Aborted (core dumped) As an initial fix, I converted the code in builtin/fetch.c that calls write_commit_graph_reachable() to instead launch a "git commit-graph write --reachable --split" process. That code worked, but is not how we want the feature to work long-term. That test did demonstrate that the issue must be something to do with internal state of the 'git fetch' process. The write_commit_graph() method in commit-graph.c ensures the commits we plan to write are "closed under reachability" using close_reachable(). This method walks from the input commits, and uses the UNINTERESTING flag to mark which commits have already been visited. This allows the walk to take O(N) time, where N is the number of commits, instead of O(P) time, where P is the number of paths. (The number of paths can be exponential in the number of commits.) However, the UNINTERESTING flag is used in lots of places in the codebase. This flag usually means some barrier to stop a commit walk, such as in revision-walking to compare histories. It is not often cleared after the walk completes because the starting points of those walks do not have the UNINTERESTING flag, and clear_commit_marks() would stop immediately. This is happening during a 'git fetch' call with a remote. The fetch negotiation is comparing the remote refs with the local refs and marking some commits as UNINTERESTING. I tested running clear_commit_marks_many() to clear the UNINTERESTING flag inside close_reachable(), but the tips did not have the flag, so that did nothing. It turns out that the calculate_changed_submodule_paths() method is at fault. Thanks, Peff, for pointing out this detail! More specifically, for each submodule, the collect_changed_submodules() runs a revision walk to essentially do file-history on the list of submodules. That revision walk marks commits UNININTERESTING if they are simplified away by not changing the submodule. Instead, I finally arrived on the conclusion that I should use a flag that is not used in any other part of the code. In commit-reach.c, a number of flags were defined for commit walk algorithms. The REACHABLE flag seemed like it made the most sense, and it seems it was not actually used in the file. The REACHABLE flag was used in early versions of commit-reach.c, but was removed by 4fbcca4 (commit-reach: make can_all_from_reach... linear, 2018-07-20). Add the REACHABLE flag to commit-graph.c and use it instead of UNINTERESTING in close_reachable(). This fixes the bug in manual testing. Reported-by: Johannes Schindelin <johannes.schindelin@gmx.de> Helped-by: Jeff King <peff@peff.net> Helped-by: Szeder Gábor <szeder.dev@gmail.com> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object: convert create_object() to use object_idJeff King2019-06-201-1/+1
| | | | | | | | | There are no callers left of create_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object: convert lookup_object() to use object_idJeff King2019-06-201-1/+1
| | | | | | | | | | | | | There are no callers left of lookup_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. It also matches the existing conversions of lookup_blob(), etc. The conversions of callers were done by hand, but they're all mechanical one-liners. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* object: convert lookup_unknown_object() to use object_idJeff King2019-06-201-1/+1
| | | | | | | | | | | | | There are no callers left of lookup_unknown_object() that aren't just passing us the "hash" member of a "struct object_id". Let's take the whole struct, which gets us closer to removing all raw sha1 variables. It also matches the existing conversions of lookup_blob(), etc. The conversions of callers were done by hand, but they're all mechanical one-liners. Signed-off-by: Jeff King <peff@peff.net> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* *.[ch]: remove extern from function declarations using spatchDenton Liu2019-05-051-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There has been a push to remove extern from function declarations. Remove some instances of "extern" for function declarations which are caught by Coccinelle. Note that Coccinelle has some difficulty with processing functions with `__attribute__` or varargs so some `extern` declarations are left behind to be dealt with in a future patch. This was the Coccinelle patch used: @@ type T; identifier f; @@ - extern T f(...); and it was run with: $ git ls-files \*.{c,h} | grep -v ^compat/ | xargs spatch --sp-file contrib/coccinelle/noextern.cocci --in-place Files under `compat/` are intentionally excluded as some are directly copied from external sources and we should avoid churning them as much as possible. Signed-off-by: Denton Liu <liu.denton@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* revision.c: generation-based topo-order algorithmDerrick Stolee2018-11-021-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The current --topo-order algorithm requires walking all reachable commits up front, topo-sorting them, all before outputting the first value. This patch introduces a new algorithm which uses stored generation numbers to incrementally walk in topo-order, outputting commits as we go. This can dramatically reduce the computation time to write a fixed number of commits, such as when limiting with "-n <N>" or filling the first page of a pager. When running a command like 'git rev-list --topo-order HEAD', Git performed the following steps: 1. Run limit_list(), which parses all reachable commits, adds them to a linked list, and distributes UNINTERESTING flags. If all unprocessed commits are UNINTERESTING, then it may terminate without walking all reachable commits. This does not occur if we do not specify UNINTERESTING commits. 2. Run sort_in_topological_order(), which is an implementation of Kahn's algorithm. It first iterates through the entire set of important commits and computes the in-degree of each (plus one, as we use 'zero' as a special value here). Then, we walk the commits in priority order, adding them to the priority queue if and only if their in-degree is one. As we remove commits from this priority queue, we decrement the in-degree of their parents. 3. While we are peeling commits for output, get_revision_1() uses pop_commit on the full list of commits computed by sort_in_topological_order(). In the new algorithm, these three steps correspond to three different commit walks. We run these walks simultaneously, and advance each only as far as necessary to satisfy the requirements of the 'higher order' walk. We know when we can pause each walk by using generation numbers from the commit- graph feature. Recall that the generation number of a commit satisfies: * If the commit has at least one parent, then the generation number is one more than the maximum generation number among its parents. * If the commit has no parent, then the generation number is one. There are two special generation numbers: * GENERATION_NUMBER_INFINITY: this value is 0xffffffff and indicates that the commit is not stored in the commit-graph and the generation number was not previously calculated. * GENERATION_NUMBER_ZERO: this value (0) is a special indicator to say that the commit-graph was generated by a version of Git that does not compute generation numbers (such as v2.18.0). Since we use generation_numbers_enabled() before using the new algorithm, we do not need to worry about GENERATION_NUMBER_ZERO. However, the existence of GENERATION_NUMBER_INFINITY implies the following weaker statement than the usual we expect from generation numbers: If A and B are commits with generation numbers gen(A) and gen(B) and gen(A) < gen(B), then A cannot reach B. Thus, we will walk in each of our stages until the "maximum unexpanded generation number" is strictly lower than the generation number of a commit we are about to use. The walks are as follows: 1. EXPLORE: using the explore_queue priority queue (ordered by maximizing the generation number), parse each reachable commit until all commits in the queue have generation number strictly lower than needed. During this walk, update the UNINTERESTING flags as necessary. 2. INDEGREE: using the indegree_queue priority queue (ordered by maximizing the generation number), add one to the in- degree of each parent for each commit that is walked. Since we walk in order of decreasing generation number, we know that discovering an in-degree value of 0 means the value for that commit was not initialized, so should be initialized to two. (Recall that in-degree value "1" is what we use to say a commit is ready for output.) As we iterate the parents of a commit during this walk, ensure the EXPLORE walk has walked beyond their generation numbers. 3. TOPO: using the topo_queue priority queue (ordered based on the sort_order given, which could be commit-date, author- date, or typical topo-order which treats the queue as a LIFO stack), remove a commit from the queue and decrement the in-degree of each parent. If a parent has an in-degree of one, then we add it to the topo_queue. Before we decrement the in-degree, however, ensure the INDEGREE walk has walked beyond that generation number. The implementations of these walks are in the following methods: * explore_walk_step and explore_to_depth * indegree_walk_step and compute_indegrees_to_depth * next_topo_commit and expand_topo_walk These methods have some patterns that may seem strange at first, but they are probably carry-overs from their equivalents in limit_list and sort_in_topological_order. One thing that is missing from this implementation is a proper way to stop walking when the entire queue is UNINTERESTING, so this implementation is not enabled by comparisions, such as in 'git rev-list --topo-order A..B'. This can be updated in the future. In my local testing, I used the following Git commands on the Linux repository in three modes: HEAD~1 with no commit-graph, HEAD~1 with a commit-graph, and HEAD with a commit-graph. This allows comparing the benefits we get from parsing commits from the commit-graph and then again the benefits we get by restricting the set of commits we walk. Test: git rev-list --topo-order -100 HEAD HEAD~1, no commit-graph: 6.80 s HEAD~1, w/ commit-graph: 0.77 s HEAD, w/ commit-graph: 0.02 s Test: git rev-list --topo-order -100 HEAD -- tools HEAD~1, no commit-graph: 9.63 s HEAD~1, w/ commit-graph: 6.06 s HEAD, w/ commit-graph: 0.06 s This speedup is due to a few things. First, the new generation- number-enabled algorithm walks commits on order of the number of results output (subject to some branching structure expectations). Since we limit to 100 results, we are running a query similar to filling a single page of results. Second, when specifying a path, we must parse the root tree object for each commit we walk. The previous benefits from the commit-graph are entirely from reading the commit-graph instead of parsing commits. Since we need to parse trees for the same number of commits as before, we slow down significantly from the non-path-based query. For the test above, I specifically selected a path that is changed frequently, including by merge commits. A less-frequently-changed path (such as 'README') has similar end-to-end time since we need to walk the same number of commits (before determining we do not have 100 hits). However, get the benefit that the output is presented to the user as it is discovered, much the same as a normal 'git log' command (no '--topo-order'). This is an improved user experience, even if the command has the same runtime. Helped-by: Jeff King <peff@peff.net> Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* Merge branch 'ds/reachable'Junio C Hamano2018-09-171-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The code for computing history reachability has been shuffled, obtained a bunch of new tests to cover them, and then being improved. * ds/reachable: commit-reach: correct accidental #include of C file commit-reach: use can_all_from_reach commit-reach: make can_all_from_reach... linear commit-reach: replace ref_newer logic test-reach: test commit_contains test-reach: test can_all_from_reach_with_flags test-reach: test reduce_heads test-reach: test get_merge_bases_many test-reach: test is_descendant_of test-reach: test in_merge_bases test-reach: create new test tool for ref_newer commit-reach: move can_all_from_reach_with_flags upload-pack: generalize commit date cutoff upload-pack: refactor ok_to_give_up() upload-pack: make reachable() more generic commit-reach: move commit_contains from ref-filter commit-reach: move ref_newer from remote.c commit.h: remove method declarations commit-reach: move walk methods from commit.c
| * commit-reach: move can_all_from_reach_with_flagsDerrick Stolee2018-07-211-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several commit walks in the codebase. Group them together into a new commit-reach.c file and corresponding header. After we group these walks into one place, we can reduce duplicate logic by calling equivalent methods. The can_all_from_reach_with_flags method is used in a stateful way by upload-pack.c. The parameters are very flexible, so we will be able to use its commit walking logic for many other callers. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
| * commit-reach: move walk methods from commit.cDerrick Stolee2018-07-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several commit walks in the codebase. Group them together into a new commit-reach.c file and corresponding header. After we group these walks into one place, we can reduce duplicate logic by calling equivalent methods. The method declarations in commit.h are not touched by this commit and will be moved in a following commit. Many consumers need to point to commit-reach.h and that would bloat this commit. Signed-off-by: Derrick Stolee <dstolee@microsoft.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>
* | Add missing includes and forward declarationsElijah Newren2018-08-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | I looped over the toplevel header files, creating a temporary two-line C program for each consisting of #include "git-compat-util.h" #include $HEADER This patch is the result of manually fixing errors in compiling those tiny programs. Signed-off-by: Elijah Newren <newren@gmail.com> Signed-off-by: Junio C Hamano <gitster@pobox.com>