midx: don't reuse corrupt MIDXs when writing

When writing a new multi-pack index, Git tries to reuse as much of the data from an existing MIDX as possible, like object offsets. This is done to avoid re-opening a bunch of *.idx files unnecessarily, but can lead to problems if the data we are reusing is corrupt.

That's because we'll blindly reuse data from an existing MIDX without checking its trailing checksum for validity. So if there is memory corruption while writing a MIDX, or disk corruption in the intervening period between writing and reuse, we'll blindly propagate those bad values forward.

Suppose we experience a memory corruption while writing a MIDX such that we write an incorrect object offset (or alternatively, the disk corrupts the data after being written, but before being reused). Then when we go to write a new MIDX, we'll reuse the bad object offset without checking its validity. This means that the MIDX we just wrote is broken, but its trailing checksum is intact, since we never bothered to look at the values before writing.

In the above, a "git multi-pack-index verify" would have caught the problem before writing, but writing a new MIDX wouldn't have noticed anything wrong, blindly carrying forward the corrupt offset.

Individual pack indexes check their validity by verifying the crc32 attached to each entry when carrying data forward during a repack. We could solve this problem for MIDXs in the same way, but per-entry crc32s don't make much sense, since MIDX entries are so small. Likewise, checking the whole file on every read may be prohibitively expensive if a repository has a lot of objects, packs, or both.

But we can check the trailing checksum when reusing an existing MIDX to write a new one. And a corrupt MIDX need not stop us from writing a new one, since we can simply avoid reusing the existing one at all and pretend that we are writing a new MIDX from scratch.

Suggested-by: Derrick Stolee <dstolee@microsoft.com>
Signed-off-by: Taylor Blau <me@ttaylorr.com>
Signed-off-by: Junio C Hamano <gitster@pobox.com>
author: Taylor Blau <me@ttaylorr.com> 2021-06-23 20:39:12 +0200
committer: Junio C Hamano <gitster@pobox.com> 2021-06-29 05:36:17 +0200
commit: ec1e28ef9c30468d2e76e41c88a1611e63047f61 (patch)
tree: 22b8ed8b8995c49dc4e566290a424ae359f205fd /midx.c
parent: commit-graph: rewrite to use checksum_valid() (diff)
1 files changed, 10 insertions, 0 deletions
diff --git a/midx.c b/midx.c
index 21d6a05e88..a12cbbf928 100644
--- a/midx.c
+++ b/midx.c
@@ -885,6 +885,11 @@ static void write_midx_reverse_index(char *midx_name, unsigned char *midx_hash,
 static void clear_midx_files_ext(struct repository *r, const char *ext,
 				 unsigned char *keep_hash);
 
+static int midx_checksum_valid(struct multi_pack_index *m)
+{
+	return hashfile_checksum_valid(m->data, m->data_len);
+}
+
 static int write_midx_internal(const char *object_dir, struct multi_pack_index *m,
 			       struct string_list *packs_to_drop,
 			       const char *preferred_pack_name,
@@ -911,6 +916,11 @@ static int write_midx_internal(const char *object_dir, struct multi_pack_index *
 	else
 		ctx.m = load_multi_pack_index(object_dir, 1);
 
+	if (ctx.m && !midx_checksum_valid(ctx.m)) {
+		warning(_("ignoring existing multi-pack-index; checksum mismatch"));
+		ctx.m = NULL;
+	}
+
 	ctx.nr = 0;
 	ctx.alloc = ctx.m ? ctx.m->num_packs : 16;
 	ctx.info = NULL;
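
The check-then-fall-back pattern in this patch can be illustrated outside of Git with a minimal, self-contained sketch. Everything below is an assumption made for illustration: the helper names (`seal`, `trailing_checksum_valid`, `reusable_or_null`) are hypothetical, and a 64-bit FNV-1a checksum stands in for the repository hash that Git stores at the end of a MIDX. It is not how `hashfile_checksum_valid()` is implemented; it only shows the idea of recomputing a checksum over the body, comparing it to the trailer, and treating the existing file as absent on a mismatch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Stand-in trailer size; a real MIDX trailer is as long as the repository hash. */
#define TRAILER_LEN 8

/* Stand-in checksum (64-bit FNV-1a). This is NOT the hash Git uses. */
static uint64_t fnv1a64(const unsigned char *p, size_t n)
{
	uint64_t h = 0xcbf29ce484222325ULL;
	size_t i;

	for (i = 0; i < n; i++) {
		h ^= p[i];
		h *= 0x100000001b3ULL;
	}
	return h;
}

/* Store a 64-bit value in big-endian byte order. */
static void store_be64(unsigned char *out, uint64_t v)
{
	int i;

	for (i = 0; i < 8; i++)
		out[i] = (unsigned char)(v >> (56 - 8 * i));
}

/* Hypothetical writer side: append the checksum of the body as a trailer. */
static void seal(unsigned char *buf, size_t body_len)
{
	assert(buf != NULL);
	store_be64(buf + body_len, fnv1a64(buf, body_len));
}

/* Return 1 if the trailer matches a checksum recomputed over the body, else 0. */
static int trailing_checksum_valid(const unsigned char *data, size_t total_len)
{
	unsigned char expect[TRAILER_LEN];
	size_t body_len;

	if (!data || total_len < TRAILER_LEN)
		return 0;
	body_len = total_len - TRAILER_LEN;
	store_be64(expect, fnv1a64(data, body_len));
	return !memcmp(expect, data + body_len, TRAILER_LEN);
}

/*
 * Hypothetical reuse decision, mirroring the shape of the patch: a corrupt
 * existing file is ignored with a warning, not treated as a fatal error.
 */
static const unsigned char *reusable_or_null(const unsigned char *data,
					     size_t total_len)
{
	if (data && !trailing_checksum_valid(data, total_len)) {
		fprintf(stderr, "warning: ignoring existing index; checksum mismatch\n");
		return NULL;
	}
	return data;
}
```

A caller would pass the mapped file contents to `reusable_or_null()` and, on a NULL result, proceed exactly as if no previous file existed. The cost is one pass over the existing file per write, which is paid only when writing and not on every read.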