Git export substitutions and reproducibility

2022 June 10, 19:34

There is an issue that I encountered some months ago (and mentioned on the Fediverse back then), and I am still occasionally reminded of it when troubleshooting weird behavior with Nix, so it might be interesting to take a closer look at it. In short: when Git's export substitutions are in use, a tarball that Git exports from a repository at a particular revision is not guaranteed to be byte-for-byte identical every time; the result can depend on external factors.

To understand how this quirk affects Nix, we have to understand both what Nix's fixed-output derivations are and how some rather obscure Git features interact with Git internals.

What's a derivation

If you have had passing contact with Nix and Nixpkgs, you may have heard people refer to packages from Nixpkgs as derivations. A derivation is, broadly, a thing that describes a build action that Nix can take.

Derivations take as inputs whatever is needed to carry out the build action, contain a script that specifies how to do the build, and register whatever that script produced as outputs. Suppose a simple C program: the inputs to its derivation would include things like the source code of the program, a C compiler, Make, perhaps some library that the program needs; the build script would call make and make install; and the output produced would be the executable binary that make install copied out.

Diagram of a derivation with several inputs and one output — The derivation for `simple-c-program`, depicted as a diagram

Inputs to our derivation are outputs of other derivations. Those derivations can, of course, have their own inputs, which are other derivations, and so on—this is how we build our dependency graph.

Diagram of several derivations, with outputs of some connected to inputs of others — Several derivations and how they depend on each other. Note that the output of `gcc` is fed into two different derivations.

This is also where Nix's functional nature comes in: a derivation whose input derivations are unchanged, and whose build script is unchanged is assumed to always produce the same output—like a pure function. This is what allows for a Nix binary cache: instead of building the derivation ourselves, we can download the output of the same derivation built on some other machine, because that output is assumed to be the same.
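The "pure function" idea can be illustrated with a toy model. This is only a sketch: `derivation_key` is a hypothetical helper, and the way it hashes the script and input keys is an assumption made for illustration, not how Nix actually computes store paths.

```python
import hashlib
import json

def derivation_key(script, input_keys):
    """Toy identity for a derivation: a hash over its build script and
    the keys of its input derivations (hypothetical, not Nix's scheme)."""
    payload = json.dumps({"script": script, "inputs": sorted(input_keys)})
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()

# A small, made-up dependency graph.
gcc = derivation_key("build gcc", [])
make = derivation_key("build make", [gcc])
program = derivation_key("make && make install", [gcc, make])

# Changing one input changes the key of everything that depends on it.
patched_gcc = derivation_key("build gcc with a patch", [])
patched_make = derivation_key("build make", [patched_gcc])
patched_program = derivation_key("make && make install", [patched_gcc, patched_make])
```

As long as the key is unchanged, an output built elsewhere under the same key can be substituted from a cache; once any input changes, the key changes and the cached output no longer applies.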

What's a fixed-output derivation

In our previous example, one of the inputs to the simple C program derivation was its source code. This presents a problem: source code is not really the output of a build process that takes other inputs. Instead, source code is an input from the outside world.

To address this, Nix has a special type of derivation: a fixed-output derivation (often abbreviated as FOD). Like ordinary derivations, fixed-output ones have inputs, a build script, and produce an output, but in addition they also contain the expected cryptographic hash value of the produced output.

After Nix carries out the build action specified by a fixed-output derivation's build script, it hashes the produced output, and checks it against the pre-recorded expected hash. If the hashes do not match, the build is considered to have failed. Unlike with ordinary derivations, a fixed-output derivation is considered unchanged if its expected hash remains unchanged—its script and inputs can change, as long as it produces the same exact thing as before.
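The check described above can be sketched in a few lines. The names `check_fixed_output` and `HashMismatch` are made up for this illustration, and SHA-256 over raw bytes is an assumption; it is a simplified model of the idea, not Nix's implementation.

```python
import hashlib

class HashMismatch(Exception):
    """Raised when the produced output does not match the expected hash."""

def check_fixed_output(output: bytes, expected_sha256: str) -> bytes:
    # Hash whatever the build script produced...
    actual = hashlib.sha256(output).hexdigest()
    # ...and fail the build if it is not what was recorded in advance.
    if actual != expected_sha256:
        raise HashMismatch(f"expected {expected_sha256}, got {actual}")
    return output

# Hypothetical "fetched" source code and its pre-recorded hash.
source = b"int main(void) { return 0; }\n"
expected = hashlib.sha256(source).hexdigest()
```

Note that only the output and the expected hash take part in the check: how the output was obtained does not matter, which is why the script and inputs of a fixed-output derivation can change freely.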

While builds carried out from ordinary derivations do not have network access, fixed-output derivation builds do. In practice, fixed-output derivations are thus often used to fetch source code from the Internet. Because we have to specify the hash of the source code to be fetched, we can be reasonably certain that we are fetching the same source code every time we rebuild the derivation.

Git export substitutions

Git has a handy command for exporting the tree at a given commit to a single archive file (tar or ZIP): git archive. This is useful for situations such as distributing a source code release: simply git archive the repository at the relevant tag, and publish the resulting tarball.

Not having the Git repository around can pose some problems, though. For example, some build processes expect to be able to discover the current Git commit hash, because they embed it in the version information of the binary they produce. To address this, Git provides a means of doing export substitutions. Using .gitattributes, one can specify a list of files that should have placeholder substitution performed on them when the repository is exported to an archive. These placeholders can be for things like the current commit hash (including in its abbreviated version), the output of git describe, or metadata like the commit author and date.
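In Git, a file is marked for substitution with the export-subst attribute in .gitattributes, and placeholders are written as `$Format:...$`, for example `$Format:%H$` for the full commit hash or `$Format:%h$` for the abbreviated one. The following is a minimal simulation of that behavior; it only handles placeholders that consist of exactly `%H` or `%h`, whereas Git accepts richer format strings. The function name and the commit hash are made up.

```python
import re

def export_subst(text: str, full_hash: str, abbrev_len: int) -> str:
    """Simulate export substitution for the %H and %h placeholders only."""
    values = {"%H": full_hash, "%h": full_hash[:abbrev_len]}

    def replace(match):
        # Leave placeholders this sketch does not understand untouched.
        return values.get(match.group(1), match.group(0))

    return re.sub(r"\$Format:([^$]*)\$", replace, text)

# Hypothetical commit hash, used purely for illustration.
COMMIT = "0123456789abcdef0123456789abcdef01234567"
template = "revision: $Format:%h$\n"
```

Note that the abbreviated placeholder depends on an extra parameter, the abbreviation length, which is not a property of the commit itself.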

Abbreviated commit hashes

Git uses abbreviated commit hashes in many places in its UI. While a full Git commit hash is 40 hexadecimal digits long (as SHA-1 hashes generally are), commits can usually be unambiguously referred to with a smaller number of digits.

The bigger a repository—or, more precisely, the more objects a repository contains—the larger the probability that two object hashes will share the same prefix of some length. A long time ago (before 2016), Git used to default to 7 digits. As time went on and some repositories grew in size, it turned out that 7 digits, or even 8, 9, and larger amounts were not enough to unambiguously refer to objects in those repositories. Because of this, a heuristic was added to Git to estimate how long an abbreviated hash needs to be, based on the estimate of the object count of the repository. This is not an exact determination of the minimum unambiguous length of a hash, but rather a relatively fast guess. It is used by default in various outputs of Git, though a fixed length can be set via configuration.
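To get a feel for how such a heuristic behaves, here is a rough sketch. It is modeled on my reading of Git's source and should be treated as an approximation: the idea is that with about 2^n objects, a prefix of about 2n bits (n/2 hex digits, rounded up) keeps collisions unlikely, with 7 digits as the floor for small repositories. The function name is made up, and the details may differ between Git versions.

```python
FALLBACK_ABBREV = 7  # assumed minimum length used for small repositories

def estimated_abbrev_length(object_count: int) -> int:
    """Estimate how many hex digits an abbreviated hash should have."""
    # Number of bits needed to represent the (approximate) object count.
    bits = object_count.bit_length()
    # Twice as many bits are needed to keep collisions unlikely, and each
    # hex digit carries 4 bits, so overall: divide by 2 and round up.
    hex_digits = (bits + 1) // 2
    return max(hex_digits, FALLBACK_ABBREV)
```

Under these assumptions, a repository with a thousand objects gets 7 digits, while one with a million objects gets 10. The important property is that the result depends on the object count, which is a property of a particular copy of the repository rather than of the commit being abbreviated.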

GitHub tarball exports and `fetchFromGitHub`

GitHub repository pages offer an option of downloading the current tree as an archive file.

This functionality internally uses git archive, which means that the downloaded archive has the relevant placeholders substituted with actual values. The archive downloads are also available under a predictable URL, which is convenient for scripting.

Nixpkgs contains a fetchFromGitHub function for, predictably enough, creating fixed-output derivations that fetch source code from GitHub. As an optimization, when possible, fetchFromGitHub will opt for downloading an archive tarball (GitHub provides both tar.gz and zip archives). This is desirable, because we usually do not need a Git repository clone, and the tarball is a compressed (thus smaller) archive that can be quickly and easily downloaded over HTTPS.
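As an illustration of the predictable URL, the sketch below builds an archive URL from an owner, a repository name, and a revision. The URL pattern is an assumption based on what GitHub serves at the time of writing, and `github_archive_url` as well as the owner and repository names are hypothetical.

```python
def github_archive_url(owner: str, repo: str, rev: str, ext: str = "tar.gz") -> str:
    """Build the (assumed) download URL of a GitHub archive for a revision."""
    if ext not in ("tar.gz", "zip"):
        raise ValueError(f"unsupported archive type: {ext}")
    return f"https://github.com/{owner}/{repo}/archive/{rev}.{ext}"

# Hypothetical repository and tag.
url = github_archive_url("example-owner", "example-repo", "v1.0")
```

The revision can be a tag, a branch name, or a commit hash, which is what makes this form convenient for a fetcher that only knows the owner, repository, and revision.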

The problem

Consider a repository that uses Git export substitutions to place the abbreviated hash somewhere in the exported archive file. If we use this exported file in a fixed-output derivation, it will work as long as the length of the abbreviated hash does not change. If it does change, the cryptographic hash of the archive will be different, and our fixed-output derivation's build will start failing.
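A small experiment makes this concrete. The sketch below builds two in-memory tar archives of the same hypothetical commit that differ only in how many digits of the hash were substituted into a version file, then hashes them. Hashing the raw tar bytes is a simplification (a real fetcher may hash the unpacked contents instead), and all names here are made up.

```python
import hashlib
import io
import tarfile

# Hypothetical commit hash, used purely for illustration.
COMMIT = "0123456789abcdef0123456789abcdef01234567"

def make_archive(files: dict) -> bytes:
    """Build a deterministic, uncompressed tar archive in memory."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):
            data = files[name]
            info = tarfile.TarInfo(name)  # mtime, uid and gid default to 0
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()

def archive_hash(abbrev_len: int) -> str:
    """Hash an export of COMMIT whose version file embeds an abbreviated hash."""
    files = {
        "main.c": b"int main(void) { return 0; }\n",
        "VERSION": f"revision: {COMMIT[:abbrev_len]}\n".encode("ascii"),
    }
    return hashlib.sha256(make_archive(files)).hexdigest()

short_hash = archive_hash(7)
long_hash = archive_hash(8)
```

The two hashes differ even though both archives represent the same commit, and a fixed-output derivation that recorded `short_hash` would reject the archive that hashes to `long_hash`.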

The length Git picks for the abbreviated hash can change over time: if people keep making new commits in the repository, the number of objects Git keeps track of will increase, and the heuristic will pick more digits for the abbreviated hash. Under this assumption, our fetchFromGitHub derivation can become invalid at some point after we write it, even if we are still fetching the same revision.

In practice, however, this is more complicated. Practical tests show that repeatedly downloading a repository archive from the same URL, within a short span of time, can result in getting versions that include both shorter and longer substituted hashes, seemingly at random.
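A test of this kind could look roughly like the following sketch: download the same URL a number of times and tally the hashes of the responses. The helper names are hypothetical, and the download function is kept separate so that the tallying logic can be tried without network access.

```python
import hashlib
from collections import Counter
from typing import Callable
from urllib.request import urlopen

def fetch_url(url: str) -> bytes:
    """Download a URL and return the response body."""
    with urlopen(url) as response:
        return response.read()

def tally_hashes(fetch: Callable[[], bytes], attempts: int) -> Counter:
    """Call fetch() repeatedly and count how often each SHA-256 hash occurs."""
    return Counter(hashlib.sha256(fetch()).hexdigest() for _ in range(attempts))

# Example use against a real archive URL (not executed here):
#   tally = tally_hashes(lambda: fetch_url(archive_url), attempts=20)
#   print(len(tally), "distinct archives seen")
```

More than one distinct hash for a single revision means the archive is not stable. Differing compressed bytes alone do not show what changed, though, so the archives would still need to be unpacked and compared to confirm that the substituted hash is the culprit.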

We can speculate on why that is. Consider that the hash length estimation is based on the number of all objects in the Git repository, which does not necessarily include only the objects within the current commit tree (which is to say objects associated with the current and previous commits). Such a Git repository can include other things, such as orphaned commits which have not been garbage collected yet, or branches for things like pending pull requests. GitHub could, conceivably, have several servers which hold copies of a given repository, with each containing the entire main branch, but some of these servers could have differing subsets of other objects.

If our request can potentially go to any of those servers for load-balancing purposes, then we could end up with different tarballs based on how many objects the given server holds. Yet another reason would be some form of caching, where one cache server holds a tarball generated at a time when a shorter hash sufficed, while another has to regenerate the tarball from the current repository state. The details are opaque to us since we are (presumably) not GitHub, but we can certainly come up with plausible scenarios.

Solutions

The obvious solution is to not use abbreviated hashes in export substitutions. If the commit hash is to be embedded somewhere in the built artifacts, using the full hash ensures a far smaller chance of a collision in the future either way (if the full SHA-1 hash collides, then we are really in trouble).

Outside of that, fetchFromGitHub can be forced to download via Git, rather than via an exported archive. This can complicate the build process, but a Git checkout should be entirely reproducible (Git commits are, after all, referenced by cryptographic hashes themselves).

Generally, when building from a tagged release, embedding Git revision hashes may not be necessary. The tag will exist in the repository and point to the given revision, and in general should not be moved anyway. Packages in Nixpkgs are most commonly tagged releases, rather than arbitrary commits from the trunk branch, and those commonly embed version numbers, rather than the precise commit hash.