Git as a database, kind of

Published: 2024 June 14, 00:51
Updated: 2024 June 17, 14:39

I run NixOS on multiple machines. I store configuration for all of them in a Git monorepo, and deploy to them via some bespoke scripts (despite there being multiple mature deployment tools out there for the purpose). One problem I have with this setup is that I do not keep close track of what revision my hosts are on. Not all of the machines are on at all times, so I cannot simply ssh into each one to check, and those that have not been turned on in a while may be further out of date than the others.

One way to solve this would be to have the update scripts record every update action in some out-of-band log. Another way would be to have Git branches which track the revision that each machine is on, similar to how the channel branches in Nixpkgs keep track of the channel state.

A less reasonable way would be to use log files that are tracked by Git. A script could easily modify a JSON file kept in the main branch of the repo, but keeping it there would lead to a lot of noise in the commit log. A better solution would be a separate, detached branch. There actually is precedent for keeping data in a separate Git branch, with the likes of git-bug or git-annex!

Keeping a whole working tree checked out for the purpose of tracking the JSON branch would get annoying, however. A neater solution would be to access Git at a lower level, generating commits without checking anything out. While there are libraries for a variety of programming languages that can be used for this sort of lower-level access, it can also be done using various git commands, and so it is possible to do from a shell script. So, let's do that, using Nushell to make things fun.

Git internals

To accomplish our task, we will first need a quick recap (or perhaps a rapid introduction) of Git internals.

A central piece of how Git works is the content-addressable store. It holds objects, addressed by their SHA-1 hashes (or SHA-256, when using that particular experimental feature). The object store lives inside the .git/ directory of a repository, but its internal implementation details are not particularly relevant here, as Git provides abstracted access to it, even for lower-level use.

There are several types of objects that can be stored. Relevant here are blobs, trees, and commits. A blob is the simplest—it is the contents of a file that has been put into Git. A blob object does not contain the file's metadata, like the filename or modification date, just the contents.

A tree is a directory tree. It is, essentially, a list of paths (names), each mapping to a hash address of either a blob, or another tree. Having a name point to the hash of a blob allows us to represent a file; having a name point to another tree is the way to represent a subdirectory. It is possible for multiple different trees in the store to point to the same file (blob) or subdirectory (tree).
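As a rough sketch of what a tree looks like once stored, the following Python serialises a flat list of entries the way Git does for SHA-1 repositories: each entry is the mode, a space, the name, a null byte, and then the 20 raw bytes of the hash. The helper names are hypothetical, and the sorting is simplified (Git orders subdirectory names as if they ended in a slash, which is ignored here).

```python
import hashlib


def tree_payload(entries: list[tuple[str, str, str]]) -> bytes:
    """Serialise (mode, name, hex_id) entries into a raw tree payload."""
    out = b""
    # Simplification: plain byte-wise sort by name, fine for flat file lists.
    for mode, name, hex_id in sorted(entries, key=lambda e: e[1].encode()):
        # "<mode> <name>\0" followed by the binary (not hex) object ID.
        out += f"{mode} {name}\0".encode() + bytes.fromhex(hex_id)
    return out


def tree_id(entries: list[tuple[str, str, str]]) -> str:
    """Hash the payload with the usual "<type> <size>\\0" header."""
    payload = tree_payload(entries)
    header = f"tree {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    empty_blob = "e69de29bb2d1e6917ea6359fa85aba7a3d5b0ee2"
    # Two different trees can point at the very same blob.
    print(tree_id([("100644", "a.json", empty_blob)]))
    print(tree_id([("100644", "b.json", empty_blob)]))
```

Since the name lives in the tree entry rather than in the blob, renaming a file changes the tree's hash but leaves the blob untouched.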

Lastly, a commit is, well, a Git commit. The object contains some metadata, like the author and committer (which can be two different people in Git), date, and the commit message. The object also contains the hash of a tree; the tree records how the tracked directory looked at commit time.

Additionally, the commit object can contain hashes of its parent commits. The root commits (often there is just one) in a commit history will have no parents, an ordinary commit will have one parent, and commits representing merges have two or more parents. Put together, it looks something like this:

Figure: a commit object, a tree object, and two blob objects, with the relationships described above. Caption: An example commit, and the objects it references. The commit has no parents, so it would be a root (first) commit in the commit history graph. In reality, tree and commit objects contain more metadata than this.
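The commit object itself is plain text, which makes it easy to sketch. The Python below builds a commit payload by hand, assuming SHA-1 and ignoring optional headers such as signatures or encodings; the function names and the example identity are invented for illustration. The author and committer lines carry a name, an email address, a Unix timestamp, and a timezone offset.

```python
import hashlib


def commit_payload(tree: str, parents: list[str], author: str,
                   committer: str, message: str) -> bytes:
    """Build the text of a commit object (without the object header)."""
    lines = [f"tree {tree}"]
    # Zero parent lines for a root commit, one normally, several for a merge.
    lines += [f"parent {parent}" for parent in parents]
    lines.append(f"author {author}")
    lines.append(f"committer {committer}")
    # A blank line separates the headers from the commit message.
    return ("\n".join(lines) + "\n\n" + message + "\n").encode()


def commit_id(payload: bytes) -> str:
    """Hash the payload with the usual "<type> <size>\\0" header."""
    header = f"commit {len(payload)}\0".encode("ascii")
    return hashlib.sha1(header + payload).hexdigest()


if __name__ == "__main__":
    empty_tree = "4b825dc642cb6eb9a060e54bf8d69288fbee4904"
    who = "Example Author <author@example.com> 1717200000 +0000"
    root = commit_payload(empty_tree, [], who, who, "first")
    child = commit_payload(empty_tree, [commit_id(root)], who, who, "second")
    print(child.decode())
```

Because the timestamps are part of the hashed text, the same tree committed a second later gets a different commit hash.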

So, to insert a commit we need to first put some blobs in the store. Then, we can use the hashes of those blobs to write a tree to the store, which will represent the state we want to commit. Then, we need to write a commit that points to that tree, possibly indicating some other commit as its parent.

First commit

For a simple start, let's make a new branch with a single commit, that contains a single JSON file. Assuming we are starting out with some Nushell structured data, the first step is to turn it into JSON. Then, we can add it to the repo's object store with git hash-object. git hash-object only hashes an object, unless we give it -w, in which case it will also actually write the object to the store.

┃ > git init
Initialized empty Git repository in /wherever/.git/
┃ > [{
┇     date: "2024-06-01T00:00:00Z",
┇     revision: 1,
┇     result: "fire"
┇   }] | to json -i 2 | git hash-object --stdin -t blob -w
c8718e86195539c5ab6d85e9b019056f0d80587d

Adding some structured data to the object store, and getting a hash back.

We explicitly specified we want our JSON to be indented by two spaces (as opposed to minified). The indented form makes plain text diffs a bit nicer, and while two spaces is the default, specifying it explicitly helps ensure consistency if that ever changes.

git hash-object received the JSON via standard input, and gave us back a hash. We can check that the file has, indeed, been put in the object store by asking Git to retrieve the blob that is under that hash:

┃ > git cat-file blob c8718e86195539c5ab6d85e9b019056f0d80587d
[
  {
    "date": "2024-06-01T00:00:00Z",
    "revision": 1,
    "result": "fire"
  }
]
┃ > git cat-file blob c8718e86195539c5ab6d85e9b019056f0d80587d | from json
+---+----------------------+----------+--------+
| # |         date         | revision | result |
| 0 | 2024-06-01T00:00:00Z |        1 | fire   |
+---+----------------------+----------+--------+

We can get the JSON back, and we can turn it back into structured data.

Next, we need to construct a tree object. This can be done with git mktree.

git mktree expects a listing of files in a specific format. This format is the same one that git ls-tree outputs. Our case is very simple, since we only have a single file, so we can write it by hand:

┃ > "100644 blob c8718e86195539c5ab6d85e9b019056f0d80587d\tsome_computer.json\u{0}" | git mktree -z
7f29b01e020c2b9ae6c82bbf1d8a8513a8a75e80

Manually-written, single entry for a single file, passed to git mktree.

The more obvious things in that string: blob identifies the entry as pointing to a blob (recall that we could also point at a tree), c8718… is the hash we got earlier, then a tab character, followed by the filename—some_computer.json, which we came up with—and a terminating null byte (since we passed -z to git mktree, to signal we are using null bytes).

But there is also that 100644. In short, this means a regular file, with permissions set to 0644 (i.e., equivalent to -rw-r--r--). Curiously enough, modern Git can only store permissions of 0644 and 0755 (the latter being equivalent to -rwxr-xr-x), so for normal files we can generally go with 100644 without thinking about it too much. One place this mode field is described is the documentation on the index format.
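The mode is an octal number in the same layout as a Unix st_mode: the high bits give the file type and the low bits give the permissions. A small Python sketch using the standard stat module can pull the two apart. The describe_mode helper is hypothetical, and it only covers the handful of modes Git writes into trees.

```python
import stat


def describe_mode(mode_text: str) -> str:
    """Describe a Git tree entry mode such as "100644"."""
    mode = int(mode_text, 8)  # the mode is written in octal
    if mode == 0o160000:
        # Not a standard Unix file type; Git uses it for submodule entries.
        return "gitlink (submodule)"
    if stat.S_ISDIR(mode):
        return "directory (tree)"
    if stat.S_ISLNK(mode):
        return "symbolic link"
    if stat.S_ISREG(mode):
        perms = stat.S_IMODE(mode)  # strip the file type bits
        return f"regular file, permissions {perms:04o} ({stat.filemode(mode)})"
    return "unknown"


if __name__ == "__main__":
    for text in ("100644", "100755", "040000", "120000", "160000"):
        print(text, "->", describe_mode(text))
```

For a data branch that holds only JSON files, every entry would be 100644.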

git mktree also gave us a hash for the tree. This will be useful, since next we will be making a commit with git commit-tree:

┃ > git commit-tree -m 'add some_computer.json' 7f29b01e020c2b9ae6c82bbf1d8a8513a8a75e80
4ac389255b9c56833577fd114c696eed52ecc647

Creating a commit that points at the tree we made.

And now we have a commit! Its hash was printed by git commit-tree. The commit records our configured name and email address, as well as the current timestamp, so its hash will be different on every run, unlike the blob and tree hashes from the earlier examples.

The commit is eligible for being garbage collected, since it is just floating in the store, unconnected to anything. It might be useful to point a branch at it using git update-ref:

┃ > git update-ref -m 'create data branch manually' refs/heads/data 4ac389255b9c56833577fd114c696eed52ecc647
┃ > git branch -v
  data 4ac3892 add some_computer.json

Making a branch to point at our new commit.

git update-ref can be used to drop a reference anywhere in the .git directory. It so happens that Git keeps references that it sees as branches under refs/heads, so if we put a reference there, it will behave just as if we made a branch the regular way. We could use git update-ref to put a reference somewhere else under refs/. git-bug, for example, uses refs/bugs/ to keep references to bugs. These other references do not show up as branches, and various tools may or may not ignore them. However, nothing that is reachable from a reference under refs/ gets garbage collected. git gc will keep whatever it can reach from anything under refs/, even if it is a type of reference Git does not know about (like git-bug's bugs).
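The garbage collection rule can be pictured as a reachability walk. The Python below is a toy model, not how git gc is implemented: objects are a made-up dictionary from an ID to the IDs it references, refs map a name to an ID, and anything the walk never visits is a candidate for collection. Real Git also honours reflogs and a grace period, which this sketch ignores.

```python
def reachable(objects: dict[str, list[str]], refs: dict[str, str]) -> set[str]:
    """Return every object ID that can be reached from some reference."""
    seen: set[str] = set()
    stack = list(refs.values())  # start from whatever the refs point at
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        # Commits reference a tree and parents; trees reference blobs and trees.
        stack.extend(objects.get(oid, []))
    return seen


if __name__ == "__main__":
    # Hypothetical IDs standing in for real hashes.
    objects = {
        "commit1": ["tree1"],
        "tree1": ["blob1"],
        "blob1": [],
        "stray_blob": [],
    }
    before = reachable(objects, {})
    after = reachable(objects, {"refs/heads/data": "commit1"})
    print(sorted(set(objects) - before))  # everything is unreferenced
    print(sorted(set(objects) - after))   # only the stray blob remains
```

The name of the reference plays no part in the walk, which is why refs/bugs/ protects objects just as well as refs/heads/ does.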

Note, also, that we provided a message to update-ref via -m. This is not a commit message—it is a log message, stored under .git/logs/, which will show up if we git reflog our new branch. Git will normally produce reflog entries that tell us why a reference was updated, like a commit, a rebase, or a reset. We can provide our own message to note that the reference was updated by us, manually. Reference logs are not synced to remote.

Updating the data

A useful thing to be able to do now is to update the existing data. We can use Git's revision:file syntax to find the current version of the JSON file, deserialize it into Nushell, and then manipulate it as needed:

┃ > let new_data = (git cat-file blob data:some_computer.json | 
┇     from json |
┇     append { 
┇       date: "2024-06-01T00:45:00Z", 
┇       revision: 5, 
┇       result: "minor explosion"
┇     })
┃ > $new_data
+---+----------------------+----------+-----------------+
| # |         date         | revision |     result      |
| 0 | 2024-06-01T00:00:00Z |        1 | fire            |
| 1 | 2024-06-01T00:45:00Z |        5 | minor explosion |
+---+----------------------+----------+-----------------+
┃ > let new_object = ($new_data | to json -i 2 | git hash-object -t blob --stdin -w)
┃ > $new_object 
33277d56cc235d7ec79a4f83bc9fba8dfe6226af

Retrieving previously committed list, appending an entry to it, turning it back into JSON, and putting that new version in the object store.
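For comparison, the same read, append, and re-serialise step is sketched below in Python with the standard json module. It works on a string, so fetching the old blob and writing the new one are left out, and append_record is a hypothetical helper name. Keeping the indentation fixed at two spaces means the new text differs from the old one only where the data changed.

```python
import json


def append_record(json_text: str, record: dict) -> str:
    """Parse a JSON list, append one record, and serialise it again."""
    records = json.loads(json_text)
    if not isinstance(records, list):
        raise ValueError("expected the file to contain a JSON list")
    records.append(record)
    # Same two-space indentation as the stored file, for stable diffs.
    return json.dumps(records, indent=2)


if __name__ == "__main__":
    old = json.dumps(
        [{"date": "2024-06-01T00:00:00Z", "revision": 1, "result": "fire"}],
        indent=2,
    )
    new = append_record(
        old,
        {"date": "2024-06-01T00:45:00Z", "revision": 5, "result": "minor explosion"},
    )
    print(new)
```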

Let us assume, however, that we have more than one file tracked in the branch now. This means we cannot simply write a new tree with only our new file, as that would discard the other files.

┃ > git ls-tree data
100644 blob d624c7294af935e8a00acfc03126c1789b5f5df5    another_computer.json
100644 blob c8718e86195539c5ab6d85e9b019056f0d80587d    some_computer.json

git ls-tree shows us that we have two files now.

Nushell's ability to modify structured data can come in handy here as well. The tree objects are, essentially, tables, and Nushell is good at modifying tables. So, what we can do is take the tree from the current commit, parse it into Nushell structured data, and modify it with our new file.

┃ > let prev_tree = (git ls-tree -z data |
┇     split row "\u{0}" |
┇     compact --empty |
┇     each {|it| $it |
┇       parse --regex '(?<objectmode>\d+) (?<objecttype>\w+) (?<objectname>[0-9a-f]+)\t(?<path>[^\x00/]+)'} |
┇     flatten)
┃ > $prev_tree 
+---+------------+------------+------------------------------------------+-----------------------+
| # | objectmode | objecttype |                objectname                |         path          |
| 0 | 100644     | blob       | d624c7294af935e8a00acfc03126c1789b5f5df5 | another_computer.json |
| 1 | 100644     | blob       | c8718e86195539c5ab6d85e9b019056f0d80587d | some_computer.json    |
+---+------------+------------+------------------------------------------+-----------------------+
┃ > let new_tree = (
┇     $prev_tree |
┇     where {|it| $it.path != 'some_computer.json'} |
┇     append { 
┇       objectmode: "100644",
┇       objecttype: "blob",
┇       objectname: $new_object,
┇       path: 'some_computer.json'
┇     })
┃ > $new_tree 
+---+------------+------------+------------------------------------------+-----------------------+
| # | objectmode | objecttype |                objectname                |         path          |
| 0 | 100644     | blob       | d624c7294af935e8a00acfc03126c1789b5f5df5 | another_computer.json |
| 1 | 100644     | blob       | 33277d56cc235d7ec79a4f83bc9fba8dfe6226af | some_computer.json    |
+---+------------+------------+------------------------------------------+-----------------------+

Parsing the output of git ls-tree into structured data, and then modifying it.

The column names here are the terminology that Git uses, which can be a bit confusing: objectname means hash, and path means name.

We have updated our structured data by dropping the old version of the file from the list and adding the new version back in. Now we can turn that structured data back into a string that git mktree will understand, and do the rest of our commit process as before.

┃ > $new_tree |
┇     each {|it| $"($it.objectmode) ($it.objecttype) ($it.objectname)\t($it.path)\u{0}"} |
┇     str join |
┇     git mktree -z
e1e6b041d361bcc94bed76cbd987c06c1d93ef49
┃ > git commit-tree -m 'update some_computer.json' -p data e1e6b041d361bcc94bed76cbd987c06c1d93ef49
05024814c4708130bb1071ba9e6a4e854ca70090
┃ > git update-ref -m 'update data' refs/heads/data 05024814c4708130bb1071ba9e6a4e854ca70090
┃ > git log --pretty=oneline data
05024814c4708130bb1071ba9e6a4e854ca70090 (data) update some_computer.json
4b38d584d066a952b298612ff38f5590f19b57ce add another_computer.json
4ac389255b9c56833577fd114c696eed52ecc647 add some_computer.json

Putting our new tree in the object store, then using it to make a commit, then updating the branch to point to that commit.

Note that, this time, git commit-tree was provided with a parent through -p data. Git will dereference the data branch here, and record the commit's parent as whatever data is pointing to at the moment. This way, our branch will have a commit history, instead of just one commit.
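The tree-editing part of this section (parse the null-separated git ls-tree -z output, swap one entry, serialise it again for git mktree -z) does not depend on Nushell. Below is a sketch of the same round trip in Python, working purely on strings so that no Git calls are involved. The function names are made up, and the field names follow Git's terminology as in the table above.

```python
import re

# One record of `git ls-tree -z` output: "<mode> <type> <hash>\t<path>".
ENTRY = re.compile(
    r"(?P<objectmode>\d+) (?P<objecttype>\w+) (?P<objectname>[0-9a-f]+)\t(?P<path>[^\x00]+)"
)


def parse_ls_tree(raw: str) -> list[dict[str, str]]:
    """Split null-terminated records and parse each into a dict."""
    entries = []
    for record in raw.split("\x00"):
        if not record:
            continue  # skip the empty piece after the final null byte
        match = ENTRY.fullmatch(record)
        if match is None:
            raise ValueError(f"unexpected ls-tree record: {record!r}")
        entries.append(match.groupdict())
    return entries


def replace_entry(entries, path, objectname, objectmode="100644", objecttype="blob"):
    """Drop any entry with this path, then append the new version."""
    kept = [entry for entry in entries if entry["path"] != path]
    kept.append({"objectmode": objectmode, "objecttype": objecttype,
                 "objectname": objectname, "path": path})
    return kept


def format_for_mktree(entries) -> str:
    """Serialise entries back into the null-terminated form for mktree -z."""
    return "".join(
        f"{e['objectmode']} {e['objecttype']} {e['objectname']}\t{e['path']}\x00"
        for e in entries
    )
```

Feeding the result of format_for_mktree to git mktree -z would then produce the new tree, as in the Nushell listing.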

Putting it together

With this approach, we now have a way of storing arbitrary Nushell structured data in a Git branch, with the branch being maintained without being explicitly checked out in Git.

I wrote a Nushell module called wugdb for this purpose. I do not, however, encourage its serious use in production. While this sort of use of Git is viable, a script that puts JSON files into Git branches is not the most robust way of doing a database. Nevertheless, Git is a tool that can be used for unconventional and interesting stuff of this sort, and there are examples of more mature projects that do so, like the aforementioned git-bug and git-annex. Git also makes lower-level tooling available right from its command line interface, so it is possible to experiment with these sorts of things from a shell, without writing more elaborate code.
