Git as a database, kind of
I run NixOS on multiple machines. I store configuration for all of them in a Git monorepo, and deploy to them via some bespoke scripts (despite there being multiple mature deployment tools out there for the purpose). One problem I have with this setup is that I do not keep close track of what revision my hosts are on. Not all of the machines are on at all times, so I cannot always just SSH into one to check, and a machine that has not been turned on in a while may be more out of date than the others.
One way to solve this would be to have the update scripts record every update action in some out-of-band log. Another way would be to have Git branches which track the revision that each machine is on, similar to how the channel branches in Nixpkgs keep track of the channel state.
A less reasonable way would be to use log files that are tracked by Git. A script could easily modify a JSON file kept in the main branch of the repo, but keeping it there would lead to a lot of noise in the commit log. A better solution would be a separate, detached branch. There actually is precedent for keeping data in a separate Git branch, with the likes of git-bug or git-annex!
Keeping a whole working tree checked out for the purpose of tracking the JSON branch would get annoying, however. A neater solution would be to access Git at a lower level, generating commits without checking anything out. While there are libraries for a variety of programming languages that can be used for this sort of lower-level access, it can also be done using various git commands, and so it is possible to do from a shell script. So, let's do that, using Nushell to make things fun.
Git internals
To accomplish our task, we will first need a quick recap (or perhaps a rapid introduction) of Git internals.
A central piece of how Git works is the content-addressable store. The store holds objects, addressed by their SHA-1 hashes (or SHA-256, when using that particular experimental feature). The object store exists inside the .git/ directory of a repository, but the internal implementation details are not particularly relevant, as Git provides abstracted access to the object store, even for lower level use.
There are several types of objects that can be stored. Relevant here are blobs, trees, and commits. A blob is the simplest—it is the contents of a file that has been put into Git. A blob object does not contain the file's metadata, like the filename or modification date, just the contents.
A tree is a directory tree. It is, essentially, a list of paths (names), each mapping to a hash address of either a blob, or another tree. Having a name point to the hash of a blob allows us to represent a file; having a name point to another tree is the way to represent a subdirectory. It is possible for multiple different trees in the store to point to the same file (blob) or subdirectory (tree).
Lastly, a commit is, well, a Git commit. The object contains some metadata, like the author and committer (which can be two different people in Git), date, and the commit message. The object also contains the hash of a tree; the tree records how the tracked directory looked at commit time.
Additionally, the commit object can also contain hashes of its parent commits. The root commits (often there is just one) in a commit history will have no parents, an ordinary commit will have one parent, and commits representing merges have two or more parents. All together, the objects form a graph: commits point to a tree and to their parent commits, and trees point to blobs and to other trees.
So, to insert a commit we need to first put some blobs in the store. Then, we can use the hashes of those blobs to write a tree to the store, which will represent the state we want to commit. Then, we need to write a commit that points to that tree, possibly indicating some other commit as its parent.
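The relationships between these objects can be poked at with git cat-file. What follows is only a rough sketch in plain POSIX shell (the Git commands are the same when run from Nushell), assuming a throwaway repository with a placeholder identity and a made-up greeting.txt file:

```shell
set -eu
repo=$(mktemp -d)                          # scratch repository, so no real repo is touched
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity for this sketch
git config user.email "user@example.com"

echo "hello" > greeting.txt
git add greeting.txt
git commit -q -m "Add greeting"

commit=$(git rev-parse HEAD)               # the commit object
tree=$(git rev-parse 'HEAD^{tree}')        # the tree that commit points to
blob=$(git rev-parse HEAD:greeting.txt)    # the blob that tree entry points to

git cat-file -t "$commit"   # prints: commit
git cat-file -p "$commit"   # tree hash, author, committer, message (no parent line: root commit)
git cat-file -p "$tree"     # one entry: mode, type, blob hash, name
git cat-file -p "$blob"     # prints: hello
```

Note that the filename only appears in the tree listing, not in the blob.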
First commit
For a simple start, let's make a new branch with a single commit, that contains a single JSON file. Assuming we are starting out with some Nushell structured data, the first step is to turn it into JSON. Then, we can add it to the repo's object store with git hash-object. By default, git hash-object only hashes an object; if we give it -w, it will also actually write the object to the store.
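As a minimal sketch of that step in plain POSIX shell: the JSON here is written out by hand with two-space indentation, standing in for what Nushell's to json --indent 2 would produce from structured data. The hostname and revision fields are made up for illustration, and the scratch repository is an assumption so that nothing real is touched.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q

# Hypothetical record, already serialized as two-space-indented JSON.
json='{
  "hostname": "some_computer",
  "revision": "0123abc"
}'

# Without -w this would only print the hash; with -w the blob is also stored.
blob=$(printf '%s\n' "$json" | git hash-object -w --stdin)
echo "$blob"
```

The same input always produces the same hash, so running this twice does not create a second object.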
Serializing with Nushell's to json --indent 2 asks explicitly for JSON indented by two spaces (as opposed to minified). The indented form makes plain text diffs a bit nicer, and while two spaces is the default, specifying it explicitly helps ensure consistency if that ever changes.
git hash-object received the JSON via standard input, and gave us back a hash. We can check that the file has, indeed, been put in the object store by asking Git to retrieve the blob that is under that hash:
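One way to do that check is git cat-file, sketched here in POSIX shell under the same assumptions as before (a scratch repository and a made-up JSON payload):

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)

git cat-file -t "$blob"   # prints: blob
git cat-file -s "$blob"   # size of the stored contents, in bytes
git cat-file -p "$blob"   # prints the JSON exactly as it was written
```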
Next, we need to construct a tree object. This can be done with git mktree. git mktree expects a listing of files in a specific format. This format is the same one that git ls-tree outputs. Our case is very simple, since we only have a single file, so we can write it by hand:
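A hand-written listing for a single file might look like the following POSIX shell sketch. The scratch repository and the JSON payload are assumptions; the entry format (mode, type, hash, tab, name, null byte) is the one git ls-tree -z produces.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)

# One entry: <mode> SPACE <type> SPACE <hash> TAB <name> NUL
tree=$(printf '100644 blob %s\tsome_computer.json\0' "$blob" | git mktree -z)
echo "$tree"

git ls-tree "$tree"   # lists the single entry we just wrote
```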
The more obvious things in that string: blob identifies the entry as pointing to a blob (recall that we could also point at a tree), c8718… is the hash we got earlier, then a tab character, followed by the filename (some_computer.json, which we came up with) and a terminating null byte (since we passed -z to git mktree, to signal we are using null bytes).
But, there is also that 100644. In short, this means regular file, with permissions set to 0644 (i.e., equivalent to -rw-r--r--). Curiously enough, modern Git can only store permissions of 0644 and 0755 (the latter being the equivalent of -rwxr-xr-x), so for normal files we can generally go with 100644 without thinking about it too much. One place this mode field is described is the documentation on the index format.
git mktree also gave us a hash for the tree. This will be useful, since next we will be making a commit with git commit-tree:
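Roughly, and again as a POSIX shell sketch rather than a definitive recipe: the scratch repository, the placeholder identity (set locally so the sketch runs on a machine without Git configured), and the commit message are all assumptions.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\0' "$blob" | git mktree -z)

# No -p given, so this is a root commit with no parents.
commit=$(git commit-tree "$tree" -m "Record state of some_computer")
echo "$commit"

git cat-file -p "$commit"   # shows the tree hash, author, committer and message
```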
And now we have a commit! Its hash was printed by git commit-tree. The commit uses our configured name and email address, as well as the current timestamp, so its hash will be different on every run, unlike the blob and tree hashes from the earlier examples.
The commit is eligible for being garbage collected, since it is just floating in the store, unconnected to anything. It might be useful to point a branch at it using git update-ref:
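A sketch of that in POSIX shell, assuming a scratch repository, a placeholder identity, and a branch named data (the log message text is made up):

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\0' "$blob" | git mktree -z)
commit=$(git commit-tree "$tree" -m "Record state of some_computer")

# Create (or move) the branch "data" so that it points at our commit.
# -m sets the reflog message, not the commit message.
git update-ref -m "data: initial import" refs/heads/data "$commit"

git log --format='%h %s' data   # the branch now has one commit
git reflog show data            # includes the "data: initial import" message
```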
git update-ref can be used to drop a reference anywhere in the .git directory. It so happens that Git keeps references that it sees as branches under refs/heads, so if we put a reference there, it will behave just as if we made a branch the regular way. We could use git update-ref to put a reference somewhere else under refs/. git-bug, for example, uses refs/bugs/ to keep references to bugs. These other references do not show up as branches, and various tools may or may not ignore them. However, nothing reachable from a reference under refs/ gets garbage collected. git gc will keep anything it finds under refs/, even if it is a type of reference Git does not know about (like git-bug's bugs).
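To illustrate, here is a small POSIX shell sketch that parks a commit under a made-up refs/hostdata/ namespace (a hypothetical name, not something Git or the tooling in this post defines) and checks that it survives garbage collection. The scratch repository and the placeholder identity are assumptions too.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\0' "$blob" | git mktree -z)
commit=$(git commit-tree "$tree" -m "Record state of some_computer")

# Hypothetical namespace outside refs/heads, so this is not a branch.
git update-ref refs/hostdata/some_computer "$commit"

git branch --list                  # prints nothing: no branches exist
git for-each-ref refs/hostdata/    # but the reference is there

# Objects reachable from the reference are kept by garbage collection.
git gc -q --prune=now
git cat-file -e "$commit"          # exits successfully: the commit still exists
```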
Note, also, that we provided a message to update-ref via -m. This is not a commit message; it is a log message, stored under .git/logs/, which will show up if we run git reflog on our new branch. Git will normally produce reflog entries that tell us why a reference was updated, like a commit, a rebase, or a reset. We can provide our own message to note that the reference was updated by us, manually. Reference logs are local and are not synced to remotes.
Updating the data
A useful thing to be able to do now is to update the existing data. We can use Git's revision:file syntax to find the current version of the JSON file, deserialize it into Nushell, and then manipulate it as needed:
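The reading half of that can be sketched in POSIX shell as below. Plain shell has no structured data, so a sed substitution stands in for the deserialize-and-modify step that Nushell does properly; the scratch repository, identity, and revision values are all made up.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

blob=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\0' "$blob" | git mktree -z)
git update-ref refs/heads/data "$(git commit-tree "$tree" -m "Initial data")"

# <revision>:<path> names the file's blob as of that revision.
current=$(git show data:some_computer.json)
echo "$current"

# Stand-in for structured editing: swap in a new (made-up) revision.
updated=$(printf '%s\n' "$current" | sed 's/0123abc/4567def/')
echo "$updated"
```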
Let us assume, however, that we have more than one file tracked in the branch now. This means we cannot simply write a new tree with only our new file, as that would discard the other files.
Nushell's ability to modify structured data can come in handy here as well. The tree objects are, essentially, tables, and Nushell is good at modifying tables. So, what we can do is take the tree from the current commit, parse it into Nushell structured data, and modify it with our new file.
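In plain POSIX shell the same idea can be approximated by treating the git ls-tree output as tab-separated text. This is only a sketch: awk filtering stands in for Nushell's table operations, and the scratch repository, identity, file names, and revisions are assumptions. Git's own names for the four fields of each entry are objectmode, objecttype, objectname, and path.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# A branch tracking two hypothetical machines.
a=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
b=$(printf '{\n  "revision": "89abcde"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\n100644 blob %s\tother_computer.json\n' "$a" "$b" | git mktree)
git update-ref refs/heads/data "$(git commit-tree "$tree" -m "Initial data")"

# New version of one of the files.
new=$(printf '{\n  "revision": "4567def"\n}\n' | git hash-object -w --stdin)

# Keep every entry except the file being replaced, then add its new version.
listing=$(
  git ls-tree data | awk -F '\t' '$2 != "some_computer.json"'
  printf '100644 blob %s\tsome_computer.json\n' "$new"
)
printf '%s\n' "$listing"
```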
The column names here are the terminology that Git uses, which can be a bit confusing: objectname means hash, and path means name.
We have updated our structured data by dropping the old version of the file from the list and adding the new version back in. Now we can turn that structured data back into a string that git mktree will understand, and do the rest of our commit process as before.
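Put together, a whole update might look like this POSIX shell sketch. As before, the scratch repository, identity, file names, revisions, and messages are made up, and awk stands in for Nushell's table handling.

```shell
set -eu
repo=$(mktemp -d)   # scratch repository for the sketch
cd "$repo"
git init -q
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"

# Starting point: a "data" branch with two files in one commit.
a=$(printf '{\n  "revision": "0123abc"\n}\n' | git hash-object -w --stdin)
b=$(printf '{\n  "revision": "89abcde"\n}\n' | git hash-object -w --stdin)
tree=$(printf '100644 blob %s\tsome_computer.json\n100644 blob %s\tother_computer.json\n' "$a" "$b" | git mktree)
first=$(git commit-tree "$tree" -m "Initial data")
git update-ref refs/heads/data "$first"

# Write the new blob, then rebuild the tree with that one entry swapped.
new=$(printf '{\n  "revision": "4567def"\n}\n' | git hash-object -w --stdin)
tree2=$(
  {
    git ls-tree data | awk -F '\t' '$2 != "some_computer.json"'
    printf '100644 blob %s\tsome_computer.json\n' "$new"
  } | git mktree
)

# -p data records the current tip of the branch as the parent.
second=$(git commit-tree "$tree2" -p data -m "Update some_computer")
git update-ref -m "data: update some_computer" refs/heads/data "$second"

git log --format='%h %s' data   # two commits, newest first
```

Passing the previous commit as a third argument to git update-ref (here, "$first") would make the update fail if the branch had moved in the meantime, which might be worth having if more than one script writes to the branch.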
Note that, this time, git commit-tree was provided with a parent through -p data. Git will dereference the data branch here, and record the commit's parent as whatever data is pointing to at the moment. This way, our branch will have a commit history, instead of just one commit.
Putting it together
With this approach, we now have a way of storing arbitrary Nushell structured data in a Git branch, with the branch being maintained without being explicitly checked out in Git.
I wrote a Nushell module called wugdb for this purpose. I do not, however, encourage its serious use in production. While this sort of use of Git is viable, a script that puts JSON files into Git branches is not the most robust way of doing a database. Nevertheless, Git is a tool that can be used for unconventional and interesting stuff of this sort, and there are examples of more mature projects that do so, like the aforementioned git-bug and git-annex. Git also makes lower-level tooling available right from its command line interface, so it is possible to experiment with these sorts of things from a shell, without writing more elaborate code.
Further reading
- "Git Internals - Git Objects", from the Pro Git book (which is available for reading on the Git website) – an overview of Git internals, with some examples of manual object manipulation.
- "git-bug's reusable entity data model" – overview of git-bug's more complex model for storing data using Git objects and commit DAGs.