Keeping things in Perkeep

Perkeep (previously called Camlistore) introduces itself as "a set of open source formats, protocols, and software for modeling, storing, searching, sharing and synchronizing data in the post-PC era". Less vaguely, the purpose of Perkeep is chiefly to archive an individual's data, which can include both traditional files and data that is not exactly files, like online posts.

The Perkeep daemon comes with a bunch of tools and importers, but it does also expose HTTP APIs which can be used to interact with the storage system from outside. Since I use Mastodon, I wanted to stick the contents of a Mastodon user data dump into Perkeep, using those APIs.

This is an overview of how one would go about it. If you would like the concrete tool itself, check out ap-perkeep-uploader.

Keeping permanently

In order to figure out how to stick our stuff in Perkeep, we need to figure out how Perkeep actually works. It starts with blobs.

A colorful, cartoonish bird
This is Keepy, the Perkeep parakeet. It is not an oddly-colored chicken, as I initially thought.

Blobs

On the lowest level, Perkeep is a content-addressable storage system. There are several storage backends for the content-addressed blobs of immutable data, including several ways of storing them on a local disk (packed together or in separate files), as well as backends that upload blobs to cloud storage. Perkeep receives your data blob, hashes it, and then stores it under that hash (called a blobref). You can subsequently get it back by providing the hash. All other functionality is built on top of this base.
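To make the idea concrete, here is a minimal sketch of content-addressed storage in Python. It is a toy, not Perkeep code: ToyBlobStore and its methods are made-up names, the "backend" is just a dictionary, and the choice of SHA-224 simply matches the sha224- blobrefs that appear in this post.

```python
import hashlib


def blobref(data: bytes) -> str:
    """Return a blobref-style name: hash name, a dash, then the hex digest."""
    return "sha224-" + hashlib.sha224(data).hexdigest()


class ToyBlobStore:
    """In-memory stand-in for a storage backend (hypothetical, not Perkeep code)."""

    def __init__(self):
        self._blobs = {}

    def put(self, data: bytes) -> str:
        ref = blobref(data)
        # Identical content always hashes to the same ref,
        # so storing the same bytes twice changes nothing.
        self._blobs.setdefault(ref, data)
        return ref

    def get(self, ref: str) -> bytes:
        return self._blobs[ref]


store = ToyBlobStore()
ref = store.put(b"Hello, world!")
print(ref)
print(store.get(ref))
```

Because the address is derived from the content, a blob can never be edited in place: changing a single byte produces a different blobref, and therefore a different blob.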

Special blobs

Metadata for all those blobs floating around in the blobstore is provided in the shape of so-called schema blobs—blobs which contain plain text JSON objects with some specific fields. The Perkeep daemon comes with an indexer, which goes through all the blobs that go into the store, figures out which ones are relevant, and saves the data in its database. The index database is redundant with the data stored in blobs and can be recreated at any time.

Using that index, Perkeep can provide a higher-level interface to the data put in it. Files, for example, are often stored over several blobs, and a file metadata blob tells us what blobs we need to fetch, and in what order to concatenate them together to get the original file back. The indexer will store that metadata, and if we ask it, it will fetch the file and glue it back together.
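The gluing step can be sketched roughly as follows. This is a simplified illustration, not how the Perkeep daemon does it: the blob store is a plain dictionary, and the field names parts, blobRef, size and fileName are assumptions about the file schema that should be checked against the Perkeep schema documentation. Real file schema blobs may also be more involved (nested parts, for instance), which this sketch ignores.

```python
import hashlib
import json


def blobref(data: bytes) -> str:
    return "sha224-" + hashlib.sha224(data).hexdigest()


def put(store: dict, data: bytes) -> str:
    """Store a blob under its own hash and return the blobref."""
    ref = blobref(data)
    store[ref] = data
    return ref


def read_file(store: dict, file_ref: str) -> bytes:
    """Fetch a file schema blob, then concatenate its parts in the listed order."""
    schema = json.loads(store[file_ref])
    if schema.get("camliType") != "file":
        raise ValueError("not a file schema blob")
    chunks = []
    for part in schema["parts"]:
        chunk = store[part["blobRef"]]
        if len(chunk) != part["size"]:
            raise ValueError("part size does not match the schema")
        chunks.append(chunk)
    return b"".join(chunks)


# Split a tiny "file" into two data blobs, then describe it with a schema blob.
store = {}
pieces = [b"Hello, ", b"world!"]
schema = {
    "camliVersion": 1,
    "camliType": "file",
    "fileName": "hello.txt",
    "parts": [{"blobRef": put(store, p), "size": len(p)} for p in pieces],
}
file_ref = put(store, json.dumps(schema).encode("utf-8"))

print(read_file(store, file_ref))
```

Note that the schema blob is itself just another blob in the store, addressed by its own hash; only its contents make it special.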

Structured data is not limited to file metadata only, though—Perkeep can store arbitrary key-value pairs through permanodes. Permanodes and claims are Perkeep's answer to the problem of storing mutable data in an append-only data store. A permanode is essentially an anchor—its chief purpose is to provide an address in the form of its hash.

Claims, on the other hand, are basically transactions: they are schema blobs which reference a permanode and say things like "set the field foo to value bar" or "delete field foo". The Perkeep indexer can then collect all the claims which reference a particular permanode, replay them in sequence, and figure out what attributes (key-value pairs) the permanode has. Of course, the permanode blob itself has not been changed—the indexer just created a virtual view of the permanode and its attributes.
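The replay can be pictured with a short sketch. It assumes claims have already been parsed into dictionaries with the fields claimDate, claimType, attribute and value, and it handles only the two claim types mentioned in this post. The function name replay_claims is made up, and the real indexer does considerably more (signature checks, for one).

```python
def replay_claims(claims: list) -> dict:
    """Fold the claims for one permanode into a dict of attribute -> value."""
    attrs = {}
    # Timestamps in the same UTC ISO 8601 format sort correctly as plain strings.
    for claim in sorted(claims, key=lambda c: c["claimDate"]):
        kind = claim["claimType"]
        name = claim["attribute"]
        if kind == "set-attribute":
            attrs[name] = claim["value"]
        elif kind == "del-attribute":
            # Simplification: drop the attribute entirely.
            attrs.pop(name, None)
    return attrs


# Claims arrive in no particular order; claimDate decides the replay order.
claims = [
    {"claimDate": "2018-05-15T10:20:40Z", "claimType": "set-attribute",
     "attribute": "content", "value": "Hello, world!"},
    {"claimDate": "2018-05-15T10:20:30Z", "claimType": "set-attribute",
     "attribute": "activityId", "value": "https://example.com/Alice/activities/1"},
    {"claimDate": "2018-05-15T10:20:50Z", "claimType": "del-attribute",
     "attribute": "content"},
]

print(replay_claims(claims))
```

The input list is never modified; the attributes are a view computed from the claims, which mirrors how the underlying blobs stay untouched.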

Note that permanodes and claims are cryptographically signed. There is an API endpoint for doing the signing, but signing can also be done locally, by following the algorithm.

Scheming

Internal Perkeep importers generally work by creating a permanode for each item imported, and using the permanode's attributes to store the item's data. In practical terms, this means that something like the Twitter importer creates a permanode for every tweet, and the Atom/RSS importer creates a permanode for every feed item.

At this point, we might want to glance at our ActivityStreams data, to figure out what we want to grab out of it. Here is a very simple example of one ActivityStreams activity:

{
    "id": "https://example.com/Alice/activities/1",
    "type": "Create",
    "actor": "https://example.com/Alice",
    "published": "2010-01-01T00:10:00Z",
    "to": [
        "https://www.w3.org/ns/activitystreams#Public"
    ],
    "object": {
        "id": "https://example.com/Alice/notes/1",
        "type": "Note",
        "content": "Hello, world!",
        "to": [
            "https://www.w3.org/ns/activitystreams#Public"
        ]
    }
}

Without getting too deep into how ActivityStreams/ActivityPub works (a question that is otherwise interesting, if you wanted to build your own fediverse thing), we can see that we have an activity, as well as the object of that activity.

So, to start, we make a permanode. This takes no data from our dump, just an arbitrary random string, and some signing data (faked here, longer in practice):

{
    "camliVersion": 1, 
    "camliType": "permanode", 
    "random": "IAmReallyRandom",
    "camliSigner": "sha224-aaabbbcccddd"
,"camliSig": "bm90aGluZyB0byBzZWUgaGVyZQo="}

When we push the above permanode, it should be stored under the blobref sha224-a45054e92836a6f646cdbd31dd178af6578db6fed71ed65758081d97. We can then push some claim blobs to create attributes on that permanode:

{
    "camliVersion": 1,
    "camliType": "claim",
    "camliSigner": "sha224-aaabbbcccddd",
    "claimDate": "2018-05-15T10:20:30Z",
    "permaNode": "sha224-a45054e92836a6f646cdbd31dd178af6578db6fed71ed65758081d97",       
    "claimType": "set-attribute",
    "attribute": "activityId",
    "value": "https://example.com/Alice/activities/1"
,"camliSig": "8J+kt/CfpLfwn6S3Cg=="}

As you can see, we are setting the attribute called activityId to the value https://example.com/Alice/activities/1. Note that we had to provide a date, since the indexer will replay the claims sequentially, and so needs to know which comes before which. We also have the permanode blobref, an attribute, an action (set-attribute here, but we could also do a del-attribute, for example), and a value.

We now have two new blobs in the blobstore: one for the permanode, and one for the claim. The permanode blob remained the same as it was when we first put it there. However, the indexer will now see that the permanode sha224-a45054e92836a6f646cdbd31dd178af6578db6fed71ed65758081d97 has an attribute activityId, and make it available for searching by that attribute. This ID is a useful thing to search by, since per ActivityPub it will be a globally unique identifier for this particular activity.

We will need a claim for every attribute we want to set:

{
    "camliVersion": 1,
    "camliType": "claim",
    "camliSigner": "sha224-aaabbbcccddd",
    "claimDate": "2018-05-15T10:20:40Z",
    "permaNode": "sha224-a45054e92836a6f646cdbd31dd178af6578db6fed71ed65758081d97",       
    "claimType": "set-attribute",
    "attribute": "content",
    "value": "Hello, world!"
,"camliSig": "c3RpbGwgbm90aGluZyB0byBzZWUK"}

...and so on, until we've slurped everything we wanted to slurp.
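Generating those claims by hand gets old quickly, so here is a rough sketch of how the step could be automated. Everything in it is illustrative: the function names are made up, the attribute names other than activityId and content are hypothetical, and spacing the claims one second apart is just one way of keeping their order unambiguous. The resulting claims are unsigned, so they would still have to be signed before being uploaded.

```python
from datetime import datetime, timedelta, timezone


def activity_attributes(activity: dict) -> dict:
    """Pick a flat set of attributes out of one activity (hypothetical schema)."""
    obj = activity.get("object")
    if isinstance(obj, str):
        obj = {"id": obj}  # the object may be given only as a URL
    elif not isinstance(obj, dict):
        obj = {}
    candidates = {
        "activityId": activity.get("id"),
        "activityType": activity.get("type"),
        "actor": activity.get("actor"),
        "published": activity.get("published"),
        "objectId": obj.get("id"),
        "content": obj.get("content"),
    }
    # Skip anything missing from the dump instead of storing empty values.
    return {name: value for name, value in candidates.items() if value}


def unsigned_claims(permanode_ref, signer_ref, attrs, start):
    """Build one unsigned set-attribute claim per attribute.

    start must be a timezone-aware datetime.
    """
    claims = []
    for offset, (name, value) in enumerate(attrs.items()):
        when = (start + timedelta(seconds=offset)).astimezone(timezone.utc)
        claims.append({
            "camliVersion": 1,
            "camliType": "claim",
            "camliSigner": signer_ref,
            "claimDate": when.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "permaNode": permanode_ref,
            "claimType": "set-attribute",
            "attribute": name,
            "value": value,
        })
    return claims


activity = {
    "id": "https://example.com/Alice/activities/1",
    "type": "Create",
    "actor": "https://example.com/Alice",
    "published": "2010-01-01T00:10:00Z",
    "object": {
        "id": "https://example.com/Alice/notes/1",
        "type": "Note",
        "content": "Hello, world!",
    },
}

claims = unsigned_claims(
    "sha224-a45054e92836a6f646cdbd31dd178af6578db6fed71ed65758081d97",
    "sha224-aaabbbcccddd",
    activity_attributes(activity),
    datetime(2018, 5, 15, 10, 20, 30, tzinfo=timezone.utc),
)
for claim in claims:
    print(claim["claimDate"], claim["attribute"], "=", claim["value"])
```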

For more complicated data, we can have attributes with values that are blobrefs of further data—either more permanodes, or things like file nodes. This allows creating a tree of permanodes, but also allows for representing things like image attachments by referencing a file schema blob.

Caveats

Data modeling is tricky business

Of course, when saving your data for all eternity, it is somewhat important that you pick a suitable schema. With the ActivityStreams example, you could decide that you only want minimal metadata in the permanode, and that you are happy saving the rest as JSON in a separate file (that the indexer will not understand). On the other end of the spectrum, you may want to serialize the entire thing into a set of Perkeep permanodes, possibly slowing things down and increasing on-disk size, but at the same time making everything very friendly to the Perkeep indexer.

This becomes more evident when we deal with things like ActivityStreams with their JSON-LD, which is meant to facilitate interlinking of structured data. Of course, Perkeep is generally oriented more towards storing personal archives than towards exchanging and linking data with others, so the schema has different requirements.

Showing data

Perkeep web UI, showing a bunch of unlabeled folders.
The web UI has no idea what these are.

While we can shove all we want into Perkeep in the ways described above, the highest-level parts of Perkeep will likely not understand our data. We can use standardized attributes for things like geolocation and time, which the indexer will understand. We can also issue specific search queries, based on fields we know our data has. However, the Perkeep web UI only knows about the schema which the Perkeep importers use, and is not easily extensible.

This is probably the major incentive for actually hacking on Perkeep proper, instead of creating solutions which interact with it via the API. One has to dive into Perkeep's guts anyway, if one wants their data presented fancily.

Forever is a long time

Personally, I have not been using Perkeep for a long time. This is unlike some of its developers, who have been using it for years. Nevertheless, it cannot hurt to use something designed to slurp and archive things that usually remain otherwise unslurped and unarchived. And, should Perkeep turn out to be terrible in some way down the line, the blob storage model is sufficiently simple that migrating data out of it does not seem like too daunting a problem.