Part 1 - Git Architecture

programming p2p git rust decentralisation

To create a system designed for peer-to-peer git repositories, we must first understand the ideas involved in git and the structures it uses. The gritty details can be found in the git reference and in the excellent git book (in particular its pages on git internals), but for now we’re covering the basic concepts.

What is a git repository, really?

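The answer to this question is actually simple. A git repository is a combination of:

- A content-addressed object store, in which every file, directory, and commit is named by the hash of its contents.
- A set of refs - names such as branches and tags that point at objects in that store.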

All of this is in the hidden .git folder inside any git repository created with something like git clone or git init.

$ touch initial-file.txt
$ git add initial-file.txt 
$ ls
initial-file.txt
$ git status
On branch master

No commits yet

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)
	new file:   initial-file.txt

$ git commit -m "Added a new file"
[master (root-commit) 43e3d17] Added a new file
 1 file changed, 0 insertions(+), 0 deletions(-)
 create mode 100644 initial-file.txt
$ cd .git/
$ cd objects/
$ tree
.
├── 43
│   └── e3d17fb06dd803bc5d978608633d603fc6d762
├── 6a
│   └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
├── e6
│   └── 9de29bb2d1d6434b8b29ae775ad8c2e48c5391
├── info
└── pack

5 directories, 3 files

In this series of commands, we’ve taken an empty git repository, added an empty file, and committed it - then listed git’s internal objects. Each object is stored under a path derived from its hash, with the ordinary filesystem acting as a substrate for git’s own content-addressed one.

Git Objects

Git objects are stored in several forms on your filesystem. The simplest is as individual (“loose”) files in the .git/objects folder, but in larger repositories they can also be stored in packfiles, in .git/objects/pack, which provide highly efficient deduplication and compression on top of what is already implicit in a content-addressed filesystem.
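To make the naming scheme concrete, here is a minimal Python sketch of how a loose object gets its name. It assumes a classic SHA-1 repository (repositories created with the SHA-256 object format hash differently), and the function names are made up for this illustration rather than taken from git itself.

```python
import hashlib
import zlib


def encode_object(kind: str, body: bytes) -> bytes:
    """Prefix the body with git's object header: '<kind> <length>' plus a NUL byte."""
    return f"{kind} {len(body)}".encode("ascii") + b"\x00" + body


def object_id(kind: str, body: bytes) -> str:
    """The object's name is the SHA-1 of header + body (assumes a SHA-1 repository)."""
    return hashlib.sha1(encode_object(kind, body)).hexdigest()


def loose_path(oid: str) -> str:
    """Loose objects live in a directory named after the first two hex digits."""
    return f".git/objects/{oid[:2]}/{oid[2:]}"


def decode_loose(compressed: bytes):
    """Loose object files are zlib-compressed; split them back into (kind, body)."""
    raw = zlib.decompress(compressed)
    header, _, body = raw.partition(b"\x00")
    kind, _, size = header.partition(b" ")
    if int(size) != len(body):
        raise ValueError("object length does not match its header")
    return kind.decode("ascii"), body


empty_blob = object_id("blob", b"")
hello_blob = object_id("blob", b"hello world\n")
print(empty_blob, loose_path(empty_blob))
print(hello_blob, loose_path(hello_blob))
```

The empty blob here is the empty initial-file.txt from the example repository, which is why its name starts with e6 and the rest of the hash appears as a filename under the e6 directory in the listing.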

Continuing in the repository created just before this section, objects can be pretty-printed with git show by providing their hash:

$ git show 43e3d17fb06dd803bc5d978608633d603fc6d762
commit 43e3d17fb06dd803bc5d978608633d603fc6d762 (HEAD -> master)
Author: sapient_cogbag <sapient_cogbag@protonmail.com>
Date:   Sun Mar 13 18:06:24 2022 +0000

    Added a new file

diff --git a/initial-file.txt b/initial-file.txt
new file mode 100644
index 0000000..e69de29

By default, git show prints additional information about objects - for instance, commits have a diff printed out - but it also omits certain other information that is useful in understanding how git objects work.

If we instead do the following:

$ git show --pretty=raw 43e3d17fb06dd803bc5d978608633d603fc6d762  
commit 43e3d17fb06dd803bc5d978608633d603fc6d762
tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e
author sapient_cogbag <sapient_cogbag@protonmail.com> 1647194784 +0000
committer sapient_cogbag <sapient_cogbag@protonmail.com> 1647194784 +0000
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi4yoBMcYW5vbnltb3Vz
 QG5vLmVtYWlsAAoJELi9oRqnzGeemOQA/2Wu/S9TEbwhnYXXd9zntxlVinf8DwlH
 KmM+DBbKvXP/AQDW1rW0w+VTCrPeSVfGeKMXD++Ond4e1xCYV5Fouu3CAA==
 =js4q
 -----END PGP SIGNATURE-----

    Added a new file

diff --git a/initial-file.txt b/initial-file.txt
new file mode 100644
index 0000000..e69de29

We see an important aspect of how commits conceptually work in git: a commit is simply a reference to another git object (specifically, a tree object representing the root directory of the project) plus additional information like committer, author, message, and, if configured, a cryptographic signature. The reference is the line tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e, and that hash is one of the objects in the directory listing from earlier:

├── 6a
│   └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
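The raw commit format is simple enough to parse by hand. Below is a rough Python sketch, assuming the layout visible in the --pretty=raw output: "key value" header lines, continuation lines that start with a single space (as in gpgsig), a blank line, then the message. parse_commit and values are names invented for this example, and the signature body is a placeholder rather than a real signature.

```python
def parse_commit(raw: str):
    """Split a raw commit into ([(key, value), ...], message)."""
    head, _, message = raw.partition("\n\n")
    headers = []
    for line in head.split("\n"):
        if line.startswith(" ") and headers:
            # Continuation line: belongs to the previous header (e.g. gpgsig).
            key, value = headers[-1]
            headers[-1] = (key, value + "\n" + line[1:])
        else:
            key, _, value = line.partition(" ")
            headers.append((key, value))
    return headers, message


def values(headers, key):
    """All values for a header key; 'parent' may appear zero, one, or many times."""
    return [value for name, value in headers if name == key]


RAW = (
    "tree 4109c394ed1a5a582835d1c487190480ac8832af\n"
    "parent 43e3d17fb06dd803bc5d978608633d603fc6d762\n"
    "author sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000\n"
    "committer sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000\n"
    "gpgsig -----BEGIN PGP SIGNATURE-----\n"
    " \n"
    " (signature body elided)\n"
    " -----END PGP SIGNATURE-----\n"
    "\n"
    "Commit with parent!\n"
)

headers, message = parse_commit(RAW)
print(values(headers, "tree"), values(headers, "parent"), repr(message))
```

Headers are kept as a list of pairs rather than a dictionary because a merge commit carries several parent lines, and their order matters.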

Now, because this was the first commit in the repository, it lacks parent commits. Usually a commit has one parent (and sometimes more, in the case of merge commits), and each parent is recorded in the commit simply as an additional hash. For example, we can make a new commit and examine its object representation with git show[1]:

$ echo "hello world" > ../../initial-file.txt
$ cd ../..
$ git commit -am "Commit with parent!"
[master 3ddaa4c] Commit with parent!
 1 file changed, 1 insertion(+)
$ git show --pretty=raw 3ddaa4c
commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1
tree 4109c394ed1a5a582835d1c487190480ac8832af
parent 43e3d17fb06dd803bc5d978608633d603fc6d762
author sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000
committer sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000
gpgsig -----BEGIN PGP SIGNATURE-----
 
 iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi44yBMcYW5vbnltb3Vz
 QG5vLmVtYWlsAAoJELi9oRqnzGeeJ90A/RoZHpRRfDa2x8lN9ALWdZ5vX2QuxQjC
 43h57CuWBxwxAP47bW6f3K1VaSrqtxdc1iUw8zf/MN7FgkpVxY+PubAiBg==
 =frrB
 -----END PGP SIGNATURE-----

    Commit with parent!

diff --git a/initial-file.txt b/initial-file.txt
index e69de29..3b18e51 100644
--- a/initial-file.txt
+++ b/initial-file.txt
@@ -0,0 +1 @@
+hello world

We can see the parent commit’s git object name (its hash) - 43e3d17fb06dd803bc5d978608633d603fc6d762 - which indeed matches the previous commit. We can also see - and this is something very important to note about git - that it points to an entirely different tree object. Conceptually speaking, a commit references a complete copy of the entire directory tree (at least the parts included with git add) at the time git commit is called and, through its parent hashes, a complete copy of the directory tree for every ancestor commit, recursively. This is critical because it means that you only need the hash of a commit to retrieve the entire history of the repository up to that commit, including when fetching parts of a repository from a remote.
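As a small illustration of why one hash is enough, here is a Python sketch that walks a history from a single commit. The object store is faked as a dictionary mapping each commit to its list of parents, which is an assumption for the sake of the example; a real implementation would read and parse commit objects instead.

```python
def history(parents_of, tip):
    """Return every commit reachable from `tip` by following parent links."""
    seen = set()
    order = []
    stack = [tip]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue  # merges can reach the same ancestor along several paths
        seen.add(commit)
        order.append(commit)
        stack.extend(parents_of[commit])
    return order


# The two commits from the example repository.
example = {
    "3ddaa4ce063a3829732692dd8a46ad435c5471b1": ["43e3d17fb06dd803bc5d978608633d603fc6d762"],
    "43e3d17fb06dd803bc5d978608633d603fc6d762": [],
}
print(history(example, "3ddaa4ce063a3829732692dd8a46ad435c5471b1"))

# A hypothetical merge: "m" has two parents that share the root "r".
merged = {"m": ["a", "b"], "a": ["r"], "b": ["r"], "r": []}
print(sorted(history(merged, "m")))
```

A peer that is handed only the tip hash can run the same loop, asking for each object it does not yet have, and knows it is done when the walk finds nothing new.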

We should probably examine directory tree objects before we go further, though. For that we need a special command, git ls-tree, because git show only prints the names of a tree’s entries:

$ git ls-tree 4109c394ed1a5a582835d1c487190480ac8832af
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad	initial-file.txt

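And we can see that each entry in a tree object encodes:

- a file mode (100644, a regular non-executable file)
- the type of the object the entry points to (blob here), which git ls-tree prints alongside the mode
- the hash of that object (3b18e512dba79e4c8300dd08aeb37f8e728b8dad)
- the entry’s name (initial-file.txt)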
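For the curious, here is a sketch in Python of the binary form behind that ls-tree line, assuming a SHA-1 repository: each entry is the mode, a space, the name, a NUL byte, then the 20 raw bytes of the hash. The helper names are invented for this example, and the sorting is simplified (git orders directory names as though they ended in a slash, which this sketch does not handle).

```python
import hashlib


def encode_tree(entries):
    """Serialise (mode, name, hex_hash) entries into a tree object's body."""
    body = b""
    # Simplification: plain byte-wise sort by name, fine for flat file lists.
    for mode, name, oid in sorted(entries, key=lambda entry: entry[1].encode("utf-8")):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00" + bytes.fromhex(oid)
    return body


def decode_tree(body):
    """Inverse of encode_tree: recover the (mode, name, hex_hash) entries."""
    entries = []
    pos = 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        oid = body[nul + 1:nul + 21].hex()  # 20 raw bytes for SHA-1
        entries.append((mode, name, oid))
        pos = nul + 21
    return entries


def tree_id(body):
    """Trees are named like any other object: SHA-1 of 'tree <length>' + NUL + body."""
    header = f"tree {len(body)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + body).hexdigest()


entries = [("100644", "initial-file.txt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")]
body = encode_tree(entries)
print(decode_tree(body))
print(tree_id(body))
```

Because a tree’s name depends on the hashes of everything inside it, changing one file changes the hash of every directory above it, which is why the second commit points at a different root tree.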

Git Object Types

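Git objects come in four types - though we’ve only seen three of them. The four types are as follows:

- blob: the contents of a file, with no name or metadata attached.
- tree: a directory listing that maps names and modes to blobs and other trees.
- commit: a pointer to a root tree, plus parent commits, author, committer, and message.
- tag: an annotated tag, which names another object (usually a commit) and carries a tagger, a message, and optionally a signature. This is the one we haven’t seen.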

Git Refs

So far we’ve covered the content-addressable filesystem component of git, and how its objects are composed into commits. However, none of this explains how the common tools of git, like branches and tags, work, nor does it explain, for instance, how the git commit command knows which commits to use as parents when constructing a new commit object.

The answer, of course, is git refs. Git refs are, essentially, names associated with a particular git object - most importantly, names whose target can change. For example, in the little repository we made earlier, we can see the ref that defines the master branch, which simply holds the hash of the latest commit:

$ cat .git/refs/heads/master
3ddaa4ce063a3829732692dd8a46ad435c5471b1
$ git show 3ddaa4ce063a3829732692dd8a46ad435c5471b1
commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1 (HEAD -> master)
Author: sapient_cogbag <sapient_cogbag@protonmail.com>
Date:   Sun Mar 13 18:32:40 2022 +0000

    Commit with parent!

diff --git a/initial-file.txt b/initial-file.txt
index e69de29..3b18e51 100644
--- a/initial-file.txt
+++ b/initial-file.txt
@@ -0,0 +1 @@
+hello world

There are also symbolic references, which are refs that contain the path to another ref rather than an object hash. For instance, in the example repository, the special HEAD ref is symbolically linked to refs/heads/master:

$ cat .git/HEAD
ref: refs/heads/master

When you run git commit, it follows the chain of symbolic references from HEAD until it finds a ref that holds an actual git object hash, then uses that hash as the parent of the newly generated commit. Then, if HEAD is symbolic, the ref it links to is updated to the new commit hash. If HEAD instead holds an object hash directly - which is known as the detached HEAD state - HEAD itself is updated.
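The resolution step can be sketched in a few lines of Python. This is a simplification that only reads loose ref files under the git directory; it ignores packed-refs, locking, and alternative ref backends, and the function names and depth limit are choices made for this example, not git’s own.

```python
import os
import tempfile


def resolve_ref(git_dir, name="HEAD", max_depth=5):
    """Follow 'ref: <path>' links; return (final ref name, hash or None)."""
    for _ in range(max_depth):
        path = os.path.join(git_dir, name)
        if not os.path.isfile(path):
            return name, None  # e.g. a fresh repository with no commits yet
        with open(path, encoding="ascii") as handle:
            content = handle.read().strip()
        if not content.startswith("ref: "):
            return name, content  # a plain object hash
        name = content[len("ref: "):]
    raise ValueError("symbolic ref chain is too deep")


def record_commit(git_dir, new_hash):
    """Point the resolved ref at new_hash; return the hash that becomes the parent."""
    name, parent = resolve_ref(git_dir)
    path = os.path.join(git_dir, name)
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "w", encoding="ascii") as handle:
        handle.write(new_hash + "\n")
    return parent


with tempfile.TemporaryDirectory() as git_dir:
    with open(os.path.join(git_dir, "HEAD"), "w", encoding="ascii") as handle:
        handle.write("ref: refs/heads/master\n")
    first_parent = record_commit(git_dir, "43e3d17fb06dd803bc5d978608633d603fc6d762")
    second_parent = record_commit(git_dir, "3ddaa4ce063a3829732692dd8a46ad435c5471b1")
    final = resolve_ref(git_dir)

print(first_parent, second_parent, final)
```

In the detached HEAD case, resolve_ref returns the name HEAD itself, so record_commit overwrites HEAD rather than a branch file.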

Types of Refs

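There are 3 (main) types of ref in git:

- heads (refs/heads/*): branches, which move forward as new commits are made.
- tags (refs/tags/*): names for fixed points in history, which normally do not move.
- remotes (refs/remotes/<remote>/*): local records of where each remote’s branches were when last fetched.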

This means that when using a peer-to-peer repository system, the main difficulty is propagating updates to refs in a signed and well-ordered fashion, then syncing them across every peer holding a copy of a repository.

Furthermore, git also has various other arbitrary refs - git notes, stored under refs/notes/*, are one example - and syncing does essentially require that the relevant ones are pulled along. In practice, heads and tags are by far the most important, but a system should try to pull along any other refs as necessary. It’s a bit finicky, really.

git pull/push/fetch/clone

In essence, all these commands do is synchronise refs between the local repository and a remote as appropriate, then ensure that every object needed for those refs to point at something valid is present on the side of the connection being modified. For fetch, the refs being updated are the local refs/remotes/<remote>/* by default, although it also allows the mapping from remote refs to local refs to be specified manually (for specifics see man git-fetch or the online docs). pull does a fetch and then attempts to merge the relevant refs/remotes/<remote>/<branch> into the associated true local branch in refs/heads.
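The mapping from remote refs to local refs is described by a refspec. As a rough Python sketch, here is how the usual default for a remote named origin, +refs/heads/*:refs/remotes/origin/*, decides which local refs a fetch would write. It only handles a single * wildcard, the leading + (which permits non-fast-forward updates) is simply stripped, and the function names are made up for this example.

```python
def apply_refspec(refspec, remote_ref):
    """Map a remote ref name through '<src>:<dst>'; return None if it does not match."""
    spec = refspec[1:] if refspec.startswith("+") else refspec
    src, _, dst = spec.partition(":")
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, _, suffix = src.partition("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    middle = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", middle, 1)


def fetch_plan(refspec, remote_refs):
    """Which local refs a fetch would set, given the refs the remote advertises."""
    plan = {}
    for name, oid in remote_refs.items():
        local = apply_refspec(refspec, name)
        if local is not None:
            plan[local] = oid
    return plan


DEFAULT = "+refs/heads/*:refs/remotes/origin/*"
advertised = {
    "refs/heads/master": "3ddaa4ce063a3829732692dd8a46ad435c5471b1",
    "refs/tags/v1": "43e3d17fb06dd803bc5d978608633d603fc6d762",  # hypothetical tag
}
print(fetch_plan(DEFAULT, advertised))
```

After computing a plan like this, the remaining work of a fetch is transferring whichever objects reachable from the new hashes are not already in the local object store.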

Git Namespaces[2]

In git, there is a little-known feature that allows for namespacing refs. For most casual users of git, this is fairly irrelevant, but for us it is an incredible opportunity. The feature works by essentially defining different collections of refs - like tags, heads, HEAD, remotes - at various namespace paths. For instance, one user could start a project with GIT_NAMESPACE=user1 set (or by passing --namespace=user1 to git). If a new user then wants to branch off, they can copy these refs to GIT_NAMESPACE=user2 and perform all their operations there, in the same repository, without affecting any of user1’s refs.

Furthermore, namespaces nest recursively, which means that user1 and user2 can each manage their own sub-namespaces as they see fit, and these can be accessed directly in operations like git pull/fetch, git clone, etc. via user1/<subnamespace path> or user2/<subnamespace path>. And, because all namespaces share the same git object collection, the cost of “branching” with namespaced refs is merely that of any extra commits or other git objects added to the repository; all common commits and files are deduplicated. In the context of a system with potentially thousands of users and hundreds of repository branches, this is a vast efficiency gain, and it also allows easy discovery of alternate versions of a repository (like GitHub’s concept of forking).
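Under the hood, the gitnamespaces manpage describes namespaced refs as living under refs/namespaces/<name>/, with each further level of nesting adding another refs/namespaces/ segment. A tiny Python sketch of that mapping (the function name is invented for this example):

```python
def namespaced_ref(namespace, ref):
    """Where a ref ends up in the shared repository for a given GIT_NAMESPACE."""
    prefix = "".join(
        f"refs/namespaces/{part}/" for part in namespace.split("/") if part
    )
    return prefix + ref


print(namespaced_ref("user1", "refs/heads/master"))
print(namespaced_ref("user2", "refs/heads/master"))
print(namespaced_ref("user1/experiments", "HEAD"))
```

Two users’ master branches therefore never collide: they are different ref names that may happen to point into the same, shared set of objects.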

All that would need to be done on the ref side for a fully p2p system, then, is to contain every version of a repository in a namespace prefix with some cryptographic public key as the first element in the path, and to only update refs when the update is signed with the key matching that prefix. This essentially induces repository-wide deduplication automatically.
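A very rough Python sketch of that acceptance rule follows. Everything specific in it is hypothetical: the ref layout with a hex-encoded public key as the first namespace element, the message format being signed, and the verify callable, which stands in for whatever signature scheme a real system would use (the Python standard library has no public-key signatures, so the demo plugs in a fake). It also does nothing about ordering or replay of updates, which the real design would have to address.

```python
def owner_key(ref_name):
    """The first namespace element of a ref, treated here as the owner's public key."""
    parts = ref_name.split("/")
    if len(parts) >= 4 and parts[0] == "refs" and parts[1] == "namespaces":
        return parts[2]
    return None


def accept_update(ref_name, new_hash, signature, verify):
    """Accept a ref update only if it is signed by the key that prefixes the ref.

    `verify(public_key, message, signature)` is a caller-supplied function
    returning True or False; it is an assumption of this sketch.
    """
    key = owner_key(ref_name)
    if key is None:
        return False  # refs outside any keyed namespace are never updated
    message = f"{ref_name} {new_hash}".encode("utf-8")
    return bool(verify(key, message, signature))


# Stand-in verifier for demonstration only: NOT real cryptography.
def fake_verify(public_key, message, signature):
    return signature == b"signed-by:" + public_key.encode("ascii")


ref = "refs/namespaces/abc123/refs/heads/master"  # 'abc123' is a placeholder key
new = "3ddaa4ce063a3829732692dd8a46ad435c5471b1"
print(accept_update(ref, new, b"signed-by:abc123", fake_verify))
print(accept_update(ref, new, b"signed-by:someone-else", fake_verify))
```

The point of the design is that the check needs no registry of who owns what: the ref name itself says which key is allowed to move it.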

Git namespaces also - at least theoretically - provide a useful way to associate other items with a git repository (like issues) as git objects within a repository, without conflicting with existing branches, tags, and other such things.

Summary

We went over the core concepts of git - objects, refs, and how the common commands synchronise them - as well as one obscure feature, namespaces, that readily enables per-user repository contents to be collected together in a single repository, among other things.

The next post in this series will probably be about git fetch vs git push on the githooks and efficiency front: why it should hopefully still be possible to use git fetch for parallel remote fetching in combination with the reference-transaction hook, so that github/gitlab/etc can be used as backup commit sources, as well as means of using git push for cleaner mechanics via multi-remote pushing.

Depending on what I feel like, it may also be about anything in the list below (or maybe something else):


  1. When using most git commands, including git show, it is usually possible to reference a git object via a shortened version of its hash, as well as by a named ref.

  2. The manpage for this is at man gitnamespaces (no space), and it is also available in the online git documentation.