To create a system designed for peer-to-peer git repositories, we must first understand the ideas involved in git and the structures it uses. The gritty details can be found in the reference and in the excellent book (specifically its page on git internals), but for now we’re covering the basic concepts.
What is a git repository, really?
The answer to this question is simple, actually. A git repository is a combination of:
- An efficiently-deduplicated content-addressable filesystem - that is, a filesystem where files and directories and other objects are retrieved by providing - and referred to by - a cryptographic hash of their contents.
- References (refs) to certain special objects in this filesystem, of various types.
- Configurations of the repository - remote repositories to fetch/pull/push from and to, things like username and email if locally specified, hooks for different actions like merging changes or pushing, and more stuff like this.
All of this is in the hidden .git folder inside any git repository created with something like git clone or git init:
    $ touch initial-file.txt
    $ git add initial-file.txt
    $ ls
    initial-file.txt
    $ git status
    On branch master

    No commits yet

    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
        new file:   initial-file.txt

    $ git commit -m "Added a new file"
    [master (root-commit) 43e3d17] Added a new file
     1 file changed, 0 insertions(+), 0 deletions(-)
     create mode 100644 initial-file.txt
    $ cd .git/
    $ cd objects/
    $ tree
    .
    ├── 43
    │   └── e3d17fb06dd803bc5d978608633d603fc6d762
    ├── 6a
    │   └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
    ├── e6
    │   └── 9de29bb2d1d6434b8b29ae775ad8c2e48c5391
    ├── info
    └── pack

    5 directories, 3 files
In this series of commands, we’ve taken an empty git repository, added an empty file, and committed it to the repository - then listed the internal git objects which are named via their hashes in the file system acting as a substrate for git’s own FS.
Git objects are stored in several forms on your filesystem, the simplest being as individual files in the .git/objects folder, but in larger repositories they can also be stored in packfiles, in .git/objects/pack, which provide highly efficient deduplication and compression on top of that already implicit in a content-addressed filesystem.
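The object names in the listing above can be reproduced without git. Here is a minimal sketch in Python, assuming the repository uses SHA-1 object names (the default); the helper names `git_object_id`, `loose_object_path` and `loose_object_bytes` are made up for illustration. Git hashes a short header (the object type and the content length in bytes), then a NUL byte, then the contents, and stores the zlib-compressed result under a path derived from the hash.

```python
import hashlib
import zlib


def git_object_id(obj_type: str, content: bytes) -> str:
    """SHA-1 over b"<type> <size>" + NUL + content, as used for object names."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return hashlib.sha1(header + content).hexdigest()


def loose_object_path(object_id: str) -> str:
    """First two hex digits name the directory, the remaining 38 name the file."""
    return f".git/objects/{object_id[:2]}/{object_id[2:]}"


def loose_object_bytes(obj_type: str, content: bytes) -> bytes:
    """Bytes of a loose object file: the same header + content, zlib-compressed."""
    header = f"{obj_type} {len(content)}".encode() + b"\x00"
    return zlib.compress(header + content)


empty_blob = git_object_id("blob", b"")  # the empty initial-file.txt
print(empty_blob)                        # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
print(loose_object_path(empty_blob))     # .git/objects/e6/9de29bb2d1d6434b8b29ae775ad8c2e48c5391
```

The printed name is the third object in the listing: the blob for the empty file. Because the name depends only on the contents, every empty file in every repository maps to this same object, which is where the deduplication comes from.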
Continuing in the repository created just before this section, objects can be pretty-printed with git show by providing their hash:
    $ git show 43e3d17fb06dd803bc5d978608633d603fc6d762
    commit 43e3d17fb06dd803bc5d978608633d603fc6d762 (HEAD -> master)
    Author: sapient_cogbag <firstname.lastname@example.org>
    Date:   Sun Mar 13 18:06:24 2022 +0000

        Added a new file

    diff --git a/initial-file.txt b/initial-file.txt
    new file mode 100644
    index 0000000..e69de29
By default, objects have additional information printed with git show - for instance, commits have a diff printed out - but they also lack certain other information, which is useful in understanding how git objects work.
If we instead do the following:
    $ git show --pretty=raw 43e3d17fb06dd803bc5d978608633d603fc6d762
    commit 43e3d17fb06dd803bc5d978608633d603fc6d762
    tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e
    author sapient_cogbag <email@example.com> 1647194784 +0000
    committer sapient_cogbag <firstname.lastname@example.org> 1647194784 +0000
    gpgsig -----BEGIN PGP SIGNATURE-----
     iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi4yoBMcYW5vbnltb3Vz
     QG5vLmVtYWlsAAoJELi9oRqnzGeemOQA/2Wu/S9TEbwhnYXXd9zntxlVinf8DwlH
     KmM+DBbKvXP/AQDW1rW0w+VTCrPeSVfGeKMXD++Ond4e1xCYV5Fouu3CAA==
     =js4q
     -----END PGP SIGNATURE-----

        Added a new file

    diff --git a/initial-file.txt b/initial-file.txt
    new file mode 100644
    index 0000000..e69de29
We see an important aspect of how commits conceptually work in git - namely, that they are simply a reference to another git object (in particular, a git object that is the filesystem root directory) with additional information like committer, author, message, and, if configured, a cryptographic signature. The reference itself is the line tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e. You’ll note that, if you jump up in the document, that is one of the objects seen in the directory listing:
    ├── 6a
    │   └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
Now, because this was the first commit in the repository, it lacks parent commits, but usually a commit will have one parent (and sometimes more, in the case of merge commits), and these are recorded in the commit simply as additional hashes. For example, making a new commit and examining its object representation:
    $ echo "hello world" > ../../initial-file.txt
    $ cd ../..
    $ git commit -am "Commit with parent!"
    [master 3ddaa4c] Commit with parent!
     1 file changed, 1 insertion(+)
    $ git show --pretty=raw 3ddaa4c
    commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1
    tree 4109c394ed1a5a582835d1c487190480ac8832af
    parent 43e3d17fb06dd803bc5d978608633d603fc6d762
    author sapient_cogbag <email@example.com> 1647196360 +0000
    committer sapient_cogbag <firstname.lastname@example.org> 1647196360 +0000
    gpgsig -----BEGIN PGP SIGNATURE-----
     iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi44yBMcYW5vbnltb3Vz
     QG5vLmVtYWlsAAoJELi9oRqnzGeeJ90A/RoZHpRRfDa2x8lN9ALWdZ5vX2QuxQjC
     43h57CuWBxwxAP47bW6f3K1VaSrqtxdc1iUw8zf/MN7FgkpVxY+PubAiBg==
     =frrB
     -----END PGP SIGNATURE-----

        Commit with parent!

    diff --git a/initial-file.txt b/initial-file.txt
    index e69de29..3b18e51 100644
    --- a/initial-file.txt
    +++ b/initial-file.txt
    @@ -0,0 +1 @@
    +hello world
We can see the parent commit’s git object name (its hash) - 43e3d17fb06dd803bc5d978608633d603fc6d762, which indeed matches the previous commit. We can also see - and this is something very important to note about git - that it points to an entirely different tree object. Conceptually speaking, a commit references a complete copy of the entire directory tree (at least the parts included with git add) at the time git commit is called, as well as a complete copy of the entire directory tree for every parent commit - recursively. This is critical because it means that you only need the hash of a commit to retrieve the entire history of the repository up to that commit - including when fetching parts of a repository from a remote.
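To make the "only the hash is needed" point concrete, here is a small sketch that walks history by following parent lines. It runs against a toy in-memory object store (a plain dict keyed by abbreviated hashes from the transcripts above) rather than a real repository, the commit text is trimmed down to the headers that matter, and `parse_commit` and `history` are illustrative names, not git APIs.

```python
from collections import deque


def parse_commit(raw: str) -> dict:
    """Split a raw commit into header fields and message.

    Only 'tree' and 'parent' headers are interpreted; anything else
    (author, committer, gpgsig continuation lines) is skipped.
    """
    headers, _, message = raw.partition("\n\n")
    commit = {"tree": None, "parents": [], "message": message}
    for line in headers.splitlines():
        key, _, value = line.partition(" ")
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)
    return commit


def history(store: dict, tip: str) -> list:
    """Return every commit reachable from `tip`, tip first (breadth-first)."""
    seen, order, queue = {tip}, [], deque([tip])
    while queue:
        current = queue.popleft()
        order.append(current)
        for parent in parse_commit(store[current])["parents"]:
            if parent not in seen:  # merges can reach a commit twice
                seen.add(parent)
                queue.append(parent)
    return order


# Toy store: abbreviated hash -> simplified raw commit text.
store = {
    "43e3d17": "tree 6a28c5a\n\nAdded a new file\n",
    "3ddaa4c": "tree 4109c39\nparent 43e3d17\n\nCommit with parent!\n",
}
print(history(store, "3ddaa4c"))  # ['3ddaa4c', '43e3d17']
```

A peer that is handed only the tip hash can run the same kind of walk, requesting any object it does not have yet and stopping wherever it reaches a commit it already holds.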
We should probably examine directory tree objects before we go further, though, and for that we need a special command, git ls-tree, because git show only shows filenames:
    $ git ls-tree 4109c394ed1a5a582835d1c487190480ac8832af
    100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad    initial-file.txt
And we can see that tree objects encode:
- permission bits - 100644, here
- type of object - blob, here - in git terminology, a blob is just a simple file, a series of bytes. If initial-file.txt were a subdirectory in the git repository rather than a text file, then the type would be tree.
- the name of the object in the git filesystem, i.e. its hash - 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
- the name of the object in the directory - initial-file.txt
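The fields listed above map directly onto the bytes of a tree object. The following is a rough sketch of that layout, assuming SHA-1 object names; `serialize_tree`, `parse_tree` and `tree_id` are names invented here, and the entry sorting is simplified (real git orders directory names as if they ended in a slash). Each entry is the mode, a space, the file name, a NUL byte, and then the 20 raw bytes of the referenced object's hash.

```python
import hashlib


def serialize_tree(entries) -> bytes:
    """entries: iterable of (mode, name, object_id_hex) tuples."""
    body = b""
    # Simplified ordering: plain sort by name.
    for mode, name, object_id in sorted(entries, key=lambda e: e[1]):
        body += f"{mode} {name}".encode() + b"\x00" + bytes.fromhex(object_id)
    return body


def parse_tree(body: bytes) -> list:
    """Inverse of serialize_tree: recover (mode, name, object_id_hex) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode()
        name = body[space + 1:nul].decode()
        object_id = body[nul + 1:nul + 21].hex()  # 20 raw SHA-1 bytes
        entries.append((mode, name, object_id))
        pos = nul + 21
    return entries


def tree_id(body: bytes) -> str:
    """Object name of a tree: sha1 over b"tree <size>" + NUL + body."""
    return hashlib.sha1(f"tree {len(body)}".encode() + b"\x00" + body).hexdigest()


entries = [("100644", "initial-file.txt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")]
body = serialize_tree(entries)
print(parse_tree(body))
print(tree_id(body))
```

If the layout sketched here matches what git wrote, the last line should print the same tree hash that was passed to git ls-tree above. Note that the type column (blob or tree) is not stored separately; git derives it from the mode.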
Git Object Types
Git objects come in four types - though we’ve only seen three of them. The four types are as follows:
- blob - a simple collection of bytes, i.e. a file.
- tree - a collection of other git objects of various types with associated names - other tree and blob objects - functionally encoding a directory structure.
- commit - references a tree that is the directory state of the git repository at time of commit, and contains information like parent commit(s), author and committer, commit messages, cryptographic signatures, and other such things.
- tag - this is one that hasn’t been seen yet; it has special utility as an intermediary reference that can contain extra information about git tags, which are covered below in the discussion of refs.
So far we’ve covered the content-addressable filesystem component of git, and how its objects are composed into commits. However, none of this explains the common tools of git and how they work, like branches and tags, nor does it explain, for instance, how the git commit command knows which commits to use as parents when constructing a new commit object.
The answer, of course, is git refs. Refs are, essentially, names associated with a particular git object - most importantly, names which can change. For example, in the little repository we made earlier, we can see the ref that defines the master branch, which simply points at the latest commit.
    $ cat .git/refs/heads/master
    3ddaa4ce063a3829732692dd8a46ad435c5471b1
    $ git show 3ddaa4ce063a3829732692dd8a46ad435c5471b1
    commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1 (HEAD -> master)
    Author: sapient_cogbag <email@example.com>
    Date:   Sun Mar 13 18:32:40 2022 +0000

        Commit with parent!

    diff --git a/initial-file.txt b/initial-file.txt
    index e69de29..3b18e51 100644
    --- a/initial-file.txt
    +++ b/initial-file.txt
    @@ -0,0 +1 @@
    +hello world
There are also symbolic references, which are git references that contain the path to another reference within them. For instance, in the example repository, the special HEAD ref is symbolically linked to refs/heads/master:
    $ cat .git/HEAD
    ref: refs/heads/master
When you run git commit, it fully traverses the depth of symbolic references until it finds one that holds an actual git object hash, then it uses that object hash as a parent commit to the new, generated commit. Then, if HEAD is symbolic - as opposed to an object hash, which is known as the detached HEAD state - the linked reference is updated to the new commit hash.
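The traversal described above is simple enough to sketch. This is not how git implements it internally, just a model of the behaviour under a loud simplification: it only reads loose ref files and ignores the packed-refs file that real repositories also use. `resolve_ref` is a hypothetical helper name, and the demo builds a throwaway directory that mimics the relevant part of a .git folder.

```python
import tempfile
from pathlib import Path


def resolve_ref(git_dir: Path, name: str = "HEAD", max_depth: int = 5) -> tuple:
    """Follow 'ref: <target>' links until a ref holding an object hash is found.

    Returns (final_ref_name, object_hash). object_hash is None when the final
    ref does not exist yet, as on a branch with no commits.
    Simplification: only loose ref files are read; packed-refs is ignored.
    """
    for _ in range(max_depth):
        path = git_dir / name
        if not path.exists():
            return name, None
        content = path.read_text().strip()
        if content.startswith("ref: "):
            name = content[len("ref: "):]  # symbolic: keep following
            continue
        return name, content               # direct: an object hash
    raise RuntimeError("too many levels of symbolic refs")


commit = "3ddaa4ce063a3829732692dd8a46ad435c5471b1"
with tempfile.TemporaryDirectory() as tmp:
    git_dir = Path(tmp)
    (git_dir / "refs" / "heads").mkdir(parents=True)

    (git_dir / "HEAD").write_text("ref: refs/heads/master\n")
    unborn = resolve_ref(git_dir)      # branch exists in name only

    (git_dir / "refs" / "heads" / "master").write_text(commit + "\n")
    on_branch = resolve_ref(git_dir)   # HEAD -> master -> commit

    (git_dir / "HEAD").write_text(commit + "\n")
    detached = resolve_ref(git_dir)    # HEAD holds the hash itself

print(unborn, on_branch, detached, sep="\n")
```

The first element of the result is the ref that a new commit would update: refs/heads/master while HEAD is symbolic, and HEAD itself in the detached state.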
Types of Refs
There are 3 (main) types of ref in git:
- heads - These point to the tips of various branches and essentially define branches. When HEAD symbolically references one of these, git commit updates the reference automatically when making a new commit.
- tags - These point either directly to a commit, or to a tag git object as discussed above. Tags provide clean names and information for various commits and can be manually updated. For instance, many pieces of software have refs/tags/v<version number here> as a tag. For more complex bits of software, though, a branch is often used for this task instead.
- remotes - When you define a named remote git repository for git pull/fetch to pull from - for instance, origin - these act as a read-only cache of the last known contents of refs on the remote server. For instance, if you git push <origin>, the server will change any refs/heads/* references to the appropriate local values (usually from the local reference of the same name, but it is possible to have a local heads reference be pushed to a different remote heads reference). If this is successful, the local git will then update refs/remotes/<origin>/* to match the contents of the remote git repository’s refs/heads/* folder. In practice this also works for
This means that when using a peer-to-peer repository system, the main difficulty is propagating updates to refs in a signed format and a well-ordered fashion, and then syncing this across everyone holding a copy of a repository.
Furthermore, git also has various other arbitrary refs, and syncing does essentially require that relevant refs are pulled along. In practice, tags are by far the most important, but a system should try to pull along any other refs as necessary. It’s a bit finicky, really.
In essence, all that git fetch, git pull and git push do is synchronise local refs/remotes/* and remote references as appropriate, then ensure that all objects required for these references to point to something valid are present on the side of the connection being modified. git pull also attempts to merge the local refs/remotes/<remote>/<some branches> into the true local associated branches in refs/heads, as opposed to git fetch, which allows remote refs to be mapped to local refs manually but can also do various things by default - for specifics see man git-fetch or the online docs.
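The mapping from remote refs to local refs is expressed with refspecs of the form `<src>:<dst>`. As a rough model of how one is applied (the function name `apply_refspec` is invented here, and only the simple case of a single `*` wildcard per side is handled), consider the refspec git configures by default for a remote named origin, `+refs/heads/*:refs/remotes/origin/*`:

```python
def apply_refspec(refspec: str, remote_ref: str):
    """Map a ref name on the remote to a local ref name.

    Handles '<src>:<dst>' with at most one '*' on each side.
    Returns None when the refspec does not cover `remote_ref`.
    """
    # A leading '+' only means "update even if not a fast-forward".
    spec = refspec[1:] if refspec.startswith("+") else refspec
    src, _, dst = spec.partition(":")
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, _, suffix = src.partition("*")
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


default = "+refs/heads/*:refs/remotes/origin/*"
print(apply_refspec(default, "refs/heads/master"))  # refs/remotes/origin/master
print(apply_refspec(default, "refs/tags/v1.0"))     # None
```

The second call shows why refs outside refs/heads/* need separate handling: the default refspec simply does not cover them.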
In git, there is a little-known feature that allows for namespacing refs. For most casual users of git, this is fairly irrelevant, but for us it is an incredible opportunity. This feature works by essentially defining different collections of refs - like remotes - at various namespace paths. For instance, you could start a project with GIT_NAMESPACE=user1 set (or by passing --namespace=user1), then if a new user wants to branch off, they can copy these refs to GIT_NAMESPACE=user2 and perform all operations with that in the same repository, and it will affect none of user1’s refs.
Furthermore, namespaces are completely recursive, which means that user1 and user2 can each manage their own namespaces as they see fit, which can be accessed directly in operations like git clone, etc. via user1/<subnamespace path> or user2/<subnamespace path>. And, because these all use the same git object collection, the cost of “branching” with namespaced refs like this is merely that of any extra commits or other git objects added to the repository, and all common commits/files are deduplicated. In the context of a system with potentially thousands of users and hundreds of repository branches, this is a vast efficiency gain, and it also makes alternate versions of a repository (like GitHub’s concept of forking) easy to discover.
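Under the hood, a namespace is just a prefix on the ref's storage path: the gitnamespaces documentation describes refs for GIT_NAMESPACE=foo living under refs/namespaces/foo/, with each further level of nesting adding another refs/namespaces/<level>/ segment. A tiny sketch of that mapping follows; `namespaced_ref` is an illustrative helper, not a git command, and "experiments" is a made-up sub-namespace.

```python
def namespaced_ref(namespace: str, ref: str) -> str:
    """Storage path of `ref` when operating under GIT_NAMESPACE=<namespace>.

    Each '/'-separated namespace level contributes 'refs/namespaces/<level>/'.
    """
    levels = [level for level in namespace.split("/") if level]
    prefix = "".join(f"refs/namespaces/{level}/" for level in levels)
    return prefix + ref


print(namespaced_ref("user1", "refs/heads/master"))
# refs/namespaces/user1/refs/heads/master
print(namespaced_ref("user1/experiments", "refs/heads/master"))
# refs/namespaces/user1/refs/namespaces/experiments/refs/heads/master
```

Both results name refs in the same repository, so whatever commits they point to share a single object store.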
All that would need to be done, then, on the ref side for a fully p2p system is to have every version of a repository contained in a namespace prefix with some cryptographic public key as the first element in the path, and to only update refs when an update is signed by the key pair whose public key is that namespace path prefix. This essentially automatically induces repository-wide deduplication.
Git namespaces also - at least theoretically - provide a useful way to associate other items with a git repository (like issues) as git objects within a repository, without conflicting with existing branches, tags, and other such things.
We essentially went over the core concepts of git as well as a particular obscure feature that readily enables per-user repository contents to be collected together, among other things.
The next post in this series will probably be about git fetch vs git push on the githooks and efficiency front, and why it should hopefully still be possible to use git fetch for parallel remote fetching in combination with the reference-transaction hook for the usage of GitHub/GitLab/etc. as backup commit sources - as well as means for using git push for cleaner mechanics via multi-remote pushing.
Depending on what I feel like, it may also be about anything in the list below (or maybe something else):
- Enabling true anonymity in a p2p git system while still allowing cross-exchanges between subnetworks (e.g. a tor subnetwork or an i2p subnetwork), and discussing the various libraries and tools available for developing p2p projects in Rust.
- Allowing clean aliases for long and unpleasant public keys (e.g. by using centralised services to publicise public keys, or providing non-proprietary identity services as well, or both).
- Handling (or not handling) the git hashing problem in specifics (e.g. the fact that git hashes are not purely of the contents of the object), as well as potential means of providing stronger hashes for cryptographic signing than the hardened SHA-1 - after all, this system should be safely usable even against nation-state-level adversaries.
- Creating true multi-user approval/council systems for truly decentralised and democratic repository development on top of a per-user-repository system.
- Managing notions of trust for repository subbranches - e.g. to reduce the risk of hostile and large commits consuming space, how the hell you even count space when dealing with large numbers of forks of a repo with much common history, etc.