To create a system designed for peer-to-peer git repositories, we must first understand the ideas behind git and the structures it uses. The gritty details can be found in the reference and in the excellent book (the link goes to its page on git internals), but for now we’re covering the basic concepts.
What is a git repository, really?
The answer to this question is simple, actually. A git repository is a combination of:
- An efficiently-deduplicated content-addressable filesystem - that is, a filesystem where files and directories and other objects are retrieved by providing - and referred to by - a cryptographic hash of their contents.
- References (refs) to certain special objects in this filesystem, of various types.
- Configurations of the repository - remote repositories to fetch/pull/push from and to, things like username and email if locally specified, hooks for different actions like merging changes or pushing, and more.
All of this is in the hidden .git folder inside any git repository created with something like git clone or git init.
$ touch initial-file.txt
$ git add initial-file.txt
$ ls
initial-file.txt
$ git status
On branch master
No commits yet
Changes to be committed:
(use "git rm --cached <file>..." to unstage)
new file: initial-file.txt
$ git commit -m "Added a new file"
[master (root-commit) 43e3d17] Added a new file
1 file changed, 0 insertions(+), 0 deletions(-)
create mode 100644 initial-file.txt
$ cd .git/
$ cd objects/
$ tree
.
├── 43
│ └── e3d17fb06dd803bc5d978608633d603fc6d762
├── 6a
│ └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
├── e6
│ └── 9de29bb2d1d6434b8b29ae775ad8c2e48c5391
├── info
└── pack
5 directories, 3 files
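In this series of commands, we’ve taken an empty git repository, added an empty file, and committed it to the repository. We then listed the internal git objects, which are named by their hashes in the ordinary file system that acts as a substrate for git’s own filesystem. Each object lives at a path made of the first two hex digits of its hash (the directory) and the remaining 38 digits (the file name).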
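To make "content-addressable" concrete, here is a minimal Python sketch of how an object name can be derived from content. It assumes the default SHA-1 object format; the function names are made up for illustration and are not part of git.

```python
import hashlib


def git_object_hash(obj_type: str, payload: bytes) -> str:
    """Hash "<type> <size>", a NUL byte, then the payload (SHA-1 repositories)."""
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


def loose_object_path(hex_hash: str) -> str:
    """Split a hash into the directory / file name layout seen in the listing."""
    return f".git/objects/{hex_hash[:2]}/{hex_hash[2:]}"


# The empty file committed above is a blob with a zero-length payload.
empty_blob = git_object_hash("blob", b"")
print(empty_blob)
print(loose_object_path(empty_blob))
```

The printed hash is e69de29bb2d1d6434b8b29ae775ad8c2e48c5391, which is the e6/9de29... entry in the listing. Any empty file in any repository gets this same name, which is where the deduplication comes from.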
Git Objects
Git objects are stored in several forms on your filesystem, the simplest being loose files in the .git/objects folder. In larger repositories they can also be stored in packfiles, in .git/objects/pack, which provide highly efficient deduplication and compression on top of that already implicit in a content-addressed filesystem.
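As a rough illustration of the loose-object form, the sketch below writes and reads objects the way they sit on disk: the header and payload are zlib-compressed and stored under a path derived from the hash. It assumes SHA-1 and ignores packfiles entirely; write_loose_object and read_loose_object are hypothetical helper names, and the demo uses a temporary directory rather than a real repository.

```python
import hashlib
import os
import tempfile
import zlib


def write_loose_object(objects_dir: str, obj_type: str, payload: bytes) -> str:
    """Store an object as a zlib-compressed file named after its hash."""
    store = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00" + payload
    hex_hash = hashlib.sha1(store).hexdigest()
    path = os.path.join(objects_dir, hex_hash[:2], hex_hash[2:])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        f.write(zlib.compress(store))
    return hex_hash


def read_loose_object(objects_dir: str, hex_hash: str):
    """Return (type, payload) for a loose object, checking the recorded size."""
    path = os.path.join(objects_dir, hex_hash[:2], hex_hash[2:])
    with open(path, "rb") as f:
        store = zlib.decompress(f.read())
    header, _, payload = store.partition(b"\x00")
    obj_type, size = header.decode("ascii").split(" ")
    if int(size) != len(payload):
        raise ValueError("size in header does not match payload length")
    return obj_type, payload


with tempfile.TemporaryDirectory() as objects_dir:
    name = write_loose_object(objects_dir, "blob", b"hello world\n")
    print(name, read_loose_object(objects_dir, name))
```

Writing the same content twice lands on the same path, so nothing new is stored the second time.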
Continuing in the repository created just before this section, objects can be pretty-printed with git show by providing their hash:
$ git show 43e3d17fb06dd803bc5d978608633d603fc6d762
commit 43e3d17fb06dd803bc5d978608633d603fc6d762 (HEAD -> master)
Author: sapient_cogbag <sapient_cogbag@protonmail.com>
Date: Sun Mar 13 18:06:24 2022 +0000
Added a new file
diff --git a/initial-file.txt b/initial-file.txt
new file mode 100644
index 0000000..e69de29
By default, git show prints additional information for objects - for instance, commits have a diff printed out - but it also omits certain other information that is useful in understanding how git objects work.
If we instead do the following:
$ git show --pretty=raw 43e3d17fb06dd803bc5d978608633d603fc6d762
commit 43e3d17fb06dd803bc5d978608633d603fc6d762
tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e
author sapient_cogbag <sapient_cogbag@protonmail.com> 1647194784 +0000
committer sapient_cogbag <sapient_cogbag@protonmail.com> 1647194784 +0000
gpgsig -----BEGIN PGP SIGNATURE-----
iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi4yoBMcYW5vbnltb3Vz
QG5vLmVtYWlsAAoJELi9oRqnzGeemOQA/2Wu/S9TEbwhnYXXd9zntxlVinf8DwlH
KmM+DBbKvXP/AQDW1rW0w+VTCrPeSVfGeKMXD++Ond4e1xCYV5Fouu3CAA==
=js4q
-----END PGP SIGNATURE-----
Added a new file
diff --git a/initial-file.txt b/initial-file.txt
new file mode 100644
index 0000000..e69de29
We see an important aspect of how commits conceptually work in git: a commit is simply a reference to another git object (specifically, the tree object for the root directory of the repository) plus additional information like committer, author, message, and, if configured, a cryptographic signature. The reference is the line tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e. If you jump back up in the document, you’ll note that this is one of the objects seen in the directory listing:
├── 6a
│ └── 28c5ace5d1eb910fa77881cd68a29eed5f3e7e
Now, because this was the first commit in the repository, it lacks parent commits. Usually a commit has one parent (and sometimes more, in the case of merge commits), and the parents appear in the commit simply as additional hashes. For example, making a new commit and examining its object representation with git show [1]:
$ echo "hello world" > ../../initial-file.txt
$ cd ../..
$ git commit -am "Commit with parent!"
[master 3ddaa4c] Commit with parent!
1 file changed, 1 insertion(+)
$ git show --pretty=raw 3ddaa4c
commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1
tree 4109c394ed1a5a582835d1c487190480ac8832af
parent 43e3d17fb06dd803bc5d978608633d603fc6d762
author sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000
committer sapient_cogbag <sapient_cogbag@protonmail.com> 1647196360 +0000
gpgsig -----BEGIN PGP SIGNATURE-----
iIkEABYIADEWIQS4qb7l+XIqM5tpfeu4vaEap8xnngUCYi44yBMcYW5vbnltb3Vz
QG5vLmVtYWlsAAoJELi9oRqnzGeeJ90A/RoZHpRRfDa2x8lN9ALWdZ5vX2QuxQjC
43h57CuWBxwxAP47bW6f3K1VaSrqtxdc1iUw8zf/MN7FgkpVxY+PubAiBg==
=frrB
-----END PGP SIGNATURE-----
Commit with parent!
diff --git a/initial-file.txt b/initial-file.txt
index e69de29..3b18e51 100644
--- a/initial-file.txt
+++ b/initial-file.txt
@@ -0,0 +1 @@
+hello world
We can see the parent commit’s git object name (its hash) - 43e3d17fb06dd803bc5d978608633d603fc6d762 - which indeed matches the previous commit. We can also see - and this is something very important to note about git - that it points to an entirely different tree object. Conceptually speaking, a commit references a complete snapshot of the entire directory tree (at least the parts included with git add) at the time git commit is called, and through its parent hashes it also references a complete snapshot for every ancestor commit, recursively. This is critical because it means that you only need the hash of a commit to retrieve the entire history of the repository up to that commit - including when fetching parts of a repository from a remote.
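The "hash of a commit gives you the whole history" point can be sketched in a few lines of Python. This is a simplified reading of the commit payload (header lines, a blank line, then the message), not git's own parser. The payloads below are abbreviated versions of the two commits above with the gpgsig header left out, so the dictionary keys are used only as labels and would not match a recomputed hash.

```python
def parse_commit(payload: str) -> dict:
    """Split a commit payload into headers, tree, parents and message."""
    head, _, message = payload.partition("\n\n")
    commit = {"tree": None, "parents": [], "headers": [], "message": message}
    for line in head.split("\n"):
        if line.startswith(" ") and commit["headers"]:
            # Continuation of a multi-line header such as gpgsig.
            key, value = commit["headers"][-1]
            commit["headers"][-1] = (key, value + "\n" + line[1:])
            continue
        key, _, value = line.partition(" ")
        commit["headers"].append((key, value))
        if key == "tree":
            commit["tree"] = value
        elif key == "parent":
            commit["parents"].append(value)
    return commit


def history(store: dict, tip: str) -> list:
    """Walk parent links from a tip commit; returns hashes, newest first."""
    order, seen, queue = [], set(), [tip]
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue
        seen.add(current)
        order.append(current)
        queue.extend(parse_commit(store[current])["parents"])
    return order


WHO = "sapient_cogbag <sapient_cogbag@protonmail.com>"
STORE = {  # illustrative in-memory object store, keyed by the hashes shown above
    "43e3d17fb06dd803bc5d978608633d603fc6d762": (
        "tree 6a28c5ace5d1eb910fa77881cd68a29eed5f3e7e\n"
        f"author {WHO} 1647194784 +0000\n"
        f"committer {WHO} 1647194784 +0000\n"
        "\nAdded a new file\n"
    ),
    "3ddaa4ce063a3829732692dd8a46ad435c5471b1": (
        "tree 4109c394ed1a5a582835d1c487190480ac8832af\n"
        "parent 43e3d17fb06dd803bc5d978608633d603fc6d762\n"
        f"author {WHO} 1647196360 +0000\n"
        f"committer {WHO} 1647196360 +0000\n"
        "\nCommit with parent!\n"
    ),
}

for commit_hash in history(STORE, "3ddaa4ce063a3829732692dd8a46ad435c5471b1"):
    print(commit_hash, parse_commit(STORE[commit_hash])["tree"])
```

Starting from the newest commit, the walk reaches every ancestor and, through each tree line, every snapshot.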
We should probably examine directory tree objects before we go further, though. For that we need a special command, git ls-tree, because git show only shows filenames:
$ git ls-tree 4109c394ed1a5a582835d1c487190480ac8832af
100644 blob 3b18e512dba79e4c8300dd08aeb37f8e728b8dad initial-file.txt
And we can see that tree objects encode:
- file mode (entry type and permission bits) - 100644
- type of object - blob, here - in git terminology, a blob is just a simple file, a series of bytes. If initial-file.txt was a subdirectory in the git repository rather than a text file, then the type would be tree instead
- the name of the object in the git filesystem, i.e. its hash - 3b18e512dba79e4c8300dd08aeb37f8e728b8dad
- the name of the object in the directory - initial-file.txt
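The fields listed above map fairly directly onto the bytes of a tree object. The sketch below encodes and decodes entries as mode, a space, the name, a NUL byte, then the 20 raw bytes of the hash, assuming SHA-1. The helper names are hypothetical, and sorting by plain name is a simplification of git's real ordering rule, which treats directory names specially.

```python
import hashlib


def encode_tree(entries) -> bytes:
    """Serialise (mode, name, hex_hash) tuples into a tree object payload."""
    body = b""
    for mode, name, hex_hash in sorted(entries, key=lambda e: e[1]):
        body += mode.encode("ascii") + b" " + name.encode("utf-8") + b"\x00"
        body += bytes.fromhex(hex_hash)  # raw 20 bytes, not hex text
    return body


def decode_tree(body: bytes) -> list:
    """Inverse of encode_tree: recover the (mode, name, hex_hash) tuples."""
    entries, pos = [], 0
    while pos < len(body):
        space = body.index(b" ", pos)
        nul = body.index(b"\x00", space)
        mode = body[pos:space].decode("ascii")
        name = body[space + 1:nul].decode("utf-8")
        hex_hash = body[nul + 1:nul + 21].hex()
        entries.append((mode, name, hex_hash))
        pos = nul + 21
    return entries


def git_object_hash(obj_type: str, payload: bytes) -> str:
    header = f"{obj_type} {len(payload)}".encode("ascii") + b"\x00"
    return hashlib.sha1(header + payload).hexdigest()


entries = [("100644", "initial-file.txt", "3b18e512dba79e4c8300dd08aeb37f8e728b8dad")]
body = encode_tree(entries)
print(decode_tree(body))
# Compare this with the tree hash passed to git ls-tree above.
print(git_object_hash("tree", body))
```

The object type (blob or tree) is not stored in the entry; git ls-tree derives it from the mode.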
Git Object Types
Git objects come in four types - though we’ve only seen three of them. The four types are as follows:
- blob - a simple collection of bytes, i.e. a file.
- tree - a collection of other git objects (other tree and blob objects) with associated names, functionally encoding a directory structure.
- commit - references a tree that is the directory state of the git repository at time of commit, and contains information like parent commit(s), author and committer, commit messages, cryptographic signatures, and other such things.
- tag - this is one that hasn’t been seen yet. It has special utility as an intermediary reference that can contain extra information about git tags, which will be covered in the section on refs.
Git Refs
So far we’ve covered the content-addressable filesystem component of git, and how its objects are composed into commits. However, none of this explains how the common tools of git, like branches and tags, work, nor does it explain, for instance, how the git commit command knows which commits to use as parents when constructing a new commit object.
The answer, of course, is git refs. Git refs are, essentially, names associated with a particular git object - most importantly, names which can change. For example, in the little repository we made earlier, we can see the ref that defines the master branch, which just holds the hash of the latest commit:
$ cat .git/refs/heads/master
3ddaa4ce063a3829732692dd8a46ad435c5471b1
$ git show 3ddaa4ce063a3829732692dd8a46ad435c5471b1
commit 3ddaa4ce063a3829732692dd8a46ad435c5471b1 (HEAD -> master)
Author: sapient_cogbag <sapient_cogbag@protonmail.com>
Date: Sun Mar 13 18:32:40 2022 +0000
Commit with parent!
diff --git a/initial-file.txt b/initial-file.txt
index e69de29..3b18e51 100644
--- a/initial-file.txt
+++ b/initial-file.txt
@@ -0,0 +1 @@
+hello world
There are also symbolic references, which are git references that contain the path to another reference within them. For instance, in the example repository, the special HEAD ref is symbolically linked to refs/heads/master:
$ cat .git/HEAD
ref: refs/heads/master
When you run git commit, it follows the chain of symbolic references from HEAD until it finds one that holds an actual git object hash, then uses that object hash as the parent commit of the newly generated commit. Then, if HEAD is symbolic - as opposed to holding an object hash directly, which is known as the detached HEAD state - the linked reference is updated to the new commit hash.
Types of Refs
There are 3 (main) types of ref in git:
- heads - These point to the tips of various branches and essentially define branches. When HEAD symbolically references one of these, git commit updates the reference automatically when making a new commit.
- tags - These point either directly to a commit, or to a tag git object as discussed above. Tags provide clean names and information for various commits and can be manually updated. For instance, many pieces of software have refs/tags/v<version number here> as a tag. For more complex bits of software, though, a branch is often used for this task instead.
- remotes - When you define a named remote git repository for git pull/fetch to pull from - for instance, origin - these act as a read-only cache of the last known contents of refs on the remote server. For instance, if you git push <origin>, the server will change any refs/heads/* references to the appropriate local values (usually from the local reference of the same name, but it is possible to have a local heads reference be pushed to a different remote heads reference). If this is successful, the local git will then update refs/remotes/<origin>/* to match the contents of the remote git repository’s refs/heads/* folder. In practice this also works for tag references.
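The mapping between a remote's branch refs and the local cache is expressed in git as a refspec; the one git clone configures for a remote named origin is +refs/heads/*:refs/remotes/origin/*. The sketch below applies such a pattern to a ref name. It handles only a single * on each side and ignores the meaning of the leading + (allow non-fast-forward updates); map_ref is a hypothetical helper, not git's implementation.

```python
from typing import Optional


def map_ref(refspec: str, remote_ref: str) -> Optional[str]:
    """Return the local ref that a "<src>:<dst>" refspec maps remote_ref to."""
    src, dst = refspec.lstrip("+").split(":", 1)
    if "*" not in src:
        return dst if remote_ref == src else None
    prefix, suffix = src.split("*", 1)
    if len(remote_ref) < len(prefix) + len(suffix):
        return None
    if not (remote_ref.startswith(prefix) and remote_ref.endswith(suffix)):
        return None
    # The part matched by "*" is substituted into the destination pattern.
    matched = remote_ref[len(prefix):len(remote_ref) - len(suffix)]
    return dst.replace("*", matched, 1)


DEFAULT = "+refs/heads/*:refs/remotes/origin/*"
print(map_ref(DEFAULT, "refs/heads/master"))
print(map_ref(DEFAULT, "refs/tags/v1.0"))
```

The first call yields refs/remotes/origin/master; the second yields None because tags are not covered by this refspec.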
This means that when using a peer-to-peer repository system, the main difficulty is propagating updates to refs in a signed format and a well-ordered fashion, and then syncing them across everyone holding a copy of a repository.
Furthermore, git also has various other arbitrary refs - see here for one example - and syncing does essentially require that relevant refs are pulled along. In practice, heads and tags are by far the most important, but a system should try to pull along any other refs as necessary. It’s a bit finicky, really.
git pull/push/fetch/clone
In essence, all these commands do is synchronise local refs/remotes/* and remote references as appropriate, then ensure that all objects required for these references to point at something valid are present on the side of the connection being modified. (pull also attempts to merge the local refs/remotes/<remote>/<some branches> into the associated true local branches in refs/heads, as opposed to fetch, which allows mapping remote refs to local refs manually but can also do various things by default - for specifics see man git-fetch or the online docs.)
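The "ensure all required objects are present" half can be pictured as a graph walk. The sketch below is not git's actual negotiation protocol; it assumes a toy model where the sender knows, for each object, which other objects it references (a commit references its tree and parents, a tree references its entries), and where anything the receiver already has comes with everything reachable from it. Hashes are abbreviated versions of the ones from the example repository.

```python
def objects_to_transfer(remote_graph: dict, wanted_tips, local_have) -> set:
    """Objects reachable from the wanted tips that the receiver lacks."""
    missing, stack = set(), list(wanted_tips)
    while stack:
        obj = stack.pop()
        if obj in missing or obj in local_have:
            continue  # already scheduled, or receiver has it and its ancestry
        missing.add(obj)
        stack.extend(remote_graph[obj])
    return missing


REMOTE_GRAPH = {
    "3ddaa4c": ["4109c39", "43e3d17"],  # second commit -> its tree, its parent
    "4109c39": ["3b18e51"],             # tree -> "hello world" blob
    "3b18e51": [],
    "43e3d17": ["6a28c5a"],             # first commit -> its tree
    "6a28c5a": ["e69de29"],             # tree -> empty blob
    "e69de29": [],
}

# The receiver fetched the first commit earlier and now wants the branch tip.
have = {"43e3d17", "6a28c5a", "e69de29"}
print(sorted(objects_to_transfer(REMOTE_GRAPH, ["3ddaa4c"], have)))
```

Only the new commit, its tree, and the changed blob need to travel; the shared history is skipped.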
Git Namespaces [2]
In git, there is a little-known feature that allows for namespacing refs. For most casual users of git, this is fairly irrelevant, but for us it is an incredible opportunity. This feature works by essentially defining different collections of refs - like tags, heads, HEAD, remotes - at various namespace paths. For instance, you could start a project with GIT_NAMESPACE=user1 (or by passing --namespace=user1), then if a new user wants to branch off, they can copy these refs to GIT_NAMESPACE=user2 and perform all operations with that in the same repository, and it will affect none of user1’s refs.
Furthermore, they are completely recursive, which means that user1 and user2 can each manage their own namespaces as they see fit, which can be accessed directly in operations like git pull/fetch, git clone, etc. via user1/<subnamespace path> or user2/<subnamespace path>. And, because these all use the same git object collection, the cost of “branching” with namespaced refs like this is merely that of any extra commits or other git objects added to the repository, and all common commits/files are deduplicated. In the context of a system with potentially thousands of users and hundreds of repository branches, this is a vast efficiency gain that also allows easy discovery of alternate versions of a repository (like github’s concept of forking).
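Under the hood, a namespace is a prefix on the ref path: GIT_NAMESPACE=foo stores refs under refs/namespaces/foo/, and a nested namespace foo/bar expands to refs/namespaces/foo/refs/namespaces/bar/, as described in the gitnamespaces documentation. A small sketch of that expansion (namespaced_ref is a made-up helper name):

```python
def namespaced_ref(namespace: str, ref: str) -> str:
    """Expand a ref inside a (possibly nested) namespace to its real path."""
    parts = [part for part in namespace.split("/") if part]
    prefix = "".join(f"refs/namespaces/{part}/" for part in parts)
    return prefix + ref


print(namespaced_ref("user1", "refs/heads/master"))
print(namespaced_ref("user2", "refs/heads/master"))
print(namespaced_ref("user1/experiments", "HEAD"))
```

Both users get their own refs/heads/master, while the objects those refs point to live in the one shared object store.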
All that would need to be done, then, on the ref side for a fully p2p system is to have every version of a repository contained in a namespace prefix with some cryptographic public key as the first element in the path, and to only update refs when an update is signed by the public key in the namespace path prefix. This essentially automatically induces repository-wide deduplication.
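One possible shape for that rule is sketched below. Everything in it is an assumption made for illustration: the message layout, the function names, and above all the signature check. The toy verifier uses a lookup table of secrets and SHA-256, which is NOT a real public-key signature; a real system would plug in something like Ed25519 through the verify parameter. The update message includes the old hash so that a stale or replayed update no longer verifies, which is one way to get the well-ordered property.

```python
import hashlib
from typing import Callable, Dict

Verifier = Callable[[str, bytes, bytes], bool]
ZERO = "0" * 40  # stand-in for "ref does not exist yet"


def update_message(full_ref: str, old: str, new: str) -> bytes:
    """Bytes the namespace owner is expected to sign (hypothetical format)."""
    return f"{full_ref} {old} {new}".encode("utf-8")


def apply_signed_update(refs: Dict[str, str], owner_key: str, ref: str,
                        new_hash: str, signature: bytes, verify: Verifier) -> bool:
    """Update a ref under the owner's namespace only if the signature checks out."""
    full_ref = f"refs/namespaces/{owner_key}/{ref}"
    old_hash = refs.get(full_ref, ZERO)
    if not verify(owner_key, update_message(full_ref, old_hash, new_hash), signature):
        return False
    refs[full_ref] = new_hash
    return True


# Toy stand-in for a signature scheme, for demonstration only.
TOY_SECRETS = {"pk-alice": b"alice-secret"}


def toy_sign(public_key: str, message: bytes) -> bytes:
    return hashlib.sha256(TOY_SECRETS[public_key] + message).digest()


def toy_verify(public_key: str, message: bytes, signature: bytes) -> bool:
    return public_key in TOY_SECRETS and toy_sign(public_key, message) == signature


refs: Dict[str, str] = {}
full = "refs/namespaces/pk-alice/refs/heads/master"
new = "3ddaa4ce063a3829732692dd8a46ad435c5471b1"
sig = toy_sign("pk-alice", update_message(full, ZERO, new))
print(apply_signed_update(refs, "pk-alice", "refs/heads/master", new, sig, toy_verify))
print(apply_signed_update(refs, "pk-alice", "refs/heads/master", new, sig, toy_verify))
```

The first call is accepted; the second, a replay of the same signed update, is rejected because the ref no longer holds the old value the signature covers.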
Git namespaces also - at least theoretically - provide a useful way to associate other items with a git repository (like issues) as git objects within a repository, without conflicting with existing branches, tags, and other such things.
Summary
We essentially went over the core concepts of git as well as a particular obscure feature that readily enables per-user repository contents to be collected together, among other things.
The next post in this series will probably be about git fetch vs git push on the githooks and efficiency front, and why it should hopefully still be possible to use git fetch for parallel remote fetching in combination with the reference-transaction hook, with github/gitlab/etc. used as backup commit sources - as well as means of using git push for cleaner mechanics via multi-remote pushing.
Depending on what I feel like, it may also be about anything in the list below (or maybe something else):
- Enabling true anonymity in a p2p git system while still allowing cross-exchanges between subnetworks (e.g. a tor subnetwork or an i2p subnetwork), and discussing the various libraries and tools available for developing p2p projects in Rust.
- Allowing clean aliases for long and unpleasant public keys (e.g. by using centralised services to publicise public keys, or providing non-proprietary identity services as well, or both).
- Handling (or not handling) the git hashing problem in specifics (e.g. the fact that git hashes are not purely of the contents of the object), as well as potential means of providing stronger hashes for cryptographic signing than the hardened SHA-1 - after all, this system should be safely usable even against nation-state-level adversaries.
- Creating true multi-user approval/council systems for truly decentralised and democratic repository development on top of a per-user-repository system.
- Managing notions of trust for repository subbranches - e.g. to reduce the risk of hostile and large commits consuming space, how the hell you even count space when dealing with large numbers of forks of a repo with much common history, etc.