Part 0 - The Playing Field


Before I start exploring p2p git and how to build it, I want to take a peek at the current options for peer-to-peer git, their issues, and my motivation for building my own for everyone to use.

The motivations for peer-to-peer git (and preferably, anonymous git, but I’ll go into that later) are clear.

While git itself as a protocol and version control system is decentralised - that is, everyone downloading a repository typically holds a complete copy of it on their disk - in practice what we’ve ended up with is a few central nodes (GitHub, GitLab, and a couple of others) hosting most of the world’s open-source software and acting as a central hub.

This presents a major vulnerability in the supply chain of software. The world runs on FOSS, on the code that people are hosting on GitHub or GitLab, and that is a problem when you consider the implications of an attack on one of these two major sites. While Git itself provides some defence mechanisms, in practice if someone can push to the master/main branch (and if they’ve infiltrated GitHub, then they definitely can), they have compromised the central repository everyone is working on until the main author notices the disparity.

Theoretically, this is mitigated by Git’s ability to sign commits with gpg/ssh/x509 keys (see git verify-commit and git verify-tag), but in practice verifying the signatures of all the commits is not automatic and rarely done, due to the need to manually verify every individual commit.
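As a minimal illustration (assuming a signing key is already configured in your git config): signing is a single flag, but verification is an explicit, per-commit operation.

    # Sign a commit, then verify it:
    git commit -S -m "some change"
    git verify-commit HEAD

    # Checking an entire branch means explicitly walking every commit;
    # this exits non-zero if any commit is unsigned or badly signed:
    git rev-list HEAD | xargs git verify-commit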

Furthermore, if an author pulls down commits from upstream and signs a new commit, then the vulnerable file will be treated in that commit as having been checked and signed by the author. This is a common workflow for single-author projects (using git pull instead of git fetch), used to merge in changes they made via a web interface or that someone else contributed.
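A more defensive version of that workflow - sketched here assuming a remote named origin and a main branch - is to fetch and inspect before merging:

    git fetch origin
    git log --show-signature ..origin/main   # review incoming commits and their signatures
    git merge origin/main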

The second vulnerability - and, in practice, a much more dangerous one as far as I am concerned - is the control this gives the companies running the central repositories over Free & Open Source movements and projects. In particular, GitHub is run by Microsoft, though the criticism applies to most of these central repos, not just them.

It means Free and Open Source projects end up vulnerable to massive disruption if the central host simply decides to shut down a project (or is legally forced to do so).

Part of this is also that people build their development infrastructure around these proprietary software hosts - things like “GitHub forking”, GitHub issues, and platform-specific CI. All this makes the FOSS movement much more dependent upon corporate platforms to simply function, increases the migration cost if a particular host becomes hostile, and can completely shut down any ability to collaborate in the case of temporary or permanent removal of a repository.

Luckily, at least, the git VCS itself is decentralised enough that the code and repository histories are preserved when moving, so it isn’t a total dependence like you’d expect from purely proprietary formats, but it concerns me that even this much has happened.

It’s gotten to the point that even publishing a crate on crates.io requires logging in with a GitHub account - and while I do appreciate that they are trying to integrate the teams/groups feature and that maintaining your own account database takes infrastructure - it worries me that you need to integrate with a proprietary structure to even contribute to an entire language’s ecosystem.

GitHub also has its own weird platform-specific CLI that to me emits whiffs of Embrace, Extend, Extinguish - though I am probably being a little paranoid on that one.

Examples

The YouTube-DL Incident

So, I’m writing all this stuff about issues of centralised control and theoretical attacks. But what does that mean in practice?

For that, we have to go back to what got me interested in p2p git in the first place - the incident in October 2020 when the widely-used tool youtube-dl (a tool for downloading YouTube videos) got DMCA’d off of GitHub by the nasty folks at the RIAA (an industry group that is big on DRM and copyright and all that sort of stuff).

The entire development process was completely shut down. My understanding is that it was quite stressful for the developers. And it illustrates the power of legal threats to shut down software that megacorporations find inconvenient to leave lying around, when it is centrally hosted and coordinated. Eventually, to my knowledge, they had to remove some tests from the repository that were deemed to be copyright violations.

While FOSS is bound to centrally-controlled, barely pseudonymous repositories, its projects are held at the whim of legal threats by any sufficiently monied opponent of whatever they do.

Luckily, after the youtube-dl incident, GitHub did say they would provide legal aid to Open Source projects facing legal threats. But this depends on the goodwill of GitHub, and on your software not being illegal - which, in the case of anything that unlocks proprietary devices or DRM, it may well be. For instance, the library VLC uses to crack DVDs in order to play them (DeCSS, and the FOSS implementation libdvdcss) is in a legal grey area at best.

Nintendo ROMs, Emulators, Mods, etc.

Another thing that has been massively suppressed in the FOSS community is the development of game modifications and customisations - as well as device emulators - for the products of particularly litigious corporations (with Nintendo being a major example).

Every year or two, there’s a story where some fan-made version of a Nintendo game like Mario, or an emulator for a newer Nintendo device, gets smacked down off of GitHub - and frankly, it’s suffocating for the free expression of love for a game, or improvements to it.

This is a subset of a wider issue around the legality of reverse-engineering proprietary code and devices, which is significantly hampered by legal obstacles perpetuated by lack of anonymity.

Encryption Export Laws

For a long time - up until the late 90s/early 2000s - code implementing encryption was illegal to export from several countries, like the US, due to its classification as a munition/weapon of war. This created challenges for cooperation between open-source devs on the eastern and western sides of the Atlantic, in terms of where code designed for encryption could be hosted and who could view it.

All of this is the sort of thing that would be significantly easier if developers could remain anonymous and not host their code centrally in such a way that it can be made inaccessible by legal threats.

Attempts to Legally Force Backdoors and Attempts to Ban End-To-End Encryption

Though the era of crypto-as-munitions has mostly passed, the war on publicly available encryption and computer security by controlling governments continues unabated. For instance, in 2018 the Australian Government passed a law allowing them to legally demand a backdoor in any communications or computer service in their country.

Which is terrifying, but also leads back to the issue of having a single, poorly-verified point of failure in the hosting of open-source software.

While the threat of backdooring GitHub itself is comparatively low, the threat of open-source software that refuses to insert backdoors being taken down via legal attacks on a central hosting service - or via demands that the service block certain repositories in a given country - is, in my opinion, much higher.

But more common than the attempts at legally mandating a backdoor are the attempts at making all end-to-end encryption illegal or monitored by the government (the latter being a contradiction, both logically and mathematically, of the meaning of “end-to-end encryption”), or otherwise threatening services that offer it.

For instance, there have been repeated attempts in the UK to ban or restrict end-to-end encryption - the latest being a campaign to keep it out of Facebook Messenger, with ridiculous faux-grassroots advertising that ignores the creepiness of having some person in a company, or any future government or hacker, spying on everything someone says, able to interfere with or manipulate the conversation at their whim or use that information for any purpose. And my home shithole is far from alone in this regard, with continuous attempts by security agencies and miscellaneous authoritarians to control people’s private communications.

In many more authoritarian countries, this has already happened, and end-to-end encryption is hard to come by without Tor access. For those governments, interfering with open-source development - and with access to projects that could enable this sort of end-to-end-encryption-based “bootstrap to open internet access” - is paramount.

It has more subtle effects, too. If, say, a Tor developer in a particularly oppressive regime cannot access the Tor Browser codebase or publish updates anonymously while in that country, then it could be harder (or in many cases, actively dangerous) for them to modify the code to improve Tor access in that region, or even just to download the repository.

But the big threat - like with the mandated backdoors - is the risk of a central service being legally forced to remove access to a repository that enables end-to-end encryption or anonymity, by the country it is being hosted in or by the government of a country they serve that repository to. Or worse, being forced to patch any downloads with a backdoor without notifying people.

Current Options

So, we get to the meat of the post: what are the current options for decentrally-hosted git (because they do exist), and why do I not think they are good enough to satisfy the motivations I’ve laid out above?

Cloud and Self-Hosted Options

There are several options to self-host or cloud-host a git server for your repos without it interacting with others running the same software: running the open-core GitLab source code, simply using a remote machine with SSH access for repository purposes, or running git daemon to provide public read access without handing out SSH keys. Or, for those wanting a fully open-source system with a web interface, Gitea works too.
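As a sketch of the simplest case (the hostname and paths here are illustrative), any machine you can SSH into is already a git host:

    # Create a bare repository on the remote machine and push to it:
    ssh myserver 'git init --bare repos/project.git'
    git remote add origin myserver:repos/project.git
    git push -u origin main

    # Optionally serve read-only anonymous access without SSH keys:
    ssh myserver 'git daemon --base-path=$HOME/repos --export-all'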

All these options do provide benefits with regard to removing some of the centralisation involved in GitHub and GitLab. However, they still leave each project dependent on a single host - one machine or operator that can be taken down, blocked, or legally coerced - and they do nothing to improve contributor anonymity.

Federated Hosting

The chief project working towards federated git hosting and collaboration tools is, to my understanding, ForgeFed - which Gitea is working on implementing. There are also attempts to make the email workflow that git includes as a built-in more accessible, like sourcehut¹, which - by proxy of email being federated - is itself federated to some degree.
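That built-in email workflow is worth seeing concretely; a minimal sketch (the mailing list address is a placeholder) looks like:

    # Turn the latest commit into an emailable patch and send it to a list:
    git format-patch -1 HEAD
    git send-email --to=project-devel@lists.example.org 0001-*.patch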

These options definitely improve on the decentralisation aspect of git - and in an implementation of p2p git, it is probably worth implementing ForgeFed at the very least to enable relatively seamless integration between federated and p2p instances, as a secondary priority.

However, they still present a moderately centralised system (if much less so), with a reliance on the hosts of the federating servers not to mess around with the commits. There is also the issue of efficiently replicating new commits across a federated network (this also applies to a hypothetical p2p git implementation), and of whether a server will even actively propagate them.

Existing P2P Git Implementations

This is probably what you clicked here for, and what I want to go over: the existing p2p options for git.

git-ssb (Secure Scuttlebutt Addon)

Secure Scuttlebutt is a peer-to-peer network that enables generalisable message transmission, feeds, and gossip in a comparatively simple manner, and git-ssb is a tool built upon it to provide some basic git collaboration and repository synchronisation between peers on a given message feed.

This is definitely a workable solution in some respects, and especially good if you already have Scuttlebutt set up. The main problem I have with it is that Scuttlebutt - as a project - does not do peer discovery automatically, which naturally tanks the usability; and without automatic NAT-smashing, it can lead to a small minority of public nodes perpetually relaying information between the users that can’t reach each other due to being behind a NAT (which is the vast majority of users!).

gittorrent

Gittorrent attempts to build a git remote option on top of the BitTorrent protocol, and in my opinion, is pretty good at it. In particular, it uses the DHT (Distributed Hash Table) to point to nodes in a p2p network holding a given commit, with updates to the latest commit being embedded in the DHT to allow for pushing/pulling and general mutability.

It uses BitTorrent to upload and download packfiles - that is, collections of git objects such as commits and versions of files - with peers. However, it has a number of flaws: most notably, it still relies on SHA-1 for object identity (more on that below), and the project has been inactive for years.
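Packfiles are worth seeing concretely: they are what git itself natively transfers over the wire, and git can produce one on demand. This standard pipeline is roughly the batching such a tool performs for peers:

    # Pack every object reachable from HEAD into a single compressed file:
    git rev-list --objects HEAD | git pack-objects --stdout > snapshot.pack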

However, I do think out of all the options available today, this is fairly good.

hypergit

Hypergit is another pretty good option, but it uses a tracking server - which is a central point of failure even worse than bootstrap servers for DHTs - and it stores your git repository completely uncompressed, that is, using no packfiles at all. This has a massive cost in terms of disk usage (as in, 1 to 2 orders of magnitude increase).
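You can get a rough sense of what packfiles save by comparing loose and packed sizes on any repository you have locally:

    # "size" counts loose objects; "size-pack" counts delta-compressed packfiles:
    git count-objects -v -H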

While it does have local forks, this system does not appear to have anything like merge requests, or any way to access other people’s forks from the original git repository.

It also ties your identity to the identity of the node you are working on (to my understanding), which is distinctly inflexible if you ever work on more than one device.

This option doesn’t have the SHA-1 problem, as (to my understanding) it attaches cryptographic signatures to objects.

IPFS Options

The existing article that I’m using for a fair bit of my information also points to a couple of options based around IPFS, which use the IPFS Content-Addressable Filesystem to provide git repositories on a per-object (git objects are things like commits and files) basis. Like hypergit, these host git repositories completely uncompressed on disk and therefore result in a one or two order of magnitude increase in disk usage as well as large increases in network traffic when exchanging repository contents.

They also run the risk of the repository simply being… deleted from the IPFS node because it didn’t pin the repository. These options also lack the ability to enable user-friendly forks and do not take advantage of Git’s packfiles and intelligent fetching/pushing capabilities.

They, at least, also lack the git SHA-1 problem, but have several issues in regard to upload/download performance and - due to the need to grab every git object (commits, file versions, tags, etc.) individually from the distributed hash table - involve massive amounts of network overhead.
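The scale of that per-object overhead is easy to underestimate; every line of output below is one object, and in a per-object scheme each one is its own DHT lookup:

    # Count every commit, tree, blob, and tag in a repository:
    git rev-list --objects --all | wc -l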

Looking at their GitHub pages in 2022, both of these projects appear to be inactive, though they are more recently developed than GitTorrent.

SHA-1

I’ve mentioned several times an issue involving the hash function SHA-1. This relates to the fact that it is, at least partially, broken - i.e. it is possible to generate two different inputs that produce the same output, and as such, if a SHA-1 hash is used to authenticate the contents of a file, a malicious actor could replace the contents with something else and it would still be considered valid.

In the case of SHA-1, this (publicly, at least) still requires the use of one of several fixed prefixes of data to be feasible, but it is generally considered insecure for this reason - see https://shattered.io/ for details.

Git uses a hardened version of SHA-1 which provides protection against known attacks (though not against a currently-infeasible but potentially future-feasible birthday-paradox attack exploiting SHA-1’s limited 160-bit output space). However, it is considered better to use more recent hash functions with larger output spaces and more secure algorithms, such as SHA2-256/512² and SHA3-256/512.
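That transition is already visible in git itself; a sketch of trying it out (experimental, with limited interoperability at the time of writing):

    # Initialise a repository that uses SHA-256 for all object ids:
    git init --object-format=sha256 sha256-demo
    cd sha256-demo
    git rev-parse --show-object-format   # prints "sha256"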

The usage of a hardened version of the algorithm does mitigate potential attacks, but I would still consider it very non-ideal for cryptographic signatures (that is, signing a hash of a file to verify the contents of the file), especially in the case of future vulnerabilities in the algorithm.

This poses a problem for p2p git systems, because it is often desirable to use git’s inbuilt content-addressing as a method of verifying repository contents, but the slow progress on transitioning away from SHA-1 (even if hardened) makes this more difficult. Most p2p systems in existence today - such as the IPFS options and HyperGit - use whatever verification function is native to the underlying p2p network.

Gittorrent, however, does still use SHA-1, which is a problem.

The main issue this causes when writing a p2p git protocol is the potential need to maintain a map from (potentially maliciously collided) hardened-SHA-1 hashes to valid (non-collidable) SHA2-256/512 or SHA3-256/512 hashes of the actual contents, if you are working without a preexisting network (like IPFS, or Dat in the case of HyperGit) to do it for you - though it is a solvable problem.

It just adds overhead: secondary digests need to be transferred for all git objects rather than just a given set of branch commits (because the SHA-1 hashes inside a commit cannot, by themselves, be considered secure identifiers for the validity of the objects they reference); git objects need to be checked for validity without inducing a race condition between git fetch and git cat-file --batch; and the object header must be included in the hash function input.
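That last point - the header - is easy to get wrong. Git hashes “<type> <size>\0<contents>”, not the raw file, so a sidechannel digest has to hash the same framed bytes. A minimal sketch of the idea (using sha256sum from coreutils):

    # Git's object id covers a header plus the contents:
    printf 'hello\n' | git hash-object --stdin              # git's (hardened SHA-1) object id
    { printf 'blob 6\0'; printf 'hello\n'; } | sha256sum    # same framed bytes, stronger hash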

Anonymity

This is a massive issue with p2p networks right now. Currently, the vast majority of p2p networks are incredibly leaky about things like IP addresses. libp2p has proper Tor support - that doesn’t leak - on hold indefinitely, though if you separate out the Tor and non-Tor (or other anonymising network) parts of swarms while using something like this, it may be possible.

Bittorrent also has this issue, and so do most other p2p networks. Depressingly, this means most p2p networks are far worse at protecting the peers involved from deanonymisation and legal attack - which is a major hypothetical benefit of p2p git, especially with respect to the DMCA and hostile copyright enforcers with claims of questionable actual legality but buckets of money to throw at a developer. Or worse, with respect to leaks of security-agency sourcecode and documents, and the nation-state threats associated with that kind of leak.

If a reliable means of separating out different networks and manually controlling the relaying of repositories and peers is developed, it would be almost revolutionary in impact for the safety of peer-to-peer networks and the ability to route them over things like Tor, I2P, or hypothetical future mixnets.
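For comparison, plain centralised git can already be routed over Tor via a SOCKS proxy (this sketch assumes a local Tor daemon on its default port, and the URL is a placeholder); it’s the p2p layers that currently lack an equivalent:

    # Force git's HTTP transport through Tor's SOCKS proxy:
    git -c http.proxy=socks5h://127.0.0.1:9050 clone https://example.org/project.git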

None of the current options provide seriously good methods of managing this, though covering how I would (and intend to) do such a thing will come within the next two or three posts.

Conclusion

It is clear that the current options do not (at least to me) meet the standards required for collaborative usability, anonymity, identity mobility, and cryptographic verification, though they are a good effort given that they are essentially “first goes” at the technology.

Few of them embrace git’s own inbuilt ability to be decentralised, and instead tend towards being very invasive and non-transparent towards git’s internals (manually managing git objects, forcing non-compressed repositories or manually building packfiles, and not allowing for many of git’s native capabilities) - either because transparency did not fit the constraints of the protocol they were building on top of, or because they were written before git implemented certain advanced features like parallel fetch, git namespaces, and protocol v2.
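A quick sketch of what “transparent” use of those features looks like (the remote name is illustrative):

    # Protocol v2 lets the client ask the server for exactly the refs it wants:
    git -c protocol.version=2 ls-remote origin

    # Namespaces let one object store serve many independent ref views
    # (pattern adapted from the gitnamespaces documentation):
    git clone ext::'git --namespace=alice %s /srv/shared.git'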

Using git’s inbuilt decentralised nature and some sidechannels for better cryptographic security (better hashes for signing) means a system which is not only less invasive to implement, but also transparently enables the use of newer git features as they are released.

The next post on this will be about the internal structures of git and how they enable efficient decentralisation - as well as what is necessary on top of the existing infrastructure to realise the full extent of theoretical capabilities.


  1. Unfortunately email is… very janky as far as I’m concerned. ↩︎

  2. git is currently in the early stages of transitioning to sha256. ↩︎