From Git SCM Wiki
Jump to: navigation, search


Historical background

Linus Torvalds wrote at Fri, 5 May 2006, on git mailing list, in Re: [ANNOUNCE] Git wiki:

On Fri, 5 May 2006, Petr Baudis wrote:
> It's a philosophical question here, but I'd say that Git is much closer
> to Monotone than to any other version control system

Some historical background..

Before I dropped BK, I ended up being involved in trying to get Larry and Tridge to come to some agreement about how to solve the issues Tridge had with BK not being open-source. That actually went on for maybe two months or so, and I kept on hoping that we'd find some acceptably middle ground.

I thought we could find somethign that would actually work for everybody: to hopefully both make BK technically better, and to make the end result more palatable to the "free software or bust" contingency.

One of the suggestions that I tried to push as an acceptable middle ground was to make a "generic" BK repository export format, so that people who didn't want to use BK could still get all the information, and not in a broken format like CVS (yes, CVS makes sense as an interchange format, since everybody speaks CVS, but it's a horrible, horrible, horrible format from any technical standpoint).

My example export format was really a strange mixture of patches with parenthood information, where the history information was described with hashes (MD5 rather than SHA1, but that was just an implementation thing, and mostly because BK used MD5 sums). Not something really useful as a real SCM, but it wasn't designed for that - it was just meant to be a useful and unambiguous interoperability format.

Now, that didn't work out, and I was a little bummed. I thought it would have made both sides happy, because it would actually have been a better format than CVS (and yes, I'm somewhat biased: in my opinion, having a million monkeys throwing crap at the walls and encoding the information in the patterns on monkey shit is a better format than CVS), so it would actually have improved BK, while also making it possible to interoperate if you didn't want to use BK itself.

But Tridge didn't believe that it would actually have exported all the information in a BK tree, even if both I and Larry told him it would. I'm not a hundred percent sure that Larry would have gone for the export format either, but hey, one sign of a good compromise is that neither side really gets what they really want. Whatever. It didn't work.

So it didn't actually resolve the deadlock, but when it became clear that I couldn't work with BK any more, I thought I might use something like that "patch + parenthood" representation as a way to maintain my tree while looking at other alternatives.

So in many ways, when I started looking around for distributed SCM's, I came into the game with the background of keeping the history around as chains of hashes describing it, and then just having patches to describe the differences between versions.

So that was really my "fallback" position: if nothing out there worked, I'd rather go back to lists of patches than use CVS.

Now, if you keep track of just patches, one of the issues is that you can't afford to re-create the tree every time by walking patches forward from the beginning, so I also was planning to have an "cache" that maintained the current state of the tree as a separate state from the working tree, so that I would always have the "working tree" and the "result of patches up to this moment" as two separate things (so that I could do the "`bk diff`" that I was used to doing to see the difference between my last state and the current state of the working tree).

In other words, I was already working on the git "index" file. And I was planning to just have a patch-based system behind it, with a hashed history. Kind of "quilt with history and an index to speed things up".

The index itself would be backed-up with whole files (all hidden in the ".dircache" directory), and the patch series would thus normally never actually be used. So the inefficiency of working with patches would never be much of an issue. A "commit" would create a new patch from the current working directory and the previous shadow tree, and update the shadow tree and add a new entry to the history list.

And then I found Monotone.

Now, monotone was slow. Monotone was so horrendously slow that I had to do special hacks just to import one version of Linux into it in less than two hours. It was something stupid like an O(N**3) algorithm in the number of filenames (and the kernel had 17,291 files at that time: v2.6.12-rc2), and it was just totally unusable for me.

I also thought (and still think) that the whole signing thing was a waste of time and misdesigned, and I obviously am not a huge fan of databases. So in many ways I disliked the monotone implementation decisions (and some of its design decisions). But at the same time, I immediately liked the SHA1 object naming concept of Monotone.

It also already matched how I had conceptually planned on doing on the history anyway, and had some ideas for, but it took that whole "history hashing" all the way.

And thus git was born.

So git really has three parents. In a very real sense, BK (or, perhaps more appropriately - the way I personally used BK, which is not necessarily how others have used it) was the biggest thing from the standpoint of what I wanted my workflow to be like. It was simply how I had done things for the last few years, so a lot of my mental model for how things are supposed to work came from BK.

I still don't think people give Larry enough credit for actually pushing this whole distributed SCM thing as a usable model. Very few of the open-source distributed SCM's are actually usable even today, and as far as I've been able to gather, the commercial ones aren't really any closer either. Larry didn't have the kind of examples of what can work that I had.

The other parent was the stupid "series of patches" model, which was what really resulted in the "index" thing. I realize that people don't always much like the index, but it's really a pretty central part of git history, and one of the distinguising marks of git. It may be trivial, and to some degree it's been overshadowed by all the tree operations we do (the combination of revision walking and tree diffing), but it was very central to how git came to be.

The index also ended up being central to how we did merges - even if some day we may end up doing more of that on a pure tree level (ie the current git-merge-tree model), I think the way we ended up doing merges owes a lot to the index as a staging area.

(Historically, the "index" was called the "cache". Exactly because it came from the notion of "caching" the top commit state in a patch series, and then working with patches either backwards or forwards from that top cached state. Similarly, we didn't have a ".git" directory: it was called ".dircache", exactly because it was all about caching the state of the previous commit directory layout).

And finally, Monotone for the "everything is an object named by its SHA1" model, which to some degree is perhaps the central - or at least the most obvious - part of git. It largely was designed really just to be the "backing store" for the "cache", and to not be that important. That also explains why I didn't worry too much about disk usage etc initially: the object store wasn't even the most important part, and I envisioned just moving old objects that weren't needed into some "backup storage" kind of thing.


Early history

Git development began after many kernel developers were forced to give up access to the proprietary BitKeeper system (see BitKeeper for Linux and other open source projects "Zero-cost BitKeeper for Linux and other open source projects"). Permission to use BitKeeper as freeware had been withdrawn by the copyright holder Larry McVoy after he accused Andrew Tridgell of reverse engineering the BitKeeper protocols. The development of Git began on April 6, 2005, and proceeded very rapidly. The first merge of multiple branches was done on April 18, 2005, and two months later (June 16, 2005), the kernel 2.6.12 release was managed by Git.

Linus wanted a distributed system that he could use like BitKeeper, but none of the available free systems met his needs, particularly his performance needs. From an e-mail he wrote on April 7, 2005 while writing the first prototype:

However, the SCMs I've looked at make this hard. One of the things (the main thing, in fact) I've been working at is to make that process really efficient. If it takes half a minute to apply a patch and remember the changeset boundary etc. (and quite frankly, that's fast for most SCMs around for a project the size of Linux), then a series of 250 emails (which is not unheard of at all when I sync with Andrew, for example) takes two hours. If one of the patches in the middle doesn't apply, things are bad bad bad.

Now, BK wasn't a speed deamon either (actually, compared to everything else, BK is a speed deamon, often by one or two orders of magnitude), and took about 10-15 seconds per email when I merged with Andrew. HOWEVER, with BK that wasn't as big of an issue, since the BK<->BK merges were so easy, so I never had the slow email merges with any of the other main developers. So a patch-application-based SCM "merger" actually would need to be faster than BK is. Which is really really really hard.

So I'm writing some scripts to try to track things a whole lot faster. Initial indications are that I should be able to do it almost as quickly as I can just apply the patch, but quite frankly, I'm at most half done, and if I hit a snag maybe that's not true at all. Anyway, the reason I can do it quickly is that my scripts will not be an SCM, they'll be a very specific "log Linus' state" kind of thing. That will make the linear patch merge a lot more time-efficient, and thus possible.

(If a patch apply takes three seconds, even a big series of patches is not a problem: if I get notified within a minute or two that it failed half-way, that's fine, I can then just fix it up manually. That's why latency is critical - if I'd have to do things effectively "offline", I'd by definition not be able to fix it up when problems happen).

Linus achieved his performance goals; on April 29, 2005, the nascent Git was benchmarked recording patches to the Linux kernel tree at the rate of 6.7 per second.

He developed the system until it was usable by technical users, then turned over maintenance on July 26, 2005 to Junio C. Hamano, a major contributor to the project. Junio was responsible for the 1.0 release on December 21, 2005. As of May 2006, the current release is 1.3.3.

Git Chronicle

On GitTogether 2008 Junio C Hamano gave a "Git Chronicle" talk about more modern history, and how features were added to Git. There are slides in PDF for this talk.

Personal tools