Version Control and Leadership (by alaric)
For many years now, most of my home directory has been under version control of one form or another. I have a laptop, a desktop machine, and a server I ssh to; keeping stuff in synch between three working environments is very valuable, as is having efficient offsite backups and history.
I started my version control career, like most folks, with CVS - since for a long time CVS was the only open-source version control system in widespread usage.
Then along came Subversion, which was clearly Much Nicer, and I quickly switched my personal version control over to it. As a freelance software engineer I use it commercially, and I now run a virtualised trac/svn hosting system that lets me easily add new projects; many of the projects I'm involved with are hosted on it, and my open source projects run on a similar platform.
However, more recently, there's been an explosion of interest in the distributed version control model, with lots of products appearing, such as Darcs, Mercurial, Monotone and Git.
I've been quite interested in the distributed model; sure, Subversion is working well for me, but the distributed model interests me because it's more general. You can set up a central repository and push all your changes to it so it's the central synch point, like a Subversion repository, but you don't have to; you can synch changes between arbitrary copies of your stuff without having to go through a central point. And given two approaches, one of which has a superset of the functionality of the other, I'm naturally drawn towards the superset, even if I only need the features of the subset - because I can't predict what my future needs will be.
Also, these distributed version control systems seemed to have better branch merging than Subversion, which until recently required manual tracking of which changes had been merged into a branch from other branches. And being able to do 'local commits' to a local repository, while working offline on my laptop on a train, then commit them to the server as a batch would be great. Subversion really can't do very much without a network connection to its server at the moment.
Now, I was starting to gravitate towards Mercurial, since it's written in Python and seems quite widely available. But then I saw the following talk by Linus Torvalds on git (which he originally wrote):
Two things struck me.
- I do like the architecture of git. Subversion stores history as a set of deltas: each version the files have been through is encoded in terms of its differences from the next, while git just stores as-is snapshots of the state in a content-addressable file system not unlike Venti, which automatically replaces multiple copies of identical data with references to a single copy. So it can pull out any version of the files very quickly, and doesn't really have to worry too much about how versions are related; Subversion stores everything as explicit chains of diffs and has to walk those chains to get anywhere. Git makes a note of which revision led to which revision(s) - it can be more than one if there was a branch, and more than one revision can lead to the same revision if there was a merge - but that's just used for working out the common ancestor of two arbitrary revisions in order to merge them: git can efficiently and reliably merge arbitrary points in arbitrary branches by skipping along the links to find the nearest common ancestor, generating diffs from that to the source of the merge, then applying those diffs to the target of the merge. There's none of the complex bookkeeping that Subversion has to do to track which changes have been applied where. NOTE: I'm talking about "Subversion vs. Git" here since those are the examples of each model I know much about - I'm really comparing the models, not the precise products.
- Linus Torvalds makes a show of calling people who disagree with him "stupid and ugly", making somewhat grand claims such as stating that centralised version control just can't work, and generally acting as though he's smarter than everyone else. Now, he does this in a tongue-in-cheek way; I get the impression he's not really a git (even though he claims he named git after himself), although I couldn't be sure unless I met him. Indeed, I used to think he was a bit of a git from reading things he'd said, but seeing him in action on video for the first time made me realise that he seems to be joking after all. BUT, I think this may be part of why he has become famous and well-respected in some circles. There are a few quite cocky people in the software world who push their ideas with arrogance rather than humility, steamrolling their intellectual opponents with insults; Richard Stallman comes to mind as another. People who do this but are notably and demonstrably wrong get 'outed' as gits and lose a lot of respect; but if you do this and are generally right, it seems to lead to you having vehement followers who believe what you say quite uncritically. Which is interesting.
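As a sketch of the snapshot model from the first point above, here's a toy content-addressable store in Python - purely an illustration of the idea, not git's actual object format (which also compresses and packs its objects):

```python
import hashlib

class ContentStore:
    """A toy content-addressable store: every object is keyed by the
    SHA-1 of its content, so identical data is stored exactly once."""

    def __init__(self):
        self.objects = {}

    def put(self, data: bytes) -> str:
        key = hashlib.sha1(data).hexdigest()
        self.objects[key] = data  # storing the same content twice is a no-op
        return key

    def get(self, key: str) -> bytes:
        return self.objects[key]

# Two "snapshots" containing an identical file deduplicate automatically:
store = ContentStore()
rev1 = store.put(b"hello world\n")
rev2 = store.put(b"hello world\n")
assert rev1 == rev2 and len(store.objects) == 1
```

Retrieving any revision is then a single hash lookup, which is why pulling out an old version is so cheap compared to walking a chain of diffs.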
But I still can't choose. I see a lot of git vs. svn vs. hg vs. monotone vs. darcs comparisons - most of them complaining about problems with the losing system that have since been fixed in more recent versions. They're all rapidly moving targets! It looks like the only way to actually choose one is to spend a few months working on a major project with recent versions of each... in parallel. NOT GOING TO HAPPEN!
I dunno. I'm kinda leaning towards moving to git, but I'm worried that this might just be Linus Torvalds' reality distortion field pulling me in. Next I'll be using Linux if I'm not careful...
By David McBride, Sun 3rd Aug 2008 @ 12:55 pm
Looking at your list of systems, I can't see any significant omissions. If you do decide to switch to using an existing distributed revision-control system, you almost certainly want to be using one of the ones you mention.
Git and Mercurial in particular appear to be the most popular; a number of major open-source projects have switched from CVS/SVN to using one or the other for their revision control needs.
I've used Git experimentally; a couple of years ago there were significant usability issues making it hard for mere mortals to play with, but rapid development has seen the tooling mature and, though I don't use it heavily, I like it. There's a lot of good tooling for it -- particularly for importing old-world revision control archives -- and, as you say, its data model is rather nice.
Git appears to be particularly popular amongst systems developers -- kernel projects, x.org, samba, Debian etc. -- as well as amongst the Ruby crowd. The main reason people tend to quote for not adopting it is lack of Windows client support, but you probably don't care about that.
I haven't used Mercurial, and don't know anyone who does, so can't comment usefully on it. It appears to be very similar to Git, so the main reasons for using one over the other are probably going to be based on popularity and network-effects rather than actual technical differences.
By David McBride, Sun 3rd Aug 2008 @ 3:36 pm
Hmm, it appears darcs is no longer a viable concern; see: https://lopsa.org/node/1656
By Violet, Mon 4th Aug 2008 @ 8:43 am
You don't seem to mention bazaar in your list of likely candidates, which is the one distributed rcs that I've heard most about. Any particular reason? Or has it just dropped off your radar? Do check it out if it has passed you by. Cheers.
By alaric, Mon 4th Aug 2008 @ 9:00 am
"If I do decide to switch to using an existing..." Are you daring me to write my own? Are you? Are you? 🙂
By alaric, Mon 4th Aug 2008 @ 9:19 am
Yeah, I'd looked briefly at bazaar, but it seems to be falling by the wayside compared to the git/monotone/mercurial set (RIP darcs). I'll take a closer peek...
What I am finding tricky is finding out which ones use a content-addressable store under the hood as opposed to a revision tree. Is git actually the only one? I know that Subversion (and, therefore, its distributed-operation wrapper svk) and Mercurial use revision trees, but digging into the depths of systems to find out how they work at the bottom level is a slow process.
Content-addressable storage is really brilliant for append-only filesystems; it's efficient and simple, and everything in the store comes with a free Merkle tree hash.
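To illustrate what I mean by the free Merkle tree hash (a toy sketch, not any particular system's real on-disk format): if each directory's hash is computed over the hashes of its children, any change to any leaf ripples up to the root.

```python
import hashlib

def blob_hash(data: bytes) -> str:
    """Hash of a file's content."""
    return hashlib.sha1(b"blob:" + data).hexdigest()

def tree_hash(entries: dict) -> str:
    """Hash of a directory: computed over the sorted (name, child-hash)
    listing, so it changes whenever anything beneath it changes."""
    listing = "".join(f"{name} {h}\n" for name, h in sorted(entries.items()))
    return hashlib.sha1(b"tree:" + listing.encode()).hexdigest()

t1 = tree_hash({"a.txt": blob_hash(b"one"), "b.txt": blob_hash(b"two")})
t2 = tree_hash({"a.txt": blob_hash(b"one"), "b.txt": blob_hash(b"TWO")})
assert t1 != t2  # editing one leaf changes the root hash
```

So comparing two whole trees for equality is a single hash comparison, for free.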
By alaric, Mon 4th Aug 2008 @ 10:19 am
Another point I note is that working-copy filesystem integration seems to be a weak point of all version control systems. Particularly when it comes to file/directory renames, adds, and deletes:
http://article.gmane.org/gmane.comp.version-control.monotone.devel/3264
And, from a user-interface perspective, having to tell the VCS about renames by hand is awkward at times. It'd be interesting to make a VCS where your working copy wasn't "checked out" per se, but exported from a little daemon via NFS or the like so that the VCS had total knowledge of file operations. It could also provide a virtual directory structure for nice read-only access to past revisions - /vcs/history/200808041119/foo.txt versus /vcs/branches/master/foo.txt etc.
By alaric, Mon 4th Aug 2008 @ 10:21 am
Gosh, I'm commenting on my own post a lot.
I just noticed a parallel between the VCS issue of ignoring generated and temporary files (.cvsignore, svn:ignore, and so on) and my thoughts on separating different types of files for different backup/archive regimes here:
http://www.snell-pym.org.uk/archives/2008/07/11/backups-and-archives/
By David McBride, Mon 4th Aug 2008 @ 2:06 pm
Actually, I think the post you reference is talking about a different problem -- designing merging algorithms that Don't Do Bad Things. It's working on the basis that you already have different branches containing existing commits, and want to sensibly merge the changes.
What you're talking about is having (in SVN, for example) to explicitly tell the version-control system about file renames, deletions, and other changes that you've made in preparation for adding a brand-new commit to history.
And this is one of the places where content-addressable-based systems like git really shine -- you don't have to explicitly specify these actions at all! Because it's taking a snapshot of the entire working set, it's not necessary to explicitly tell it that you renamed file A to file B just so that it can maintain a complete history for particular chunks of code; instead, git (for example) calculates this information on-the-fly at history-inspection time.
Not only does this save an administrative burden, it can track history at the level of content rather than of files -- and is thus able to faithfully show content history in circumstances that simpler tools like SVN simply can't represent properly, such as a single file being split up into many smaller ones.
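As a very rough sketch of the kind of heuristic involved (illustrative only -- git's actual rename detection is considerably more sophisticated and tunable), you compare a deleted file's content against each added file's content and treat a sufficiently similar pair as a rename:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Content similarity in [0, 1], via Python's standard diff library."""
    return difflib.SequenceMatcher(None, a, b).ratio()

def detect_rename(old_text: str, added: dict, threshold: float = 0.5):
    """Find the added file most similar to a deleted file's content;
    report it as a rename if it clears the threshold."""
    best_name, best_score = None, 0.0
    for name, text in added.items():
        score = similarity(old_text, text)
        if score > best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score >= threshold else (None, best_score)

# A deleted util.py whose content reappears (slightly edited) as helpers.py:
deleted = "def helper():\n    return 1\n"
added = {"helpers.py": "def helper():\n    return 1\n# moved here\n",
         "other.py": "something entirely unrelated\n"}
name, score = detect_rename(deleted, added)
assert name == "helpers.py"
```

The point is that this runs at history-inspection time over snapshots, so nothing needs to be recorded when the commit is made.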
By alaric, Mon 4th Aug 2008 @ 2:33 pm
Ah, but the problems often revolve around renames and/or deletes, which are also the weak points in working copy integration.
How does it detect a file being renamed and modified, though? Without being explicitly told it's happened, there are only heuristic connections between the simultaneous deletion of one file and the creation of a new one.
I was wondering about ways of merging changes within a file (eg, reordering lines) and changes between files (eg, renames) in the same way; store the entire repository as something like a giant tar file using a diff algorithm that detects moved regions...
By Angie, Mon 4th Aug 2008 @ 10:48 pm
Yes Alaric, I dare you to write your own! You have the knowledge and ability. I know time is sometimes a constraint, but often the best work is done under pressure. OK, there is a chance that it will all be for nought, but a chance sometimes has to be taken. I know I don't work in your field, but I have had to step off the edge several times in my life to achieve my goals. Some people seem to think that taking a degree at nearly 60 is jumping over the edge, and with breast cancer as well! My brain is still 30 and can take a challenge, so I know that your 30 year old brain can. Once more, I dare you.
By Sarah, Tue 5th Aug 2008 @ 10:12 am
Al, I think you need to write a follow-up post to this one, as many people don't explore the comments.