Git Blame: November 2011

Wednesday, November 30, 2011

Buying a new Git feature

You are a manager of a technology company, and your engineers love Git in general, but Git is not a perfect fit for your organization. Perhaps some work-flow elements your people are used to are not supported nicely by today's Git. Perhaps some classes of assets you want to keep track of are not supported well by today's Git.

You are wealthy enough to pay for a developer or two to identify, design and implement necessary changes to Git, but you are not wealthy enough to fork Git and maintain such a change yourself forever while the upstream Open Source community continues to improve Git.

What can you do?

Of course, if the changes you initially develop are good enough, they may be merged to the upstream and then you do not have to worry about maintaining your fork yourself. But how would you ensure that the quality of your changes is good enough for upstreaming? Perhaps withhold the payment to your consultants until the changes hit the upstream?

I do not think this is necessarily limited to Git, but applies equally to any useful and active Open Source project.

Monday, November 28, 2011

Git 1.7.8-rc4 and upcoming cycle

This cycle is taking a bit longer than I had hoped but this should be the last rc before the final.

We had to roll in a fix to the UI of a new "revert/cherry-pick" feature added in this cycle, to avoid a costly migration in later releases. Originally, we introduced a "--reset" action to discard the in-progress state of a multi-commit revert/cherry-pick sequence, but it was argued that what the action actually did was "--quit", in the sense that it does not reset the state to some known state. Renaming it to "--quit" further opened the door to introducing "--abort", which does revert the state to where the entire revert/cherry-pick sequence started.

The next cycle will have many interesting topics that are already cooking at various stages of doneness, including:

local branch description (in addition to a good discipline of giving descriptive branch names) that can be used in various places including pull request messages, local merge messages and format-patch cover letters;
electronically signed pull requests by asking to pull a signed tag instead of a branch;
signed commit (possibly, if it is found useful, that is);
credential helper API to integrate with platform native keychain implementations;
progress eye-candy for fsck and repack;
"git add" of large contents will send blobs directly to a packfile;
side-by-side diff in gitweb; and
i18n of Git Porcelain messages.

Thursday, November 24, 2011

PGP Key-signing and CA fire-and-forget

Someone at work used to have a kernel.org account but recently needed to re-establish his presence in the web of trust by getting his PGP key signed, so we met and exchanged our key IDs and fingerprints to mutually sign our keys. I had earlier attended a key-signing party and my key has been signed by many other people, so it was a good place to bootstrap his key.

A PGP key has two parts: the public part that you give to others, and the private part that you keep to yourself. The easiest and most common way to distribute the public part of the key is to upload it to public keyservers, where other people can find and retrieve your key by specifying the key ID, your name or e-mail address.

When other people want to send a message to you and preserve the secrecy of the message, they only need to use the public part of your key to encrypt the message for you. PGP guarantees that the encrypted message can be decrypted and read only by whoever holds the corresponding private part of the key, unless a complex math problem that is believed to be practically unsolvable can somehow be solved (in other words, "the public-key cryptosystem gets broken"). When you want to prove that a message was written by you, you use the private part of your key to electronically sign the message and make the result public. Others can check the authenticity of the electronic signature by using only the public part of your key, and again PGP guarantees that the message couldn't have been signed by a person who does not have the private part of the key.

The public part of your PGP key records your name and e-mail address, among other things. It can and often does record more than one pair of name and e-mail (e.g. work address vs personal address). Anybody can generate a PGP key on his or her own and record any name and e-mail address in its public part. If you see a message signed with a PGP key whose public part records my name and address, unless you somehow know that it indeed is the key I created for me and whose private part I have, such a signature has no value. It may have been created by a random person impersonating me.

If you encrypt a message you want to show only to me, using a random PGP key that records my name and e-mail address to encrypt it would not guarantee that only I would be able to read it, unless you somehow know that the key belongs to me.

Hence, people need a way to validate the authenticity of public keys. People can add electronic signatures to the public part of a PGP key that belongs to another person, vouching that the signer knows the key belongs to the signee. This signature can be made per name and e-mail pair recorded in the public part of the key.

If you see a signature on an unknown public key, signed by public keys that you know belong to people you trust, you can be as sure as you trust these signers that the unknown public key belongs to the person it claims to belong to. This "web of trust" extends recursively and I heard that a recent study indicates that all people in the world are connected by 4.74 hops on average.

The only facts I learned when I met the other person for the purpose of key-signing are:

The person looked like his photo in our employee directory, and possessed a photo ID that matches his name;
The achievements by the person described in our employee directory matched what the person I was supposed to be meeting who worked in the Linux kernel project had done; and
The person claimed that a public key belonged to him, and gave me a way to retrieve the public part of this key.

It is not directly the above that I am vouching for by signing the public part of his key, however. I am vouching for the fact that I somehow know that the public key belongs to the person who is in control of the name and e-mail address pair recorded therein. That is not something I checked by meeting the person and chatting with him. I only checked the "name" part, but not the "e-mail address" part.

CA fire-and-forget is a clever scheme to solve this last bit of the problem. Instead of signing the public part of the key for all the name and e-mail pairs and uploading the result myself, I make N separate signatures on his public key, one for each pair of name and e-mail address recorded in it. Then I encrypt each of these N signatures with his public key and send it to the corresponding e-mail address. The recipient of these encrypted signatures then decrypts them and uploads the result to the public keyservers to complete the cycle.

If the e-mail address belonged to somebody else who does not have the corresponding private part, the encrypted signature would not reach the intended recipient, and the signature would not be decrypted to be uploaded to the public keyservers. I'll see my signature only if the person sitting behind the e-mail address has the private key that corresponds to the public part I have signed.

It is a clever scheme, even though it is a bit cumbersome to use, even with the use of dedicated tools (caff found in signing-party package on some distributions).

Friday, November 18, 2011

Git 1.7.7.4

Just another maintenance update, this time to fix minor build issues and fix a trivial corner case bug in the git name-rev --all command.

The upcoming feature release 1.7.8 is getting closer, too.

Thursday, November 17, 2011

Git 1.7.8-rc3 and being lenient to others while being strict to self

Hopefully the last release candidate before the real thing.

A big "Thanks" goes to Andrea Arcangeli for reporting an unpleasant regression, me for quickly fixing, and Michael Haggerty for reviewing the proposed fix.

The regression that will not be in the final release was that we broke

  $ git clone --reference=$local_repository $upstream

when the local repository we are borrowing objects from has signed or annotated tags. The cause of this regression is that a recent topic screwed up the implementation that tightens checks for branch and tag (collectively known as "refs") names. When we clone from $upstream while borrowing objects from a $local_repository, we tell the $upstream that objects that are in the $local_repository need not be sent to us, and we discover what objects $local_repository has by reading the output of

  $ git ls-remote $local_repository

and adding the result to the set of "extra refs". We internally keep track of all the "refs" that exist in our repository, and the code that registers the extra refs shares the same codepath as the one that finds the branches and tags by reading from .git/refs/{heads,tags} directories. The problem was that the add_ref() function in this shared codepath had a check to error out (not just warn) when it tries to register any "ref" whose name does not conform to the rule. Because the output from ls-remote includes, for a signed or an annotated tag, an extra entry that denotes the object (typically a commit) the tag points at, and because such an entry is marked by adding ^{} at the end of the name of the tag to make sure it will not collide with names of the real refs (that character sequence is invalid), the new check triggered and made the whole clone command fail.
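
How these marked entries differ from real refs can be illustrated with a small filter. This is only a sketch, not how Git itself processes the list: classify_refs is a hypothetical helper name, and the object names in the sample input are placeholders rather than real commits.

```shell
# Classify lines in the "<object name><TAB><refname>" format that
# "git ls-remote" prints.  An entry whose name ends with "^{}" is the
# peeled form of a signed or annotated tag, not a real ref.
classify_refs () {
    tab=$(printf '\t')
    while IFS=$tab read -r oid name
    do
        case "$name" in
        *'^{}')
            # Drop the three-character "^{}" marker to recover the tag name.
            printf 'peeled %s\n' "${name%???}" ;;
        *)
            printf 'ref %s\n' "$name" ;;
        esac
    done
}

# Sample input with placeholder object names (not real commits).
printf '%s\t%s\n' \
    1111111111111111111111111111111111111111 refs/heads/master \
    2222222222222222222222222222222222222222 refs/tags/v1.0 \
    3333333333333333333333333333333333333333 'refs/tags/v1.0^{}' |
classify_refs
```

Only the lines reported as "ref" carry names that could pass a refname check; the "peeled" ones exist purely to advertise an object.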

This episode shows two fundamental failures in the topic:

"extra refs" are not real refs, and they shouldn't even need names. The only reason they exist is to let our repository know the objects reachable from them do not need to be transferred into our repository when talking with the outside world. Perhaps we should even consider dropping the name parameter from the add_extra_ref() function (but after making sure the code would not make unwarranted assumptions. One such assumption was that they have names and their names must conform to the usual refname rules, which was fixed, but there may be others).
The other use of the add_ref() function is to register existing refs that we find in our repository. While we might not like the name of some of them (nobody stops a user or a tool from creating a randomly named file under .git/refs/{heads,tags} directories after all), it is wrong to error out of any operation when talking about what already exists in the repository; the damage is already done. Warning against them to help the user notice and correct is a different story.

The code should be lenient to what it receives and strict in what it produces.

For example, a colon is a forbidden character in a branch name, primarily because a branch with such a name, e.g. a:b, cannot be pushed out to another repository. But if you do not ever push such a branch out, it is not that unreasonable to expect the following to work, at least for some definition of working:

  $ H=$(git rev-parse HEAD~20) && echo $H >.git/refs/heads/inval:id

  $ git show inval:id

It may be OK for the second line to error out (we cannot do much about the manual echo doing damage to the repository), but where there is no ambiguity (there would be one if a ref called inval existed, because the above could then be a request to show a subdirectory called id in that commit), warning that inval:id is a wrong name but still letting the user do what s/he wanted to do would be a far nicer way to deal with a problem like this. After the above sequence, if the following fails only because the repository has a ref with an invalid name, it is even worse:

  $ git show master

and I would have to say it is close to inexcusable.
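
The strict half of that principle is easy to apply in scripts that create refs. As a minimal sketch (the helper name safe_branch_name is made up for this illustration), git check-ref-format can be asked whether a name follows the rules before anything is written, instead of echoing into .git/refs/heads/ by hand:

```shell
# Succeeds only when "$1" would be a well-formed branch name.
safe_branch_name () {
    git check-ref-format "refs/heads/$1"
}

# Try one acceptable name and the invalid one from the example above.
for name in topic/frotz inval:id
do
    if safe_branch_name "$name"
    then
        echo "ok: $name"
    else
        echo "refusing to create: $name"
    fi
done
```

The check is applied only to names the script is about to produce; refs that already exist in the repository are a separate matter, as argued above.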

Sunday, November 13, 2011

Git 1.7.8-rc2 and the road forward

The second release candidate, which is not much different from the first one, is out.

The reason why this is not tagged as the 1.7.8 final is because we want to make sure there is no regression since 1.7.7, and the reason why this is not so different from the first one is because no such regression has been found that needs fixing, which is a good thing.

I have been working on a handful of topics for the development cycle after 1.7.8, and these topics all share the same theme: giving better ways to users so that they can assure themselves that their patches that flow over the public channel are not tampered with, and also helping them communicate more clearly among themselves in general.

A new change originates from a contributor, who has a theme in mind to achieve a specific goal. There is a new feature in the branch command that allows a descriptive text to be added to a topic branch and this facility can be used to record and update that "theme/goal" when starting to work on the topic and while polishing it.
The contributor, after perfecting the topic, would request the resulting history to be pulled by the integrator. This pull-request traditionally gave only the list of commits on the topic and did not encourage the contributor to clearly describe what the topic was about. The branch description will be copied to the resulting message in the updated version of request-pull command.
The integrator will receive the pull request in e-mail, but typically PGP signed e-mails are hard to use. The updated version of request-pull command does not use PGP signing on pull-request e-mails, either, but the contributor can ask the integrator to pull a signed tag, instead of the tip of a branch, using the updated pull command.
When the integrator records the result of a pull request, traditionally the command did not open an editor to encourage the integrator to describe the merge. The updated version of merge does this when responding to a request to merge a signed tag, and shows the result of PGP verification of the tag in the comment to help the integrator.
In addition, the contributor can optionally add a PGP signature to individual commits with the updated commit command.
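
For the first item, released versions of Git keep the description in the branch.<name>.description configuration variable, which is what "git branch --edit-description" edits. A rough sketch in a throwaway repository follows; the branch name and the description text are invented for the example.

```shell
# Work in a throwaway repository so that nothing real is touched.
repo=$(mktemp -d)
git init -q "$repo"

# "git branch --edit-description" would open an editor; a script can
# store the text in the same configuration variable directly.
git -C "$repo" config branch.my-topic.description \
    "Teach the frotz command to nitfol"

# Read the description back.
git -C "$repo" config branch.my-topic.description
```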

So far, things are looking reasonably good for these topics.

Wednesday, November 9, 2011

Sleepless

Somehow I couldn't sleep (no, I am not an insomniac) and ended up rising way too early at 4:30, which is too late to go back to sleep.

Which turned out to be a rather productive quality 2-hour Git-morning. A few patches sent, and a few reviews made.

I may not be an insomniac, but I sometimes wonder if I am a bit of a workaholic.

Tuesday, November 8, 2011

Helping the kernel workflow redux

[edit: this is now used in the wild]

The goal is still to give the kernel developers and its users a better way to validate the authenticity of changes that eventually land on Linus's tree.

The "signed commit" mechanism discussed in a previous post may be useful in some workflows, but not necessarily so in an environment where you would push a commit out, and then decide that the commit is worth including in the upstream history after a long while. If you forgot to sign the commit when you pushed it out but otherwise the commit is in good shape, it feels a bit dirty that you would have to either amend it or cap it with a signed empty commit.

The latest round after a lengthy discussion across three mailing lists is to allow the integrator to run "git pull" against a signed tag, e.g.

  $ git pull git://.../rusty.git/ rusty-for-linus

When 'rusty-for-linus' is a tag, the above syntax does not work with the current git (and it won't change in the upcoming 1.7.8 as we are deep in the pre-release feature freeze period), but you can instead say 'tags/rusty-for-linus' to do the same thing.

When recording the merge result of such a pull that names a tag, Git will open an editor and ask the integrator to give a merge commit message. So far, 'git merge' never asked for a commit log message to be edited, and histories of many projects, especially when the 'merge.log' configuration variable is not enabled, are littered with one-liner messages, such as "Merge from origin", that do not tell anything useful - why was this merge made, what changes were brought in, etc. That is going to change as well, as a side effect of this topic.

The integrator will see the following in the editor when recording such a merge:

The one-liner merge title (e.g. 'Merge tag rusty-for-linus of git://.../rusty.git/');
The message in the tag object (either annotated or signed). This is where the contributor tells the integrator what the purpose of the work contained in the history is, and helps the integrator describe the merge better;
The output of GPG verification of the signed tag object being merged. This is primarily to help the integrator validate the tag before he or she concludes the pull by making a commit, and is prefixed by '#', so that it will be stripped away when the message is actually recorded; and
The usual "merge summary log", if 'merge.log' is enabled.

The contents of the signed tag are also recorded in the header field of the resulting commit object, so that anybody can later retrieve them from the history and validate the signature. The signed tag that was pulled is not stored in the integrator's repository, nor pushed out to the integrator's publishing point.

The primary reason the new mechanism records this information inside the commit instead of leaving the tag around is for convenience. Recent kernel history contains about 400 merges by Linus within 3 months (4 to 5 pulls per day), and that counts only the pulls by Linus. To make the whole merge fabric more trustworthy, the integration made by his lieutenants by pulling from their sub-lieutenants needs to be made verifiable the same way, which would (1) make the number of signed tags even larger and (2) make it more likely that somebody in the food chain gets lazy and refuses to push out the signed tags after he or she used them for their own verification.

Git 1.7.7.3

Yet another minor update.

Arguably, the most important fix since 1.7.7.2 is that this one actually identifies itself as 1.7.7.3 (1.7.7.2 release still called itself 1.7.7.1 by mistake).

Monday, November 7, 2011

Git 1.7.8-rc1

The first release candidate for the upcoming release is out. Because there won't be any more new features merged until the 1.7.8 final, it is a good time for the coolest kids on the block to start using the upcoming release before others do.

The release tarballs are found at:

http://code.google.com/p/git-core/downloads/list

and their SHA-1 checksums are:

f35e5c4410b21710434cb591f4c89843e75bb793 git-1.7.8.rc1.tar.gz
72e27cd397f5ae7b3c9d8bb030a76d7c99cdbb50 git-htmldocs-1.7.8.rc1.tar.gz
95429858e879df3f9425cf1279e03cdec7832379 git-manpages-1.7.8.rc1.tar.gz
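
A downloaded tarball can be checked against the list above along these lines. This is a sketch that assumes the sha1sum utility from GNU coreutils is available (other systems may offer shasum or openssl instead); verify_sha1 is a hypothetical helper name.

```shell
# Succeeds when the SHA-1 of file "$1" equals the expected checksum "$2".
verify_sha1 () {
    actual=$(sha1sum "$1" | cut -d' ' -f1)
    [ "$actual" = "$2" ]
}

# Report on the tarball only if it has been downloaded to this directory.
tarball=git-1.7.8.rc1.tar.gz
if test -f "$tarball"
then
    if verify_sha1 "$tarball" f35e5c4410b21710434cb591f4c89843e75bb793
    then
        echo "$tarball: OK"
    else
        echo "$tarball: checksum mismatch"
    fi
fi
```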

Also the following public repositories all have a copy of the v1.7.8.rc1 tag and the master branch that the tag points at:

url = git://repo.or.cz/alt-git.git
url = https://code.google.com/p/git-core/
url = git://git.sourceforge.jp/gitroot/git-core/git.git
url = git://git-core.git.sourceforge.net/gitroot/git-core/git-core
url = https://github.com/gitster/git

Tuesday, November 1, 2011

Git 1.7.7.2

This is just the result of applying fixes that are already applied to the 'master' branch for upcoming 1.7.8 release. Nothing earth-shattering, which is the whole point of the maintenance series ;-).

Helping the kernel workflow

[edit: there is an update here]

As many people may have already heard, the kernel developers would want to have a better way to validate the authenticity of changes that eventually go into Linus's tree. An e-mailed pull request asking Linus to pull from a public repository has three weak points:

The sender of e-mails can easily be spoofed;
Traditionally, a pull-request generated by tools states what commit of Linus the new work is based on and which branch of what repository needs to be pulled to receive it, but it does not even say what commit Linus should expect to see at the tip of the history; and
A pull-request could specify a random Git hosting site that gives out repositories to anybody. Unless the security of the site is trustworthy and Linus knows that the developer who asks him to pull uses that repository, a pull from such a location is suspect.

A typical reaction to the first point is "Use signed e-mail", and while it is a technically valid statement, in practice GPG e-mails are a pain to use for some people (including Linus).

The second point is rectified in the development version of Git, namely by commit cf73166 (request-pull: state what commit to expect, 2011-09-16), which is still cooking in the next branch. I expect this feature will be in the release after the upcoming 1.7.8.
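
What the recipient gains from this can be sketched with git ls-remote: given the object name stated in the pull-request, check that the published branch really points at it before pulling. The helper below is an illustration only; tip_matches is a hypothetical name, not a Git command, and nothing here verifies who made the commit.

```shell
# Succeeds when ref "$2" in repository "$1" points at object name "$3".
# Usage: tip_matches <repository> <full refname> <expected object name>
tip_matches () {
    tm_actual=$(git ls-remote "$1" "$2" | cut -f1)
    # An empty or multi-line answer never equals a single object name.
    [ -n "$tm_actual" ] && [ "$tm_actual" = "$3" ]
}
```

A mismatch means either that the branch has moved since the request was written or that the request points somewhere unexpected; in both cases it is worth asking before merging.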

The third point is currently addressed by Linus demanding that his lieutenants send pull-requests for repositories on trusted hosting sites, including (updated) kernel.org.

I have been working on this issue for the past months, toying with a few alternative designs. My current thinking is to teach "git commit" an option to embed GPG signature in the commit object (already implemented and cooking in the next branch, expected to be in the release after the upcoming 1.7.8), add "the tip commit to expect has this object name" in the pull-request e-mail (mentioned earlier), and teach "git fetch" to verify the GPG signature of the tip commit. A typical lieutenant-to-Linus communication would probably look like this:

(Lieutenant)

Do his/her work normally.
When finishing up the work in his/her tree before the final testing s/he usually does before sending out a pull-request, "git commit [--amend] --gpg-sign" the tip of the history.
Push out the history to be pulled.
Run "git request-pull" to generate the pull-request message, that states what the tip commit should be, and send it to Linus.

(Linus)

Read the pull-request.
Run "git pull" from the requested repository, which fetches the history, verifies that the tip commit matches what was in pull-request, and verifies that the commit is signed by the developer.

(Others)

Fetch from Linus. If they are inclined to independently validate what Linus pulled in, they can run "git log --show-signature" to check that the tips of the histories Linus merged are indeed signed.

This does not require signed pull-requests (a spoofed pull-request may cause Linus to fetch and merge, but the commit to be merged wouldn't be signed correctly so no real harm other than a bit of wasted time is done), and also the repository does not have to be hosted on a trusted site.