Thursday, November 17, 2011

Git 1.7.8-rc3 and being lenient to others while being strict to self

Hopefully the last release candidate before the real thing.

A big "Thanks" goes to Andrea Arcangeli for reporting an unpleasant regression, me for quickly fixing, and Michael Haggerty for reviewing the proposed fix.


The regression that will not be in the final release was that we broke

  $ git clone --reference=$local_repository $upstream


when the local repository we are borrowing objects from has signed or annotated tags, and the cause of this regression is that a recent topic screwed up implementation that tightens checks for branch and tag (collectively known as "refs") names. When we clone from $upstream while borrowing objects from a $local_repository, we tell the $upstream that objects that are in the $local_repository need not be sent to us, and we discover what objects $local_repository has by reading the output of


  $ git ls-remote $local_repository


and adding the result to the set of "extra refs". We internally keeps track of all the "refs" that exist in our repository, and the code that registers the extra refs share the same codepath as the one that finds the branches and tags by reading from .git/refs/{heads,tags} directories. The problem was that the add_ref() function in this shared codepath had a check to error not (not just warn) when it tries to register any "ref" whose name does not conform to the rule. Because an entry for a signed or an annotated tag in the output from ls-remote denotes the object (typically a commit) the tag points at, and because such an entry is marked by adding ^{} at the end of the name of the tag to make sure it will not collide with names of the real refs (that character sequence is invalid), the new check triggered and made the whole clone command fail.


This episode shows two fundamental failures in the topic:
  • "extra refs" are not real refs, and they shouldn't even need names. The only reason they exist is to let our repository know the objects reachable from them do not need to be transfered into our repository when talking with the outside world. Perhaps we should even consider dropping the name parameter from add_extra_ref() function (but after making sure the code would not make unwarranted assumptions. One such assumption was that they have names and their name must conform to the usual refname rules, which was fixed, but there may be others).
  • The other use of add_ref() function is used to register existing refs that we find in our repository. While we might not like the name of some of them (nobody stops a user or a tool from creating a randomly named file under .git/refs/{heads,tags} directories after all), it is wrong to error out any operation when talking about what already exists in the repository; the damage is already done. Warning against them to help the user notice and correct is a different story.
The code should be lenient to what it receives and strict in what it produces.


For example, a colon is a forbidden character in a branch name, primarily because a branch with such a name, e.g. a:b, cannot be pushed out to another repository. But if you do not ever push such a branch out, it is not that unreasonable to expect that the following to work, at least for some definition of working:


  $ H=$(git rev-parse HEAD~20) && echo $H >.git/refs/heads/inval:id
  $ git show inval:id


It may be OK for the second line to error out (we cannot do much about the manual echo doing damage to the repository), but where there is no ambiguity (i.e. if there is a ref that is called inval, the above could be a request to show a subdirectory called id in that commit), warning that inval:id is a wrong name but still letting the user what s/he wanted to do would be a far nicer way to deal with a problem like this. After the above sequence, if the following fails only because the repository has a ref with an invalid name, it is even worse:


  $ git show master

and I would have to say it is close to inexcusable.

No comments: