Sunday, February 8, 2015

Fun with "git diff -B -M"

Git lets you view a change that renames an original file A to a new location B while doing some minor edits to its contents as a "rename" patch (i.e. "rename A to B, with the following content differences"). You can even view a change that renames two original files, A and B, by swapping their contents and optionally doing some minor edits to them as a patchset that contains two "rename" patches ("rename A to B" and "rename B to A"). These were invented by me early in Git's life, when Linus was still running the project, back in mid 2005. More recently, other tools (including GNU patch) started understanding patches that use these features.

I however recently noticed a few corner cases that git diff and friends produce a wrong patchset, or git apply fails to apply correctly constructed patches, and I have been thinking about the right fixes to these issues. This article will illustrate these tricky cases and describe my current thinking.

In this write-up, I'll use these terms:

patchset::
Output from a single git diff invocation, which may contain one or more patch.

patch::
A part of a patchset from a header line that begins with "diff --git" up to (but excluding) the next such header line.

tree::
git diff compares two collections of files; each collection is a tree. Left and right sides of the comparison are called old tree and new tree, respectively. The tree to which we attempt to apply a patchset is the target tree. The tree we would get after a successful application of a patchset is the resulting tree. A tree does not have to be a tree object—we may be comparing the index and the files in the working tree, for example.

preimage::
postimage::
The preimage consists of lines in a patch that are prefixed by "-" (minus) or " " (space) but not "+" (plus) that denote what the patched file ought to have for the patch to apply. The postimage consists of lines that are prefixed by "+" (plus) or " " (space) that denote what the patched result ought to look like.


1. Basics

First the basics. Let's think about a patchset with a single patch. What does this patchset tell us?

    diff --git a/major-08.txt b/major-08.txt
    index 680c5f6..5de90cb 100644
    --- a/major-08.txt
    +++ b/major-08.txt
    @@ -1,3 +1,3 @@
    -8. Fortitude.
    +8. Strength.

     This is one of the cardinal virtues, of which I shall speak later.

It obviously tells us that the new tree changed "Fortitude", that used to be in the old tree, to "Strength", but it actually tells us a bit more about the old tree. For this patchset to apply, the target tree must have a file "major-08.txt" that begins with lines we see as the preimage in the patch.


2. Renaming a file

Now let's get a bit fancier and study a patchset with a rename patch. What does this patchset tell us?

    diff --git rws/major-08.txt marseille/major-11.txt
    similarity index 97%
    rename from major-08.txt
    rename to major-11.txt
    index 680c5f6..2ab22a0 100644
    --- rws/major-08.txt
    +++ marseille/major-11.txt
    @@ -1,3 +1,3 @@
    -8. Fortitude.
    +11. Fortitude.

     This is one of the cardinal virtues, of which I shall speak later.

We can see that this is going from the same old tree as the previous one's old tree, renames major-08 to major-11 with slight modification.

It tells us more about the trees, compared to the previous example. For this patchset to apply, the target tree must satisfy the same pre-conditon as the previous one about major-08, and in addition it must lack major-11; otherwise we wouldn't be renaming a new file to it.

So far, things are straight-forward.

In summary:

Rule 1.
A patch from file A to file A requires that file A exists in the target tree with contents that match the preimage of the patch.

Rule 2.
A patch renaming file A to file B requires that file A exists the target tree with contents that match the preimage of the patch. It also requires that file B must not exist in the target tree.

Rule 3.
A patch that creates file A requires that file A does not exist in the target tree.

Rule 4.
A patch that deletes file A requires that file A exists in the target tree with contents that match the preimage of the patch.

The latter two I didn't illustrate with examples, but they should be obvious. Also, we can think of Rule 2 (rename) as a natural extension of Rule 3 (creation) and part of Rule 4 (deletion). When you rename file A to file B, optionally with some content changes, you are:

  • creating file B, so the target tree must not have file B already.
  • deleting file A, so the target tree must have file A with the content that matches (part of) it.
Similarly, a patch that creates file B by copying file A, optionally with some content changes, you are creating file B, so the target tree must not have file B already. Also, the target tree must have file A with the contents that match the preimage of the patch.



3. First twist: cross renaming

Now, here is the first twist. What does this patchset mean?

    diff --git rws/major-11.txt marseille/major-08.txt
    similarity index 99%
    rename from major-11.txt
    rename to major-08.txt
    index 517d9f8..44e8d3a 100644
    --- rws/major-11.txt
    +++ marseille/major-08.txt
    @@ -1,3 +1,3 @@
    -11. Justice.
    +8. La Justice

     That the Tarot, though it is of all reasonable antiquity, is not of
    diff --git rws/major-08.txt marseille/major-11.txt
    similarity index 97%
    rename from major-08.txt
    rename to major-11.txt
    index 5de90cb..a101d5f 100644
    --- rws/major-08.txt
    +++ marseille/major-11.txt
    @@ -1,3 +1,3 @@
    -8. Strength.
    +11. La Force

     This is one of the cardinal virtues, of which I shall speak later.

This is a "swap" patchset, that swaps major-08 and major-11 with small edit. You would have done something like this to prepare such a change, starting from an old tree with two files with substantially different contents, both of which are of meaningful sizes:

    $ mv major-11.txt tmp
    $ mv major-08.txt major-11.txt
    $ mv tmp major-11.txt
    $ edit major-08.txt major-11.txt ;# just a bit
    $ git commit -m swap major-08.txt major-11.txt
    $ git diff -B -M HEAD^

A patch renaming major-11 to major-08 (i.e. the first one in this two-patch patchset) still requires that major-11 must exist in the target tree for the patchset to apply, which is the first half of Rule 2.

But the other half of Rule 2 is not satisfied. The target of the rename, major-08, has to exist in the target tree; otherwise we cannot rename it to major-11 in the second patch in the patchset. The rule needs a bit of revising, perhaps like this:

Rule 2.
A patch renaming file A to file B requires that file A exists with contents that match its preimage. And file B must not exist in the target tree, unless another patch in the patchset renames file B to some other file (possibly but not necessarily file A).

Of course, for such an "other patch" to be able to rename file B to somewhere else, the target tree is required to have file B.

It is important to have that "unless" part in the revised Rule 2. We need to make sure that we do not allow the sample patchset in "2. Renaming a file" to overwrite an existing file major-11 in the target tree blindly.

4. Second twist: rewriting and copying

The previous one showed how git diff -B -M can be used to detect cross renaming files and apply the resulting patchset (you can circularly rename more than two, i.e. A -> tmp, B -> A, ..., Z -> Y, tmp -> Z). It can also detect when you did this:

    $ cp major-08.txt major-11.txt
    $ edit major-08.txt ;# extensively
    $ git add major-08.txt major-11.txt
    $ git commit -m 'create 11 out of 08, rewrite 08'
    $ git diff -B -M HEAD^

And you would see:

    diff --git a/major-08.txt b/major-08.txt
    dissimilarity index 99%
    index 5de90cb..44e8d3a 100644
    --- a/major-08.txt
    +++ b/major-08.txt
    @@ -1,10 +1,31 @@
    -8. Strength.
    -
    -This is one of the cardinal virtues, of which I shall speak later.
    -...
    -the principle of all force.
    +8. La Justice
    +
    +That the Tarot, though it is of all reasonable antiquity, is not of
    +...
    +via prudentiæ.
    diff --git a/major-08.txt b/major-11.txt
    similarity index 97%
    copy from major-08.txt
    copy to major-11.txt
    index 5de90cb..a101d5f 100644
    --- a/major-08.txt
    +++ b/major-11.txt
    @@ -1,3 +1,3 @@
    -8. Strength.
    +11. La Force

     This is one of the cardinal virtues, of which I shall speak later.

The first patch in the patchset is "a patch from file A to file A", even though it is an extensive rewrite. The target tree is required to have major-08 whose contents match the preimage of the patch (Rule 1). The second patch copies from major-08 to create a new file major-11. The target tree is required to lack major-11 (Rule 3; copying into A is creation of A). It also must have major-08 that begins with the preimage of the patch.

Another thing to note is that an application of a patchset in Git is not incremental. Even though the first patch in the patchset talks about extensively rewriting major-08, and the second patch talks about creating major-11 by copying major-08 and then making a minor edit to it, the latter patch is the difference between the major-08 in the old tree and the major-11 in the new tree. It is not the difference between these two files in the new tree, i.e. it is not "modify major-08 and then copy the result to major-11 and then edit". If you think about it, this is also consistent with the previous "cross renaming" section. The first patch in the patchset renames major-11 to major-08, and the second patch that renames major-08 to major-11 is not about remaing the file that originally was major-11 that the first patch renamed back to its original position. The two patches are not applied incrementally (or sequentially).

So far, all the examples shown above will work correctly with today's Git (some reimplementations of Git may lack support, but at least the one I maintain does work correctly). When you use the old tree as the target tree, git apply accepts the patchset and recreates the new tree correctly.

But if you use the new tree of this example as the target tree and try to use git apply -R to apply the patchset in reverse, it does not work correctly. It is a bug.

Currently git apply -R does a nonsense for a copying patch. To reverse any patch, it just swaps the preimage and the postimage, and then swaps the names of the files in the old tree and in the new tree.

But the reverse of "create major-11 by copying major-08 into it and then change Strength to La Force" (which is the second patch in the patchset in this section) is not "create major-08 by copying major-11 into it and then change La Force to Strength", which you would get by simply swapping the preimage and the postimage and swapping the names of the files in the second patch.

What should we do to "reverse" a patchset that has copies?

Reverse of "create major-11 by copying major-08" should at least be "remove major-11", and preferably accompanied by "while making sure that major-11 matches the postimage of the patch".

The "preferably" part is a moderately strong preference. When the copying was done without any modification, we would not have any preimage or postimage to enable us to check that the target tree of the reverse application is similar enough to the new tree the patchset was taken from. Instead, we would end up just checking "major-11 exists" and then removing it happily, even if the contents of the file major-11 is vastly different from that of the new tree the patchset was taken from, which feels somewhat unsafe.

Admittedly, the same "it feels unsafe" factor exists when applying a bog-standard pure rename patch (imagine that the example in "2. Renaming a file" was done without editing the first line and kept the original "8. Fortutide." without renumbering it. We would not have any preimage we can use to make sure we are patching the correct file).

But as long as we have patch text that we can use for sanity checking, we should use it, I would think.


5. Third twist: rewriting by copying

If you started from two vastly different files, both of which have contents of meaningful size, and did this:

    $ cp major-08.txt major-11.txt
    $ edit major-11.txt
    $ rm major-08.txt
    $ git commit -m 'rewrite 11 by copying 08' major-08.txt major-11.txt
    $ git diff -B -M HEAD^

You would see this patchset:

    diff --git a/major-08.txt b/major-11.txt
    similarity index 97%
    rename from major-08.txt
    rename to major-11.txt
    index 5de90cb..a101d5f 100644
    --- a/major-08.txt
    +++ b/major-11.txt
    @@ -1,3 +1,3 @@
    -8. Strength.
    +11. La Force

     This is one of the cardinal virtues, of which I shall speak later.

This is another bug. I sent out a warning to both the Git and the Linux kernel mailing list, not to use the "-B -M" options together for this reason.

The revised Rule 2. from "3. First twist" tells us that major-08 must exist in the target tree, which is OK, but also major-11, the target of the rename, must not exist. This makes the resulting patchset unapplicable to the old tree the patchset was taken from, which simply does not make sense.

If you take a diff between states X and Y, you should be able to apply that diff to the state X and the resulting state should be identical to the state Y, and you should be able to apply that diff in reverse to state Y to go back to the state X.

Worse, the reverse of this patchset would apply to the new tree without an error, but does not reproduce the old tree correctly, which is a more serious bug. It instead applies the patch in reverse and recreates the original major-08, but the other file, major-11, is lost.

The patchset does not have enough information for us to recreate its original contents of major-11 we had in the old tree. The patchset says that the contents of major-11 in the new tree came from the contents of major-08 in the old tree, and the major-11 in the new tree does not have any resemblance to major-11 in the old tree. That is not incorrect per-se, but that means that we cannot apply this patchset in reverse.

One possible way to fix this is to include another patch in the same patchset that shows the deletion of major-11. Rule 2. would be further revised to something like:

Rule 2 (revised again).
A patch renaming file A to file B requires that file A exists in the target tree with contents that match the preimage. It also requires that file B does not exist in the target tree, unless another patch in the patchset renames file B to some other file (possibly but not necessarily file A) or removes file B.

Again, that "other patch" in the patchset either renames or removes file B, so that requires that the target tree to have file B with contents that match the preimage of that patch.

More generally, the revised Rule 2. can be split into two parts; the former becomes an extension to Rule 4, and the latter becomes an extension to Rule 3.




  • A patch that causes a file A to disappear (i.e. removing file A, or renaming file A to file B) requires that the target tree to have file A, with contents that match the preimage of the patch.
  • A patch that causes a file B to appear (i.e. creating file B, or renaming/copying file A to file B) requires the target tree to lack file B, unless another patch in the patchset makes file B disappear (i.e. removing file B or renaming file B to something else).
In any case, a fixed patchset would look like this:

    diff --git a/major-08.txt b/major-11.txt
    similarity index 97%
    rename from major-08.txt
    rename to major-11.txt
    index 5de90cb..a101d5f 100644
    --- a/major-08.txt
    +++ b/major-11.txt
    @@ -1,3 +1,3 @@
    -8. Strength.
    +11. La Force

     This is one of the cardinal virtues, of which I shall speak later.
    diff --git a/major-11.txt b/major-11.txt
    deleted file mode 100644
    index 517d9f8..0000000
    --- a/major-11.txt
    +++ /dev/null
    @@ -1,31 +0,0 @@
    -11. Justice.
    -
    -That the Tarot, though it is of all reasonable antiquity, is not of
    -...
    -via prudentiæ.

And these patches, under the re-revised rules, would apply cleanly to the old tree.

What about the reverse application? It would be a patchset that creates major-11 from nothingness (which is the reversal of a "deletion" patch), and creates major-08 by renaming major-11 and editing. Is the Rule 2. re-revised above sufficient?

The new tree (which is the target of the reverse application) only has major-11 and not major-08, so this rename should go through. The reverse of the deletion of major-11 is a creation of it with the contents fully given as the preimage of the (original) patch before reversing it, so that should also be OK with Rule 3 that is revised in a similar way with that "unless" thing. That is, creating major-11 requires that the old tree does not have major-11, but if another patch in the same patchset renames major-11 away or deletes it, then it is OK for a patch to create major-11. And the reversal of the first patch does rename major-11 to major-08, so all is well.

One disturbing thing about the above plan is that we have this comment at the end of diffcore-rename.c:

         * We would output this delete record if:
         *
         * (1) this is a broken delete and the counterpart
         *     broken create remains in the output; or
         * (2) this is not a broken delete, and rename_dst
         *     does not have a rename/copy to move p->one->path
         *     out of existence.
         *
         * Otherwise, the counterpart broken create
         * has been turned into a rename-edit; or
         * delete did not have a matching create to
         * begin with.

That is, we have an explicit logic to omit the missing "delete major-11" patch from the patchset. This comes from the very first commit that introduced "diff -B" (f345b0a0 (Add -B flag to diff-* brothers., 2005-05-30); it is plausible that the above comment came from lack of thinking in the original and not something we did to fix some bugs (if it were the latter, by showing the deletion in the case under discussion to "fix" the patchset in this example would end up breaking the original "fix").

So I would think that the right way to fix this is to stop filtering out the deletion half of the broken pair, even when the other creation-half of the pair no longer is in the output.