Thursday, October 15, 2015

Fun with recreating an evil merge

Sometimes we wish there were good ways to recreate a complex merge, replaying a previously resolved conflict resolution, and reapplying a previously done evil merge, of a side branch to an updated mainline.

For example, you have a side-branch that consists of two commits, A and B, and you create a merge with the mainline that ends with X, like so:

           A---B
          /     \
  ---o---O---X---M

resulting in a new merge commit M. When you created this merge, it could be that changes A and B overlapped (either textually or semantically) with what was done on the mainline since the side branch forked, i.e. what was done by X. Such an overlap, if it is textual, would result in a merge conflict. Perhaps X added a new line at the same place A and/or B added a different line, resulting in something like:

...
original line 1
<<<<<<< HEAD
line added by X
||||||| O (common ancestor)
=======
line added by A
>>>>>>> B
original line 2
...

which you may resolve, when recording M, to:

...
original line 1
line added by A
line added by X
original line 2
...

Expressed in "git show --cc" format, such a merge result would appear this way:

  ...
  original line 1
 +line added by A
+ line added by X
  original line 2
  ...

A line with two leading spaces are common lines that both branches agree with, a line with plus at the first column is from the mainline and a line with plus at the second column is from the side branch.

If the overlap were not just textual but semantic, you may have to further update parts of files that did not textually conflict. For example, X may have renamed an existing function F to newF, while A or B added new callsites of F. Such a change is not likely to overlap textually, but in the merge result M, you would need to change the new calls you added to F to instead call newF. Such a change may look like this:

  ...
  original line 1
 +line added by A
+ line added by X
  original line 2
  ...
 -a new call to F() added by A
++a new call to newF() added by A
  ...

A line with minus at the second column is what was only in the side branch but that does not appear in the result (i.e. the side branch added the line, but the result does not have it). A line with two pluses at the beginning is what appears in the result but does not exist in either branch.

A merge that introduces such a line that did not exist in either branch is called an evil merge. It is something that no automated textual merge algorithm would have produced.

Now, while you were working on producing the merge M, the mainline may have progressed and gained a new commit Y. You would like to somehow take advantage of what you have already done when you created M to merge your side branch to the updated mainline to produce N:

           .---A---B
          /         \
  ---o---O---X---Y---N

A good news is that, when the evil merge is in a file that also has textual conflicts to resolve, "git rerere" will automatically take care of this situation. All you need to do is to set the configuration rerere.enabled to true before attempting the merge between X and B and recording their merge M, and then attempt a new merge between B and Y. Without even having to type "git rerere", the mechanism is invoked by "git merge" to replay the recorded resolution (which is where the name of the machinery "rerere" comes from). A bad news is that when an evil merge has to be made to a file that is not involved in any textual conflict (i.e. imagine the case where we didn't have "line added by A" vs "line added by X" conflict earlier in the same file in the above example), "rerere" does not even kick in. The question is what to do, knowing B, X, and M, to recreate N while keeping the adjustment needed for semantic conflicts to record M.

One naive approach would be to take a difference between X and M and apply it to Y. In the previous example, X would have looked like:

...
original line 1
line added by X
original line 2
...

and the difference between X and M would be (1) addition of "line added by A", (2) addition of "a new call to newF() added by A", and (3) any other change made by A and B that did not overlap with what X did. Implementation-wise, it is unlikely that we would do this as a "diff | patch" pipeline; most likely we would do it as a three-way merge, i.e.

$ git checkout Y^0
$ git merge-recursive X HEAD M

to compute the state we would have obtain by making the same move as going from X to M starting at Y, using the index and the working tree.

While that approach would work in simple case where Y does not do anything interesting, it would not work well in general. The most obvious case is when Y is actually a merge between X and A:

           .---A---B
          /     \   \
  ---o---O---X---Y---N

The difference between X and M would contain all that was done by A and B, in addition to what was done at M to adjust for textual and semantic conflicts. Replaying that on top of Y, which already contains what was done by A but not B, would end up duplicating what A did. At best, we will get a huge and uninteresting merge conflict. At worst, we will get the same code silently duplicated twice.

I think the right approach to recreate the (potentially evil) merge M is to consider M as two steps.
The first step is to merge X and B mechanically, and make a tree out of the mechanical merge result, with conflict markers and all. Call it T. The difference between T and M is what the person who made M did to adjust for textual and semantic conflicts.

           A---B
          /     \
  ---o---O---X---T-M

Then, you can think of the process of recreating N in a way similar to M was made as a similar two step process. The first step is to merge Y and B mechanically, and create a tree out of the mechanical merge result, and call it S. Applying the difference between T and M on top of S would give you the textual and semantic adjustments the same way "git rerere" replays the recorded resolution.

           .---A---B
          /    (\)  \
  ---o---O---X---Y---S-N

This should work better whether Y is a merge with A.

$ git checkout X^0
$ git merge --no-commit B
$ git add -u
$ T=$(git write-tree)
$ git reset --hard Y^0
$ git merge --no-commit B
$ git add -u
$ S=$(git commit-tree $(git write-tree) -p HEAD -m S)
$ git checkout $S
$ git merge-recursive $T HEAD M

would compute the result using the index and the working tree, so after eyeballing the result and making sure it makes sense, the above can be concluded with a

$ git commit --amend

Of course, this article is only about outlining the idea. If this proves to be a viable approach, it would make sense to do these procedures inside "rebase --first-parent" or something.








2 comments:

E. R. D'Avila said...

Some time ago I also wanted take advantage of a previous merge to produce a new one. For that, I used a different approach.
For the case presented on this post, my approach would merge Y and M and record the result as having Y and B as parents:

$ git checkout Y^0
$ git merge --no-commit M
$ N=$(git commit-tree $(git write-tree) -p Y -p B -m N)
$ git checkout $N

I've made some tests, and they produced the same tree SHA1 as the approach presented on this post.
My approach seems simpler, but I'm not sure if I'm missing something.
What do you think?

Quentin said...

Hi Junio

Your post recalled in itch we've had at my work where I'm the upstream integration guy. We needed to solve large conflicting merges which spanned domains and departments (and even repos), meaning many people looking at intermediate results. But we couldn't share intermediate conflict resolution state using familiar git tools.

We experimented with committing with conflict markers, and then solving the conflicts using gerrit. What I disliked about that was the abandoning of the git indexes (showing the conflict states), and the appearance of conflict markers in the history. We evolved to merging and solving conflicts on temporary branches, with a final step of replaying the initial merge using the temp branch contents.

The solution collapsed of its own weight (and IMHO due to a proper revisiting of the questionable practice of using large intermediate merges). Thanks to some evangelizing and some infrastructure development we now live in a continuous merge world closer to upstream.

My point is, I think your description of a two step consideration of the evil merge M (as T + M) could be formalized into a first class git process that would significantly enhance conflict resolution.

I don't have enough git internals knowledge to propose anything other than a rough outline with the hope it suggests the value I perceive:

git merge
git add
git conflict-commit ...
# a different kind of commit is made here,
# it can be shared and pushed to gerrit but
# when checked out recreates the state of the
# merge resolution from the conflict-commit
git conflict-checkout
git add
git commit

Perhaps this intersects with M. Haggerty's git-imerge, but this would add a formal step managing the overall final merge commit management.

Sorry if this isn't the right forum for such a long post...