Friday, May 24, 2013

Git 1.8.3 and even more leftover bits

The 1.8.3 release has finally been tagged and pushed out to the usual places. Also the release tarballs at kernel.org are back.

For a list of highlights, please see the previous post on -rc2; not much has changed since then.

During the last development cycle, including its pre-release feature freeze, a few more interesting topics were discussed, but at this moment there are no actual patches or design work for them.

[Previous list of "leftover bits" is here]
  • "git config", when removing the last variable in a section, leaves an empty section header behind. Anybody who wants to improve this needs to consider ramifications of leaving or removing comments.
    Cf. $gmane/219524
  • [STARTED AND THEN STALLED] Add "git pull --merge" option to explicitly override configured pull.rebase=true. Make "git pull" that does not say how to integrate fail when the result does not fast-forward, and advise the user to say --merge/--rebase explicitly or configure pull.rebase=[true|false]. An unconfigured pull.rebase and pull.rebase that is explicitly set to false would mean different things (the former will trigger the "fast-forward or die" check, the latter does the "pull = fetch + merge").
    Cf. $gmane/225326
  • Teach more commands that operate on branch names about "-" shorthand for "the branch we were previously on", like we did for "git merge -" sometime after we introduced "git checkout -".
    Cf. $gmane/230828
  • Proofread our documentation set, and update to reduce newbie confusion around "remote", "remote-tracking branch", use of the verb "to track", and "upstream".
    Cf. $gmane/230786
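As a reminder of what the existing "-" shorthand already does for "git checkout" and "git merge", here is a small sketch in a throw-away repository. The branch names (main-line, topic) and the identity are made up for illustration; nothing here shows the proposed extension to other commands.

```shell
# Throw-away repository; names and identity below are illustrative only.
repo=$(mktemp -d) && cd "$repo" &&
git init -q . &&
git config user.name "A U Thor" &&
git config user.email author@example.com &&
git commit -q --allow-empty -m "initial" &&
git checkout -q -b main-line &&
git checkout -q -b topic &&
git commit -q --allow-empty -m "work on topic" &&
git checkout -q main-line &&
# "-" names the branch we were on before the last checkout, i.e. "topic"
git merge -q - &&
merged=$(git log -1 --format=%s) &&
# ... and "git checkout -" takes us back to that branch
git checkout -q - &&
current=$(git symbolic-ref --short HEAD)
```

After this runs, $merged holds the subject of the tip commit of topic and $current is the name of the branch "git checkout -" switched back to.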

Monday, May 13, 2013

Git 1.8.3-rc2

The second and planned-to-be-the-final release candidate for the upcoming 1.8.3 release was tagged today. Also, the release tarballs at kernel.org are back ;-)

Hopefully we can have the final late next week, but we might end up doing another release candidate. Please help test rc2 early to make sure you can have a solid release.

There are numerous little fixes, new features and enhancements that cannot be covered in a single article, so I'll only highlight some selected big-picture changes here. For the full list of changes, please refer to the draft Release Notes.

Preparation for 2.0

A lot of work went into preparing the users for the 2.0 release that will come sometime next year, which promises large backward-incompatible UI changes:
  • "git push" that does not say what to push from the command line will not use the "matching" semantics in Git 2.0 (it will use "simple", which pushes your current branch to the branch of the same name only when you have forked from it previously; e.g. "git push origin" done while you are on your "topic" branch that was previously created with "git checkout -t -b topic origin/topic" will push your "topic" branch to origin).

    This default change will hurt old-timers who are used to the traditional "matching" (if you have branches A, B and C, and if the other side has branches A and C, then your branches A and C will update their branches A and C, respectively), and they can use the "push.default" configuration variable to keep the traditional behaviour. I.e.

    $ git config push.default matching

    Recent releases since 1.8.0 have been issuing warnings when you do not have any push.default configuration set, and this release continues to do so.

  • "git add -u" and "git add -A" that is run inside a subdirectory without any other argument to specify what to add will operate on the entire working tree in Git 2.0, so that the behaviour will be more consistent with "git commit -a" (e.g. "edit file && cd subdir && git commit -a" will commit the change to the file you just edited which is outside the directory you ran "git commit" in).

    Users can say "git add -u ." and "git add -A ." (the "dot" means "the current directory") to limit the operation to the subdirectory the command is run in with the traditional versions of Git (and this will stay the same in Git 2.0 or later), so there will be no configuration variable to change the default.

    The 1.8.3 and later releases do not change the behaviour until Git 2.0 and still limit these operations to the current subdirectory, but they do notice when you have changes outside your current subdirectory and warn, saying that if you were to type the same command to Git 2.0, you would be adding those other files to your index, and encourage you to learn to add that explicit "dot" if you mean to add changed or all files in the current subdirectory only.

  • "git add path" has traditionally been a no-op for removed files (e.g. "rm -f dir/file && git add dir" does not record the removal of dir/file to the index), but Git 2.0 will notice and record removals, too.

    The 1.8.3 and later releases do not change the behaviour until Git 2.0, but they do notice when you have removed files that match the path and warn, saying that if you were to type the same command to Git 2.0, you would be recording their removal, and encourage you to learn to use the --ignore-removal option if you mean to only add modified or new files to the index.
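The explicit spellings mentioned in the list above can be tried out in a scratch repository. This is only a sketch; the file and directory names (top.txt, dir, sub) and the identity are invented for the example.

```shell
# Scratch repository with made-up paths.
repo=$(mktemp -d) && cd "$repo" &&
git init -q . &&
git config user.name "A U Thor" &&
git config user.email author@example.com &&
mkdir dir sub &&
echo one >top.txt && echo two >dir/file && echo keep >dir/keep &&
echo three >sub/inner.txt &&
git add . && git commit -q -m "initial" &&

# Say how "git push" without arguments should behave, instead of
# relying on whatever the default of your Git version is.
git config push.default simple &&

# Change files inside and outside "sub", then update the index from
# inside "sub"; the explicit "." limits the operation to that directory.
echo changed >top.txt && echo changed >sub/inner.txt &&
echo changed >dir/keep &&
(cd sub && git add -u .) &&
staged=$(git diff --cached --name-only) &&

# Remove a tracked file, then add the directory while explicitly
# asking not to record the removal.
rm -f dir/file &&
git add --ignore-removal dir &&
removed=$(git diff --cached --name-only --diff-filter=D)
```

Here $staged lists only the path under sub, and $removed stays empty because the removal of dir/file was not recorded.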

Tightening of command line verification

There are quite a few UI fixes that tie up loose ends. Some commands assumed that the users were perfect and would never throw nonsense command line arguments at them, and some operations that need two parameters were happily carried out even when they got three parameters without diagnosing such a command line as an error (the excess one was simply ignored).

Many of them have been updated to detect and die on such errors.

Helping our friends at Emacs land

We expedited the update of the foreign SCM interface to bzr we have in the contrib/ area since 1.8.2, and included a version that is vastly modified from what we had before, with help from some Emacs folks. This code could be a bit rougher than the rest of Git that usually moves slowly and cautiously, but we decided that, given the circumstance, it is more important to push out some improved version early, in order to help our friends in Emacs land, who have been (reportedly) suffering from less than ideal response to the issues they are having with their SCM of choice.

A beginning of a better triangular workflow support

The recommended workflows to collaborate with others are either:
  • to have your own repository and push your work there while pulling from your upstream to keep up to date, or
  • to have a central repository where everybody pushes to and pulls from.
The latter was primarily to make those who are coming from centralized version control systems feel at ease, and the repository configuration mechanisms such as "remote.origin.url" variable were designed to help that workflow (there is one "origin" you pull from and push to). The former however is also important, and many people on Git hosting sites (e.g. GitHub) employ this workflow (you pull from one place and push to another place, but they are not the same).

A new configuration mechanism "remote.pushdefault" has been introduced to support such a triangular workflow. After you clone from somebody else's project, that upstream repository will still be your 'origin', but you can add the repository you regularly push to in order to publish your work (and presumably then you will throw a "pull request" at the upstream) as another remote, and set it to this configuration variable. E.g.

  $ git clone git://example.com/frotz.git frotz
  $ cd frotz
  $ git remote add publish ssh://myhost.com/myfrotz.git
  $ git config remote.pushdefault publish

After this, you can say "git push" and the push does not attempt to push to your origin (i.e. git://example.com/frotz.git) but to your publish remote (i.e. ssh://myhost.com/myfrotz.git) because of the last configuration.
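To see the effect without any network access, the same setup can be imitated with local repositories. This is a sketch under assumptions: the directory names (upstream, myfrotz.git, frotz) are stand-ins, and push.default is set to "simple" explicitly so that the outcome does not depend on the default of the Git version in use.

```shell
work=$(mktemp -d) && cd "$work" &&
# Stand-in for somebody else's project.
git init -q upstream &&
(
  cd upstream &&
  git config user.name "Up Stream" &&
  git config user.email upstream@example.com &&
  git commit -q --allow-empty -m "initial"
) &&
# Stand-in for the repository you publish your work to.
git init -q --bare myfrotz.git &&
git clone -q upstream frotz &&
cd frotz &&
git config user.name "A U Thor" &&
git config user.email author@example.com &&
git remote add publish "$work/myfrotz.git" &&
git config remote.pushdefault publish &&
git config push.default simple &&
git commit -q --allow-empty -m "my work" &&
# No remote on the command line: the push goes to "publish", not "origin".
git push -q &&
branch=$(git symbolic-ref --short HEAD) &&
mine=$(git rev-parse HEAD) &&
published=$(git --git-dir="$work/myfrotz.git" rev-parse "refs/heads/$branch") &&
upstream_tip=$(git --git-dir="$work/upstream/.git" rev-parse "refs/heads/$branch")
```

After the push, $published matches $mine while $upstream_tip still points at the original commit.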


Tuesday, April 2, 2013

Where do evil merges come from?


A canonical example of where an "evil merge" happens in real life (and is a good thing) is to adjust for semantic conflicts. It almost never has anything to do with textual conflicts.

Imagine you and your friend start with the same codebase where a function f() takes no parameter, and has two existing call-sites.

You decide to update the function to take a parameter, and adjust both existing call-sites to pass one argument to the function. Your friend in the meantime added a new call-site that calls f() still without an argument. Then you merge.

It is very likely that you won't see any textual conflict. Your friend added some code to a block of lines you did not touch while you two were forked. However, the end result is now wrong. Your updated f() expects one parameter, while the new call-site your friend added does not pass any argument to it.

In such a case, you would fix the new call-site your friend added to pass an appropriate argument and record that as part of your merge.

Consider the line that has that new call-site you just fixed. It certainly did not exist in your version (it came from your friend's code), but it is not exactly what your friend wrote, either (it did not pass any argument). That makes the merge an "evil merge".

With "git log -c/--cc", such a line will show with double-plus in the multi-way patch output to show that "this did not exist in either parent".
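The scenario can be reproduced in a scratch repository. The following is a sketch with made-up contents: a file code.c, branches named yours and friend, and a trivial f() are all assumptions for illustration.

```shell
repo=$(mktemp -d) && cd "$repo" &&
git init -q . &&
git config user.name "A U Thor" &&
git config user.email author@example.com &&
cat >code.c <<\EOF &&
int f(void) { return 0; }
int a(void) { return f(); }
int b(void) { return f(); }
/* filler 1 */
/* filler 2 */
/* filler 3 */
EOF
git add code.c && git commit -q -m "base" &&
git checkout -q -b yours &&
git checkout -q -b friend &&
# Your friend adds a new call-site that passes no argument.
echo 'int c(void) { return f(); }' >>code.c &&
git commit -q -a -m "friend: add a new call-site" &&
git checkout -q yours &&
# You make f() take a parameter and adjust the two existing call-sites.
sed -e 's/f(void)/f(int x)/' -e 's/return 0/return x/' \
    -e 's/f()/f(1)/' code.c >code.c+ &&
mv code.c+ code.c &&
git commit -q -a -m "you: f() now takes a parameter" &&
# The merge is textually clean, but the result would be wrong as-is ...
git merge -q --no-commit friend &&
# ... so fix the new call-site and record that as part of the merge.
sed -e 's/f()/f(1)/' code.c >code.c+ &&
mv code.c+ code.c &&
git commit -q -a -m "Merge friend, adjusting the new call-site" &&
evil=$(git show --cc HEAD | grep '^++int c')
```

The line captured in $evil is the fixed call-site, prefixed with the double-plus.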

Thursday, March 21, 2013

Measuring Project Activities (2)

Continuing from an earlier article, let's see how you can compute some interesting stats on your own projects.

How much change did a release have?

As I said earlier, you can measure the extent of change to your codebase in two ways: a quicker but less precise way, and a more involved but more accurate way.

A quicker way is to ask git diff --numstat to count the deleted and added lines between the release tags, and add them up yourself. If you care about whole-file renames, you can add the -M option to the git diff command:

addremove2 () {
  git diff --numstat "$@" | {
    total=0 &&
    while read add remove path
    do
      # binary files are reported as "-" for both counts; skip them
      case "$add" in -) continue ;; esac
      total=$(( $total + $add + $remove ))
    done &&
    echo "$total"
  }
}

And with that helper, the main function we introduced in the earlier article can do this to compute the modified2 number for the entire release cycle and per each day:

handle () {
  old="$1^0" new="$2^0"
  ...
  modified2=$(addremove2 "$old" "$new")
  mod2perday=$( echo 2k "$modified2" "$days" / p | dc )
}

How much real change did a release have?

Counting the number of added and removed lines using git diff --numstat is straightforward, but this tends to over-count changes. For example, when adding a new caller to existing code, you may have to move that existing code up in the same file (especially if it is a file-local static function) to make the callee come before the caller, or move it to a different, more "library-ish" file, while changing its visibility from static to extern. Both kinds of changes unfortunately appear as a bulk deletion of an existing block of lines and a bulk addition of the same contents elsewhere in the codebase.

In order to count the true amount of work that went into the new release, you would want to exclude such changes from your statistics.

This is where git blame can help. In the most basic form, it can trace each and every line of a file in the given commit back to its origin, i.e. which commit it came from. By default, it notices when the whole file gets renamed (e.g. the file hello.c you are running the command on in the current release may have been called goodbye.c in an earlier release), and employs no other fancy tricks, but you can tell it to notice code movement within a file (e.g. moving the callee up in the file) with the -M option, or code moves across files (e.g. moving a static function from a file that an existing caller lives in to a different "library-ish" file, to make it also visible to a new caller in another file) with the -C option. You can also tell it to ignore whitespace changes with the -w option like you can with git diff. For example:

  git blame -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c

will show you which commit each and every line in the fetch-pack.c file came from; its output may begin like this:

745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"


The first line is blamed to commit 745f7a8c, while the other lines are attributed to commit 8c7a786 (the leading caret ^ means it is attributed to a commit at the lower boundary of the range), which is the v1.8.0 release. Note that these old lines used to live in a different file builtin/fetch-pack.c in the older release, and would have been counted as additions if you used the approach based on git diff --numstat -M to count them, because there was no file renaming involved between these two releases.

Also notice that these lines may have been untouched since a commit that may be a lot older than v1.8.0, but we told the command to stop at v1.8.0 from the command line, so these are all attributed to that range boundary.

If you count the number of lines in the whole output from the above command, that will show the number of lines in the fetch-pack.c file at the v1.8.1 release. If you count the lines that do not begin with a caret, that counts the lines added in the new release.

added_to_file () {
  old="$1" new="$2" path="$3"
  git blame -M -C -w -s "$old".."$new" -- "$path" |
  grep -v '^^' |
  wc -l
}

This may be sufficient as a starting point, but we are not at all interested in checking each and every commit between the two releases (e.g. the commit 745f7a8c in the above example is not the v1.8.1 release and the only thing we care about is that the line is new in the new release; we do not care where in the development cycle leading to the release it was added), so it is a waste of computational cycles.

Fortunately, you can tell git blame to pretend as if the commit tagged as v1.8.1 release were a direct and sole child of the commit tagged as v1.8.0 release with the -S option. First, you prepare a graft file to describe the parent-child relationship.

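added_to_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
  # $old and $new must be full object names here (e.g. from git rev-parse)
  cat >"$graft" <<EOF
$new $old
$old
EOF
  # -S makes blame read the parent-child relationship from the graft file
  git blame -M -C -w -s -S "$graft" "$old".."$new" -- "$path" |
  ...
}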


The graft file lists each commit object and its parent. The above snippet says that the $new commit has a single parent, which is $old, and $old commit does not have any parent. This lets us lie to git blame that our history consists of only two commits, and one is a direct child of the other.

With this, we can tell how much new material was introduced to the given path in the new release, but what about the material removed from the old release? We can compute it in a similar way with a twist. You take a path in the old release, and pretend as if the old release were the direct child of the new release. We compute what we have added if we started from release v1.8.1 and development led to the contents of v1.8.0, like this:

removed_from_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame.$$.graft
  # the roles are swapped: pretend $old is the sole child of $new
  cat >"$graft" <<EOF
$old $new
$new
EOF
  git blame -M -C -w -s -S "$graft" "$new".."$old" -- "$path" |
  grep -v '^^' |
  wc -l
}

By tying these two helper functions with a list of paths that existed in the two releases, you can compute the amount of real changes made to reach the new release, but this article is getting a bit too long, so I'll leave it to another installment. We will use the added_to_file helper to construct added_to_commit function like this:

added_to_commit () {
  old=$(git rev-parse "$1^0")
  new=$(git rev-parse "$2^0")
  list_paths_in_commit "$new" |
  while read path
  do
    added_to_file "$old" "$new" "$path"
  done | {
    total=0
    while read count
    do
      total=$(( $total + $count ))
    done
    echo $total
  }
}
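The list_paths_in_commit helper used above is not defined in this article. A minimal stand-in, assuming you only want to count the *.c and *.h files recorded in the commit (the same restriction the earlier numbers used), could look like this; treat it as a hypothetical sketch, not as the helper the numbers were actually computed with.

```shell
# Hypothetical helper: list the *.c and *.h paths recorded in a commit.
list_paths_in_commit () {
  # -r recurses into subdirectories; --name-only drops mode and object name
  git ls-tree -r --name-only "$1" | grep '\.[ch]$'
}
```

Note that the "while read path" loop in added_to_commit would not cope with paths that contain unusual characters, so this sketch makes no attempt to handle them either.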

Monday, March 18, 2013

A bit annoyed by LinkedIn Endorsements

A few times a week, I get "X endorsed your skills and expertise" e-mail messages from LinkedIn, listing people from my past and present. One of the embarrassing ones I saw the other day was an endorsement on "Linux Kernel", made by somebody who used to work as a receptionist at a small company I used to be at several years ago. She didn't know (and didn't need to know) what technical work I did back then, I do not think she changed her career to know what technical work I do these days, and most importantly, I do not do the Kernel X-<.

And then today I got endorsements from a few Git people on "Ruby", but I know they know I do not do Ruby (not that I hate the language or its ecosystem; it is just that I didn't get around to touching it).

I was told by the former receptionist that LinkedIn nags every once in a while to give endorsements to others, and it is very easy to click on it only to dismiss the nagging message and end up giving such irrelevant endorsements.

It is mildly annoying. Just as annoying as that big red "unread count" number I see on the right top corner of the Gmail window.

Grumpy I am.

Thursday, March 14, 2013

Measuring Project Activities (1)

Earlier, I showed a handful of metrics to view the level of activities in the Git project, grouped by its release cycle, and promised to explain how you can compute similar numbers for your projects.

This is the first of such posts. This post covers the very basics.

How long did a cycle last?

Each release is given a release tag. The latest I tagged for the Git project was v1.8.2 and the release before that was v1.8.1. The release cycle began when I tagged v1.8.1 and ended when I tagged v1.8.2. As each commit in Git records commit timestamp and author timestamp, we can use the difference between the commit timestamps of the two releases.

We can ask git log to give us the timestamp for one commit:

  git log -1 --format=%ct $commit

The --format= option lets us ask for various pieces of information, and %ct requests the committer timestamp, expressed as number of seconds since midnight of January 1, 1970 (epoch). You can use %ci for the same information but in an ISO 8601-like format, i.e. YYYY-MM-DD HH:MM:SS followed by the timezone offset; see git log --help and look for "PRETTY FORMATS" for other possibilities.

So the part, given two commits, that computes the number of days between them, becomes something like this:

handle () {
  old="$1^0" new="$2^0"
  oldtime=$(git log -1 --format=%ct "$old")
  newtime=$(git log -1 --format=%ct "$new")
  days=$(( ($newtime - $oldtime) / (60 * 60 * 24) ))
  ...
}

We ask the commit timestamps for the two commits in seconds since epoch, take the difference, and divide that by number of seconds in a day.

How many commits do we have in the cycle?

This is a single-liner.
git log has a way to list commits in a specified range, and the range we want can be expressed as: "We are interested in commits that are ancestors of v1.8.2, but we are not interested in commits that are ancestors of v1.8.1" (as the latter is the set of commits that happened before v1.8.1).
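The range described in words above is what the two-dot notation spells. As a small sketch (the function names are made up for this example), the following two helpers count the same set of commits, once with "old..new" and once with the longer "new ^old" form:

```shell
# Count commits that are ancestors of $2 but not ancestors of $1.
commits_in_range () {
  git rev-list --count "$1..$2"
}

# The same set, spelled as "reachable from $2, excluding (^) $1".
commits_in_range_long () {
  git rev-list --count "$2" "^$1"
}
```

In git.git you would call it as "commits_in_range v1.8.1 v1.8.2"; both forms include merge commits, which is the topic of the next few paragraphs.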

In a merge-heavy project like Git, however, merge commits make up a significant part of the history. A logical change that consists of three patches may start its life as three commits on a topic branch to be tested, and later when it proves to be sound gets merged to the mainline with a merge commit, at which point the mainline gains 4 commits (the original three plus the merge commit). That means the real change is only 75% of the history in the example.

Of course, merging other people's work is an important part of the work done in the project, so you may want to count merge commits as well. The choice is up to you.

When I counted commits for Git project in the earlier article, I chose not to include merges, so the part that computes the number of commits between two given commits becomes:

handle () {
  old="$1^0" new="$2^0"
  ...
  commits=$(git log --oneline --no-merges "$old..$new" | wc -l)
  ...
}

Drop --no-merges if you want to count your merges. The --oneline option is to show a single line of output per commit; by counting the lines in the output from that command with wc -l, we can count the number of commits.

As we are not interested in the contents of the output (we are just counting the number of lines), we can also use git rev-list that only shows the commit object name, if you want.

  commits=$(git rev-list --no-merges "$old..$new" | wc -l)

How many contributors did we have in the cycle?

You can list the names and e-mails of people who authored commits in a specified range in two ways.

Using the git log --format we saw earlier, we can ask the name %an and e-mail %ae of the author, i.e.

  git log --no-merges --format="%an <%ae>" "$old..$new"

You can count the unique lines in the output from this command. That is the list of your contributors. The end result will become something like this:

  authors=$(git log ...the same as above... | sort -u | wc -l)

The other way is to use the git shortlog command designed specifically for this purpose.

  git shortlog --no-merges -s -e "$old..$new"

The command without -e option only shows the names (and with it, names and e-mails). It lists commits made by each author along with the author name when run without -s option (and with it, the number of commits and the author's name on the same line). So the number of lines in the output from the above command is the number of your contributors.

  authors=$(git shortlog --no-merges -s -e "$old..$new" | wc -l)

Again, if you want to count merges, drop --no-merges from the command line.

How many new contributors have we added during the cycle?

This is a bit trickier than the previous one. The idea is to list contributors we already had in the entire history before v1.8.1, and subtract that from the list of contributors in the entire history up to v1.8.2. The remainder are the newcomers you want to welcome when writing your release notes.

The contributors in the entire history leading to a commit can be listed with a helper function:

authors () {
  git log --no-merges --format="%an <%ae>" "$@" | sort -u
}

and we can write the entire thing using the helper function like so:

handle () {
  old="$1^0" new="$2^0"
  ...
  authors "$old" >/tmp/old
  authors "$new" >/tmp/new
  new_authors=$(comm -13 /tmp/old /tmp/new | wc -l)
  ...
}

The authors helper function will write the authors for the old history and new history into two temporary files, both in sorted order, and using comm -13, we list lines that only appear in /tmp/new to see who are the new contributors.
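If comm is new to you, here is a tiny stand-alone illustration with made-up contributor names; the two temporary files play the roles of /tmp/old and /tmp/new.

```shell
old=$(mktemp) && new=$(mktemp) &&
# Both inputs must be sorted, which "sort -u" in the helper guarantees.
printf '%s\n' "Alice <alice@example.com>" \
              "Bob <bob@example.com>" >"$old" &&
printf '%s\n' "Alice <alice@example.com>" \
              "Bob <bob@example.com>" \
              "Carol <carol@example.com>" >"$new" &&
# -1 drops lines found only in the first file, -3 drops lines common to
# both, which leaves the lines that appear only in the second file.
newcomers=$(comm -13 "$old" "$new") &&
rm -f "$old" "$new"
```

Only Carol appears in the second list and not in the first, so she is the one newcomer left in $newcomers.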

In the next installment of this series, let's count the changes made to the codebase by these commits we counted in this article.

Wednesday, March 13, 2013

So how well are we doing lately?

Git v1.8.2 has 630+ non-merge commits since v1.8.1 release.

Averaged over time, the impact of each individual commit is about the same (because we reject an oversized patch that does too many unrelated things and have the submitter split it into multiple patches), so if one release cycle has twice the non-merge commits compared to another cycle, we can say it was about twice as busy.

But from time to time, it is good to measure our progress with different metrics to see how different metrics correlate with each other.

The table at the end lists how long each release cycle lasted, how many non-merge commits we had in the cycle and in each day in the cycle and how many lines of code (counting only *.[ch] source files) were affected in the cycle and in each day in the cycle. The "lines of code affected" are counted in two different ways.

We can see that the latest release cycle was relatively active, while the previous cycle that overlapped the year-end holidays was slow, both from commit count (348 vs 635) and one modification count (2921 vs 5881), but not with the other modification count (6047 vs 6355).

The "modified" number is computed with "git blame -C -C -C -w" to track line-level movements, while the "modified2" number is computed with "git diff -M" which is less accurate when the code gets refactored, and that is where the above apparent discrepancy comes from.

The v1.8.1 cycle has a lot smaller real changes (measured by "git blame") than apparent changes (measured by "git diff -M") because it created two new files by splitting two existing files.  The measurement based on "git blame -C" knows to consider bulk movement of lines by such a change as a non-event, but it will show up in "git diff --stat -M" output as a large change.

 builtin/fetch-pack.c | 950 +-------------------------------------------------------------
 builtin/send-pack.c  | 333 ----------------------
 fetch-pack.c         | 951 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 send-pack.c          | 344 +++++++++++++++++++++++

So the "modified" number is a better indication of how much actual work is done, but it is painfully expensive to compute.

I'll later explain how to use blame and diff to compute these numbers for your projects in a separate post.

release   days  commits  commit/day  modified  mod/day  modified2  mod2/day
v1.5.0      90     1448       16.08     12369   137.43      13640    151.55
v1.5.1      49      643       13.12      7822   159.63       8213    167.61
v1.5.2      46      575       12.50      7604   165.30       8273    179.84
v1.5.3     104     1322       12.71      8081    77.70       9538     91.71
v1.5.4     152     1595       10.49     21422   140.93      24934    164.03
v1.5.5      66      729       11.04      9790   148.33      12172    184.42
v1.5.6      71      569        8.01      7091    99.87       8354    117.66
v1.6.0      59      731       12.38     16709   283.20      19481    330.18
v1.6.1     129     1033        8.00     10316    79.96      14262    110.55
v1.6.2      69      499        7.23      4835    70.07       5322     77.13
v1.6.3      63      692       10.98      6642   105.42       8687    137.88
v1.6.4      83      500        6.02     13571   163.50      14296    172.24
v1.6.5      72      412        5.72      5018    69.69       5623     78.09
v1.6.6      74      483        6.52      6011    81.22       6730     90.94
v1.7.0      51      569       11.15      7698   150.94       8635    169.31
v1.7.1      70      477        6.81      5830    83.28       6558     93.68
v1.7.2      88      532        6.04      5615    63.80       6380     72.50
v1.7.3      59      481        8.15     20753   351.74      21473    363.94
v1.7.4     134      746        5.56      8527    63.63       9744     72.71
v1.7.5      83      548        6.60      6766    81.51       7543     90.87
v1.7.6      63      427        6.77      3962    62.88       4351     69.06
v1.7.7      96      563        5.86      8928    93.00      10107    105.28
v1.7.8      62      426        6.87      5098    82.22       5463     88.11
v1.7.9      56      391        6.98      6338   113.17       6886    122.96
v1.7.10     69      440        6.37      5051    73.20       7271    105.37
v1.7.11     72      652        9.05      7354   102.13       8863    123.09
v1.7.12     63      382        6.06      3060    48.57       3411     54.14
v1.8.0      62      497        8.01      5611    90.50       6037     97.37
v1.8.1      71      348        4.90      2921    41.14       6047     85.16
v1.8.2      71      635        8.94      5881    82.83       6599     92.94

Git 1.8.2

The latest feature release Git v1.8.2 is now available at the usual places. The release has commits from 1200+ contributors, among which ~30 are new names in this cycle (~90 people contributed 630+ changes since v1.8.1 release).

I've already mentioned backward-compatibility notes and notable new features, so I won't repeat them.

Have fun.

Thursday, March 7, 2013

Git 1.8.2-rc3

This is the third and planned-to-be-the-final release candidate for the upcoming 1.8.2 release. Hopefully we can have the final sometime next week.

Reviewing the draft release notes, I see that there aren't many earth-shattering new features, but there are quite a few niceties around the fringes. To name a few at random:

  • Command line completion (in contrib/completion) has learned what can be "git add"ed and what are irrelevant. You would not want "git add hello.<TAB>" to offer choices between hello.c and hello.h when you only have changes in hello.c and hello.h is unmodified.
  • The patterns used in .gitignore and .gitattributes files can have double-asterisk /**/ to match 0 or more levels of directories.
  • In the documentation, we consistently refer to Git the software as "Git", not "git" or "GIT" (the last one was a poor-man's attempt at imitating "Git" rendered in SMALL CAPS).
  • "git branch" used to accept nonsense parameters from the command line and silently ignored them (e.g. "git branch -m A B C"). Such an erroneous input is checked more carefully.
  • "git log" and friends can be told to use the same mailmap mechanism used by "git shortlog" to canonicalize the user names.
  • "git log --grep=<pattern>" first converts the log messages to i18n.logoutputencoding before matching them against the pattern.
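The double-asterisk pattern is easy to try out with "git check-ignore". The sketch below uses made-up directory names under docs/ and a single pattern, "docs/**/build", which should match a build directory directly under docs/ as well as several levels down.

```shell
repo=$(mktemp -d) && cd "$repo" &&
git init -q . &&
# "/**/" matches zero or more levels of directories.
echo 'docs/**/build' >.gitignore &&
mkdir -p docs/build docs/a/b/build docs/src &&
: >docs/build/out && : >docs/a/b/build/out && : >docs/src/out &&
# check-ignore prints only the paths that are ignored.
ignored=$(git check-ignore docs/build/out docs/a/b/build/out docs/src/out)
```

The two paths under a build directory are reported as ignored, while docs/src/out is not.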

For notable backward compatibility issues (read: there is none yet ;-), please refer to the earlier article on 1.8.2-rc2.

Monday, March 4, 2013

Living in the browser seems to be possible

My primary Git work environment lives inside a GNU screen session that is running forever. In the first virtual terminal is an instance of Emacs, and it also is running forever. That is where I read e-mails and process patches. I have several virtual terminals in this screen session, and no matter where I physically am, I am connected to this screen session over SSH whenever I am working on Git.

Usually I open two or three Gnome terminals on my notebook and from these terminals go over SSH to the said screen session, but I tried Secure Shell in Chrome on Friday evening. An earlier version of this extension I tried a long time ago did not support anything but password authentication, but the recent versions seem to grok private key authentication just fine. It is still rough in that there is no UI to tweak font sizes &c, but a few essentials can be tweaked from the JavaScript console (I hear some of you say "eek" already) and these tweaks survive browser or machine restart. I do not mind doing "set once and forget" configuration by hand.

Following its FAQ page, I came up with the following to let me get going:

term_.prefs_.set('background-color', 'white')
term_.prefs_.set('foreground-color', 'black')
term_.prefs_.set('font-family', 'monospace')
term_.prefs_.set('font-size', 12)
term_.prefs_.set('scrollbar-visible', false)

After opening the terminal window (make sure to set the extension to open in its own window; otherwise an innocent \C-w will close the terminal), I clicked to "Inspect Element", then went to "Console" to open the JavaScript console, and then typed the above five lines. Enter the connection parameter and I have a working and usable terminal. Happiness.

I didn't really mean to, but I ended up not opening any program other than Chrome over the weekend. It was rather a fun experience.

I do have to run a few local programs from time to time (e.g. GnuCash, Gimp, and Kid3 come to mind), and I do not think I can switch entirely to a Chromebook yet, but I should be able to survive for a few days with only Chrome.