Thursday, March 21, 2013

Measuring Project Activities (2)

Continuing from an earlier article, let's see how you can compute some interesting stats on your own projects.

How much change did a release have?

As I said earlier, you can measure the extent of change to your codebase in two ways: a quicker but less precise way, and a more involved but more accurate way.

A quicker way is to ask git diff --numstat to count the deleted and added lines between the release tags, and add them up yourself. If you care about whole-file renames, you can add the -M option to the git diff command:

addremove2 () {
  git diff --numstat "$@" | {
    total=0 &&
    while read add remove path
    do
      total=$(( $total + $add + $remove ))
    done &&
    echo "$total"
  }
}

And with that helper, the main function we introduced in the earlier article can do this to compute the modified2 number for the entire release cycle and per each day:

handle () {
  old="$1^0" new="$2^0"
  # $days is computed as shown in the earlier article
  modified2=$(addremove2 "$old" "$new")
  mod2perday=$( echo 2k "$modified2" "$days" / p | dc )
}
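
One wrinkle with this approach: for binary files, git diff --numstat prints "-" in place of the two counts, and the shell arithmetic would choke on it. A minimal sketch that skips such entries, and computes the per-day figure with awk in case dc is not installed, might look like the following. The names sum_numstat and per_day are made up for illustration, and awk rounds to two decimal places where dc with 2k truncates.

```shell
# Sum added+deleted counts from "git diff --numstat" output on stdin.
# Binary files are reported as "-<TAB>-<TAB>path"; skip them.
sum_numstat () {
  total=0
  while read add remove path
  do
    case "$add" in
    -) continue ;;
    esac
    total=$(( total + add + remove ))
  done
  echo "$total"
}

# Divide a count by a number of days, rounded to two decimal places.
per_day () {
  awk -v n="$1" -v d="$2" 'BEGIN { printf "%.2f\n", n / d }'
}

# Usage (inside a repository):
#   modified2=$(git diff --numstat -M v1.8.0 v1.8.1 | sum_numstat)
#   per_day "$modified2" "$days"
```

Keeping the summing separate from the git invocation makes it easy to try the function on hand-written input.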

How much real change did a release have?

Counting the number of added and removed lines using git diff --numstat is straightforward, but it tends to over-count changes. For example, when adding a new caller to existing code, you may have to move that existing code up in the same file (especially if it is a file-local static function) to make the callee come before the caller, or move it to a different, more "library-ish" file, while changing its visibility from static to extern. Both kinds of changes unfortunately appear as a bulk deletion of an existing block of lines and a bulk addition of the same contents elsewhere in the codebase.

In order to count the true amount of work that went into the new release, you would want to exclude such changes from your statistics.

This is where git blame can help. In the most basic form, it can trace each and every line of a file in the given commit back to its origin, i.e. which commit it came from. By default, it notices when the whole file gets renamed (e.g. the file hello.c you are running the command on in the current release may have been called goodbye.c in an earlier release), and employs no other fancy tricks, but you can tell it to notice code movement within a file (e.g. moving the callee up in the file) with the -M option, or code moves across files (e.g. moving a static function from a file that an existing caller lives in to a different "library-ish" file, to make it also visible to a new caller in another file) with the -C option. You can also tell it to ignore whitespace changes with the -w option like you can with git diff. For example:

  git blame -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c

will show you which commit each and every line in the fetch-pack.c file came from; its output may begin like this:

745f7a8c fetch-pack.c           1) #include "cache.h"
^8c7a786 builtin/fetch-pack.c   2) #include "refs.h"
^8c7a786 builtin/fetch-pack.c   3) #include "pkt-line.h"

The first line is blamed to commit 745f7a8c, while the other lines are attributed to commit 8c7a786 (the leading caret ^ means it is attributed to a commit at the lower boundary of the range), which is the v1.8.0 release. Note that these old lines used to live in a different file builtin/fetch-pack.c in the older release, and would have been counted as additions if you used the approach based on git diff --numstat -M to count them, because there was no file renaming involved between these two releases.

Also notice that these lines may have been untouched since a commit that may be a lot older than v1.8.0, but we told the command to stop at v1.8.0 from the command line, so these are all attributed to that range boundary.

If you count the number of lines in the whole output from the above command, that will show the number of lines in the fetch-pack.c file at the v1.8.1 release. If you count the lines that do not begin with a caret, that counts the lines added in the new release.

added_to_file () {
  old="$1" new="$2" path="$3"
  git blame -M -C -w -s "$old".."$new" -- "$path" |
  grep -v '^^' |
  wc -l
}
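
To see the counting rule in isolation, here is a small sketch that reads git blame -s output on its standard input and reports both numbers at once: the total number of lines, and the number of lines not attributed to the range boundary. The function name blame_counts is made up for illustration.

```shell
# Read "git blame -s OLD..NEW -- path" output on stdin and print
# "<total lines> <lines not blamed on the boundary commit>".
blame_counts () {
  awk '
    !/^\^/ { fresh++ }
    END    { printf "%d %d\n", NR, fresh }
  '
}

# Usage (inside a repository):
#   git blame -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c | blame_counts
```

Fed the three sample lines shown earlier, it prints "3 1": three lines in total, one of which is new.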

This may be sufficient as a starting point, but we are not at all interested in checking each and every commit between the two releases (e.g. the commit 745f7a8c in the above example is not the v1.8.1 release, and the only thing we care about is that the line is new in the new release; we do not care where in the development cycle leading to the release it was added), so doing so is a waste of computational cycles.

Fortunately, you can tell git blame to pretend as if the commit tagged as v1.8.1 release were a direct and sole child of the commit tagged as v1.8.0 release with the -S option. First, you prepare a graft file to describe the parent-child relationship.

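added_to_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame-graft.$$   # scratch file; the name is an arbitrary choice
  cat >"$graft" <<EOF
$new $old
$old
EOF
  git blame -S "$graft" -M -C -w -s "$old".."$new" -- "$path" |
  grep -v '^^' | wc -l
}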

The graft file lists each commit object and its parent. The above snippet says that the $new commit has a single parent, which is $old, and $old commit does not have any parent. This lets us lie to git blame that our history consists of only two commits, and one is a direct child of the other.
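
The file format is simple enough to produce with a single printf. The following is a minimal sketch, with write_graft being a name made up for illustration; as far as I know the entries have to be full object names, so pass the output of git rev-parse rather than tag names.

```shell
# Write a two-commit graft file: CHILD has PARENT as its only parent,
# and PARENT has no parent at all.
write_graft () {
  child="$1" parent="$2" file="$3"
  printf '%s %s\n%s\n' "$child" "$parent" "$parent" >"$file"
}

# Usage (inside a repository):
#   graft=$(mktemp)
#   write_graft "$(git rev-parse v1.8.1^0)" "$(git rev-parse v1.8.0^0)" "$graft"
#   git blame -S "$graft" -M -C -w -s v1.8.0..v1.8.1 -- fetch-pack.c
#   rm -f "$graft"
```

Swapping the first two arguments gives the reversed history, in which the old release pretends to be the child of the new one.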

With this, we can tell how much new material was introduced to the given path in the new release, but what about the material removed from the old release? We can compute it in a similar way with a twist. You take a path in the old release, and pretend as if the old release were the direct child of the new release. We compute what we have added if we started from release v1.8.1 and development led to the contents of v1.8.0, like this:

removed_from_file () {
  old="$1" new="$2" path="$3"
  graft=/tmp/blame-graft.$$   # scratch file; the name is an arbitrary choice
  cat >"$graft" <<EOF
$old $new
$new
EOF
  git blame -S "$graft" -M -C -w -s "$new".."$old" -- "$path" |
  grep -v '^^' |
  wc -l
}

By tying these two helper functions together with a list of paths that existed in the two releases, you can compute the amount of real change made to reach the new release. This article is getting a bit too long, so I'll leave the details to another installment, but as a preview, we will use the added_to_file helper to construct an added_to_commit function like this:

added_to_commit () {
  old=$(git rev-parse "$1^0")
  new=$(git rev-parse "$2^0")
  list_paths_in_commit "$new" |
  while read path
  do
    added_to_file "$old" "$new" "$path"
  done | {
    total=0
    while read count
    do
      total=$(( $total + $count ))
    done
    echo $total
  }
}
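
The list_paths_in_commit helper is not shown here. One plausible way to write it, assuming we only care about *.c and *.h files as the numbers in the earlier article did, is to ask git ls-tree for every path recorded in the commit and filter the list. This is a guess at the shape of the helper, not necessarily what was actually used, and only_c_sources is a made-up name.

```shell
# Keep only the paths ending in .c or .h from the list on stdin.
only_c_sources () {
  grep -e '\.[ch]$'
}

# List the C source paths recorded in the given commit (hypothetical helper).
# Note: paths with unusual characters are quoted by ls-tree unless -z is used.
list_paths_in_commit () {
  git ls-tree -r --name-only "$1" | only_c_sources
}
```

Splitting the filter out keeps the choice of "which files count as source" in one place, so the same list can drive both the added and the removed side of the computation.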

Monday, March 18, 2013

A bit annoyed by LinkedIn Endorsements

A few times a week, I get "X endorsed your skills and expertise" e-mail messages from LinkedIn, listing people from my past and present. One of the embarrassing ones I saw the other day was an endorsement on "Linux Kernel", made by somebody who used to work as a receptionist at a small company I used to be at several years ago. She didn't know (and didn't need to know) what technical work I did back then, I do not think she changed her career to know what technical work I do these days, and most importantly, I do not do the Kernel X-<.

And then today I got endorsements from a few Git people on "Ruby", but I know they know I do not do Ruby (not that I hate the language or its ecosystem; it is just that I haven't gotten around to touching it).

I was told by the former receptionist that LinkedIn nags every once in a while to give endorsements to others, and it is very easy to click on it just to dismiss the nagging message and end up giving such irrelevant endorsements.

It is mildly annoying. Just as annoying as that big red "unread count" number I see in the top right corner of the Gmail window.

Grumpy I am.

Thursday, March 14, 2013

Measuring Project Activities (1)

Earlier, I showed a handful of metrics to view the level of activity in the Git project, grouped by its release cycle, and promised to explain how you can compute similar numbers for your projects.

This is the first of such posts. This post covers the very basics.

How long did a cycle last?

Each release is given a release tag. The latest I tagged for the Git project was v1.8.2 and the release before that was v1.8.1. The release cycle began when I tagged v1.8.1 and ended when I tagged v1.8.2. As each commit in Git records a commit timestamp and an author timestamp, we can use the difference between the commit timestamps of the two releases.

We can ask git log to give us the timestamp for one commit:

  git log -1 --format=%ct $commit

The --format= option lets us ask for various pieces of information, and %ct requests the committer timestamp, expressed as the number of seconds since midnight of January 1, 1970 UTC (epoch). You can use %ci for the same information in an ISO 8601-like format, i.e. YYYY-MM-DD HH:MM:SS followed by the timezone offset; see git log --help and look for "PRETTY FORMATS" for other possibilities.

So the part, given two commits, that computes the number of days between them, becomes something like this:

handle () {
  old="$1^0" new="$2^0"
  oldtime=$(git log -1 --format=%ct "$old")
  newtime=$(git log -1 --format=%ct "$new")
  days=$(( ($newtime - $oldtime) / (60 * 60 * 24) ))
}

We ask the commit timestamps for the two commits in seconds since epoch, take the difference, and divide that by number of seconds in a day.
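
The arithmetic can be tried on its own, without a repository. The sketch below wraps it in a function (days_between is a name made up for illustration). Note that shell arithmetic truncates, so a cycle shorter than one day yields 0, which matters if you later divide by the number of days.

```shell
# Number of whole days between two timestamps in seconds since epoch.
days_between () {
  echo $(( ($2 - $1) / (60 * 60 * 24) ))
}

# Usage (inside a repository):
#   days_between "$(git log -1 --format=%ct v1.8.1^0)" \
#                "$(git log -1 --format=%ct v1.8.2^0)"
```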

How many commits do we have in the cycle?

This is a single-liner. git log has a way to list commits in a specified range, and the range we want can be expressed as: "We are interested in commits that are ancestors of v1.8.2, but we are not interested in commits that are ancestors of v1.8.1" (as the latter is the set of commits that happened before v1.8.1).

In a merge-heavy project like Git, however, merge commits make up a significant part of the history. A logical change that consists of three patches may start its life as three commits on a topic branch to be tested, and later when it proves to be sound gets merged to the mainline with a merge commit, at which point the mainline gains 4 commits (the original three plus the merge commit). That means the real change is only 75% of the history in the example.

Of course, merging other people's work is an important part of the work done in the project, so you may want to count merge commits as well. The choice is up to you.

When I counted commits for Git project in the earlier article, I chose not to include merges, so the part that computes the number of commits between two given commits becomes:

handle () {
  old="$1^0" new="$2^0"
  commits=$(git log --oneline --no-merges "$old..$new" | wc -l)
}

Drop --no-merges if you want to count your merges. The --oneline option is to show a single line of output per commit; by counting the lines in the output from that command with wc -l, we can count the number of commits.

As we are not interested in the contents of the output (we are just counting the number of lines), we can also use git rev-list that only shows the commit object name, if you want.

  commits=$(git rev-list --no-merges "$old..$new" | wc -l)
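
If you are curious how merge-heavy your own history is, you can count the range both with and without --no-merges and compare the two numbers. A minimal sketch, with nonmerge_percent being a made-up name; the percentage is rounded down:

```shell
# Percentage (rounded down) of the commits in a range that are not merges.
nonmerge_percent () {
  all="$1" nonmerge="$2"
  echo $(( nonmerge * 100 / all ))
}

# Usage (inside a repository):
#   all=$(git rev-list "$old..$new" | wc -l)
#   nonmerge=$(git rev-list --no-merges "$old..$new" | wc -l)
#   nonmerge_percent "$all" "$nonmerge"
```

For the three-patch topic merged with one merge commit described earlier, nonmerge_percent 4 3 gives 75.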

How many contributors did we have in the cycle?

You can list the names and e-mails of people who authored commits in a specified range in two ways.

Using the git log --format we saw earlier, we can ask the name %an and e-mail %ae of the author, i.e.

  git log --no-merges --format="%an <%ae>" "$old..$new"

You can count the unique lines in the output from this command. That is the list of your contributors. The end result will become something like this:

  authors=$(git log ...the same as above... | sort -u | wc -l)

The other way is to use the git shortlog command designed specifically for this purpose.

  git shortlog --no-merges -s -e "$old..$new"

The command without -e option only shows the names (and with it, names and e-mails). It lists commits made by each author along with the author name when run without -s option (and with it, the number of commits and the author's name on the same line). So the number of lines in the output from the above command is the number of your contributors.

  authors=$(git shortlog --no-merges -s -e "$old..$new" | wc -l)

Again, if you want to count merges, drop --no-merges from the command line.
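
The counting half of the first approach can be sketched as a small filter (count_unique is a name made up for illustration). The tr call strips the padding that some wc implementations put in front of the number. Be aware that the two approaches may not agree exactly: git shortlog applies the mailmap, while %an and %ae show the names and addresses as recorded, so a person who committed under two spellings is counted twice here.

```shell
# Count distinct lines on stdin, e.g. "Name <email>" author lines.
count_unique () {
  sort -u | wc -l | tr -d ' '
}

# Usage (inside a repository):
#   authors=$(git log --no-merges --format="%an <%ae>" "$old..$new" | count_unique)
```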

How many new contributors have we added during the cycle?

This is a bit trickier than the previous one. The idea is to list contributors we already had in the entire history before v1.8.1, and subtract that from the list of contributors in the entire history up to v1.8.2. The remainder are the newcomers you want to welcome when writing your release notes.

The contributors in the entire history leading to a commit can be listed with a helper function:

authors () {
  git log --no-merges --format="%an <%ae>" "$@" | sort -u
}

and we can write the entire thing using the helper function like so:

handle () {
  old="$1^0" new="$2^0"
  authors "$old" >/tmp/old
  authors "$new" >/tmp/new
  new_authors=$(comm -13 /tmp/old /tmp/new | wc -l)
}

The authors helper function will write the authors for the old history and new history into two temporary files, both in sorted order, and using comm -13, we list lines that only appear in /tmp/new to see who are the new contributors.
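
Two practical notes on this step, sketched below under my own assumptions: fixed names like /tmp/old can collide with other users or other runs, so a private directory from mktemp -d (widely available, though not in POSIX) is safer; and comm expects its inputs in the same collation order that sort used, which is easiest to guarantee by forcing LC_ALL=C on both. The name only_in_new is made up for illustration.

```shell
# Print the lines that appear only in the second of two sorted files.
only_in_new () {
  LC_ALL=C comm -13 "$1" "$2"
}

# Usage (inside a repository; "authors" is the helper from this article):
#   tmp=$(mktemp -d) &&
#   authors "$old" | LC_ALL=C sort -u >"$tmp/old" &&
#   authors "$new" | LC_ALL=C sort -u >"$tmp/new" &&
#   new_authors=$(only_in_new "$tmp/old" "$tmp/new" | wc -l)
#   rm -rf "$tmp"
```

Printing the names rather than only counting them is handy when you want to thank the newcomers in the release notes.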

In the next installment of this series, let's count the changes made to the codebase by these commits we counted in this article.

Wednesday, March 13, 2013

So how well are we doing lately?

Git v1.8.2 has 630+ non-merge commits since v1.8.1 release.

Averaged over time, the impact of each individual commit is about the same (because we reject an oversized patch that does too many unrelated things and have the submitter split it into multiple patches), so if one release cycle has twice the non-merge commits compared to another cycle, we can say it was about twice as busy.

But from time to time, it is good to measure our progress with different metrics to see how different metrics correlate with each other.

The table at the end lists how long each release cycle lasted, how many non-merge commits we had in the cycle and in each day in the cycle and how many lines of code (counting only *.[ch] source files) were affected in the cycle and in each day in the cycle. The "lines of code affected" are counted in two different ways.

We can see that the latest release cycle was relatively active, while the previous cycle that overlapped the year-end holidays was slow, both from commit count (348 vs 635) and one modification count (2921 vs 5881), but not with the other modification count (6047 vs 6355).

The "modified" number is computed with "git blame -C -C -C -w" to track line-level movements, while the "modified2" number is computed with "git diff -M" which is less accurate when the code gets refactored, and that is where the above apparent discrepancy comes from.

The v1.8.1 cycle has a lot smaller real changes (measured by "git blame") than apparent changes (measured by "git diff -M") because it created two new files by splitting two existing files.  The measurement based on "git blame -C" knows to consider bulk movement of lines by such a change as a non-event, but it will show up in "git diff --stat -M" output as a large change.

 builtin/fetch-pack.c | 950 +-------------------------------------------------------------
 builtin/send-pack.c  | 333 ----------------------
 fetch-pack.c         | 951 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 send-pack.c          | 344 +++++++++++++++++++++++

So the "modified" number is a better indication of how much actual work is done, but it is painfully expensive to compute.

I'll later explain how to use blame and diff to compute these numbers for your projects in a separate post.

release  days  commits  commit/day  modified  mod/day  modified2  mod2/day

Git 1.8.2

The latest feature release Git v1.8.2 is now available at the usual places. The release has commits from 1200+ contributors, among which ~30 are new names in this cycle (~90 people contributed 630+ changes since v1.8.1 release).

I've already mentioned backward-compatibility notes and notable new features, so I won't repeat them.

Have fun.

Thursday, March 7, 2013

Git 1.8.2-rc3

This is the third and planned-to-be-the-final release candidate for the upcoming 1.8.2 release. Hopefully we can have the final sometime next week.

Reviewing the draft release notes, I see that there aren't many earth-shattering new features, but there are quite a few niceties around the fringes. To name a few at random:

  • Command line completion (in contrib/completion) has learned what can be "git add"ed and what are irrelevant. You would not want "git add hello.<TAB>" to offer choices between hello.c and hello.h when you only have changes in hello.c and hello.h is unmodified.
  • The patterns used in .gitignore and .gitattributes files can have double-asterisk /**/ to match 0 or more levels of directories.
  • In the documentation, we consistently refer to Git the software as "Git", not "git" or "GIT" (the last one was a poor-man's attempt for imitating "Git" rendered in SMALL CAPS).
  • "git branch" used to accept nonsense parameters from the command line and silently ignored them (e.g. "git branch -m A B C"). Such an erroneous input is checked more carefully.
  • "git log" and friends can be told to use the same mailmap mechanism used by "git shortlog" to canonicalize the user names.
  • "git log --grep=<pattern>" first converts the log messages to i18n.logoutputencoding before matching them against the pattern.

For notable backward compatibility issues (read: there is none yet ;-), please refer to the earlier article on 1.8.2-rc2.

Monday, March 4, 2013

Living in the browser seems to be possible

My primary Git work environment lives inside a GNU screen session that is running forever. In the first virtual terminal is an instance of Emacs, and it also is running forever. That is where I read e-mails and process patches. I have several virtual terminals in this screen session, and no matter where I physically am, I am connected to this screen session over SSH whenever I am working on Git.

Usually I open two or three Gnome terminals on my notebook and from these terminals go over SSH to the said screen session, but I tried Secure Shell in Chrome on Friday evening. An earlier version of this extension I tried a long time ago did not support anything but password authentication, but the recent versions seem to grok private key authentication just fine. It is still rough in that there is no UI to tweak font sizes &c, but a few essentials can be tweaked from the JavaScript console (I hear some of you say "eek" already) and these tweaks survive browser or machine restart. I do not mind doing "set once and forget" configuration by hand.

Following its FAQ page, I came up with the following to let me get going:

term_.prefs_.set('background-color', 'white')
term_.prefs_.set('foreground-color', 'black')
term_.prefs_.set('font-family', 'monospace')
term_.prefs_.set('font-size', 12)
term_.prefs_.set('scrollbar-visible', false)

After opening the terminal window (make sure to set the extension to open in its own window; otherwise an innocent \C-w will close the terminal), I clicked "Inspect Element", then went to "Console" to open the JavaScript console, and then typed the above five lines. Enter the connection parameters and I have a working and usable terminal. Happiness.

I didn't really mean to, but I ended up not opening any program other than Chrome over the weekend. It was rather a fun experience.

I do have to run a few local programs from time to time (e.g. GnuCash, Gimp, and Kid3 come to mind), and I do not think I can switch entirely to a Chromebook yet, but I should be able to survive for a few days with only Chrome.

Sunday, March 3, 2013

Git 1.8.2 release candidate #2

The upcoming release is taking shape and I am hoping that not many things will change until the final one. I just tagged the second release candidate 1.8.2-rc2 before going to bed.

There are a handful of behaviour changes that are worth noting.
  • "git push $there tag v1.2.3" used to allow replacing a tag v1.2.3 that already exists in the repository $there, if the rewritten tag you are pushing points at a commit that is a descendant of a commit that the old tag v1.2.3 points at. This was found to be error prone and starting with this release, any attempt to update an existing ref under refs/tags/ hierarchy will fail, without "--force".
  • When "git add -u" or "git add -A" is run from inside a subdirectory without specifying on the command line what paths to add, the scope of the operation has always been limited to the subdirectory. Many users found this counter-intuitive, given that "git commit -a" and other commands operate on the entire tree regardless of where you are. In this release, these commands give a warning in such a case and encourage the user to say "git add -u/-A ." instead when restricting the scope to the current directory.
  • At Git 2.0 (not *this* one), we plan to change these commands without pathspec to operate on the entire tree. Forming a habit to type "." when you mean to limit the command to the current working directory will protect you against the planned future change, and that is the whole point of the new message (there will be no configuration variable to squelch this warning---it goes against the "habit forming" objective).
For exciting new features, please refer to the draft release notes.