How to use git shortlog to get contributor commit counts
Overview
This article is a detailed, step-by-step guide to using git's built-in git shortlog command to generate a simple but useful contributor statistic: the total number of commits per contributor.
It also explains how to handle contributors who changed their names, which is the most common problem when calculating contributor stats.
Or you can just read the TLDR, of course.
TLDR;
To create useful contributor commit statistics, you need to create a .mailmap file in the repository and add all required contributor mappings in the format correct name <correct@email> wrong name <wrong@email> before running git shortlog -sne.
A real-world Example
To make this more practical, let's focus on a real-world example. For this article I have decided to use AvaloniaUI but the specific repository doesn't matter.
To get started, we first clone the project so that we have a local git repository to work with, and then enter the newly created directory:
git clone https://github.com/AvaloniaUI/Avalonia
cd Avalonia
Out-of-the-box experience
Now that we are inside a repo, we can try it out by invoking git shortlog:
git shortlog
As the output we get the following:
0x90d (8):
Fix datagrid right click selection
Merge branch 'master' into master
Update DefaultMenuInteractionHandler.cs
Update DataGridColumn.cs
Update DataGridColumnHeader.cs
Update DataGridColumn.cs
Update DataGridColumn.cs
Update DataGridColumn.cs
0xDB6 (1):
Defer execution of UnregisterClass to fix RWM atom table leak (#9700)
3dfxuser (1):
Add null check for TextInputMethodClient in OnSelectionChanged() method
ARSolog (3):
Update DataAnnotationsValidationPluginTests.cs
Update ExpressionObserverTests_DataValidation.cs
Merge branch 'master' into patch-1
Abdella Solomon (3):
Merge branch 'master' into master
Merge branch 'master' into master
Changed text foreground of progressbar from template binding to SystemControlForegroundBaseHighBrush
...
Taking a closer look, what exactly do we get as output?
We get an alphabetically sorted list of contributor names. Each name is followed by the total number of commits that contributor has authored, in parentheses, and then by the subject lines of those commits, from the beginning of the history up to the current commit.
This is what we get when running git shortlog with its defaults. Now let's see what options we have to modify them.
Getting rid of the text
Chances are that we are more interested in the raw numbers than in the commit message text. To get rid of the noisy messages, we will use the --summary flag (shorthand: -s).
Let's try it out:
git shortlog -s
...and we get the following output:
8 0x90d
1 0xDB6
1 3dfxuser
3 ARSolog
3 Abdella Solomon
...
Nice! This condensed the output and left us just with the name and the number of authored commits.
Sorting
By default, the sorting is done alphabetically, but we might want to sort the output by the number of commits descending. This way, we will get the contributors who authored the most commits at the top.
Of course we could sort it with an external tool, e.g. Excel, but that's not necessary as this is supported out of the box via the --numbered flag (shorthand: -n).
Let's combine this flag with the previously shown summary flag -s and run the command again:
git shortlog -sn
...and we get the following output:
6787 Steven Kirk
3378 Dan Walmsley
2761 Max Katz
2093 Nikita Tsukanov
1323 Jumar Macato
1041 danwalmsley
...
Authors Vs. Committers
You might have noticed that in the previous sections I've used the word 'author' instead of 'committer'. The reason is that the two are technically not the same. In git, every commit records the name and email address of two people, the author and the committer, and they are not necessarily the same person.
So when we are counting commits, we can choose whether to look at the committer or the author of each commit. By default, git shortlog counts by author, which is probably what most people are interested in anyway, but we can tell it to count by committer instead via the --group=committer flag (shorthand: -c).
Just for demonstration purposes, let's add the -c flag:
git shortlog -snc
...and see what we get:
8707 GitHub
5289 Steven Kirk
2695 Dan Walmsley
1541 Nikita Tsukanov
840 Max Katz
579 Jumar Macato
...
This shows that most commits were committed by GitHub, probably when pull requests were merged. As this is not what we are interested in, we will stick to the default here and get rid of the -c flag.
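The difference between the two counts can be reproduced in a throwaway repository. The following is only a sketch: the names, email addresses and the commit_as helper are made up for illustration, and HEAD is passed explicitly because git shortlog reads from standard input when no revision is given and stdin is not a terminal (as is typical inside scripts).

```shell
# Scratch repository; every identity below is hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Hypothetical helper: create an empty commit with the given author,
# always committed by a fictional "Merge Bot".
commit_as() {
    GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$2" \
    GIT_COMMITTER_NAME='Merge Bot' GIT_COMMITTER_EMAIL='bot@example.com' \
    git commit -q --allow-empty -m "$3"
}

commit_as 'Alice' 'alice@example.com' 'First change'
commit_as 'Alice' 'alice@example.com' 'Second change'
commit_as 'Bob'   'bob@example.com'   'Third change'

# Same history, grouped by author and then by committer.
by_author=$(git shortlog -sn HEAD)
by_committer=$(git shortlog -snc HEAD)

printf 'By author:\n%s\n\nBy committer:\n%s\n' "$by_author" "$by_committer"
```

Grouped by author, Alice and Bob each get their own row. Grouped by committer, all three commits end up under the single "Merge Bot" identity, which mirrors the GitHub row above.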
Names Vs. Names + Emails
By default, a contributor is represented in this output by his or her username. But as mentioned before, a commit also holds the email address of the contributor.
We can tell git shortlog to show the email address of the contributor as well as the username by using the --email flag (shorthand: -e).
Let's try that out and add the -e flag (and also remove the committer flag -c):
git shortlog -sne
4391 Steven Kirk <grokys@gmail.com>
3226 Dan Walmsley <dan@walms.co.uk>
2761 Max Katz <maxkatz6@outlook.com>
2092 Nikita Tsukanov <keks9n@gmail.com>
2023 Steven Kirk <grokys@users.noreply.github.com>
1041 danwalmsley <dan@walms.co.uk>
786 Jumar Macato <hikari.netto23@gmail.com>
...
Okay, nice, that worked. But wait a minute, something else changed as well...
Problems with changing usernames
Previously, when we ran git shortlog -sn, the top contributor had 6787 authored commits. Now he only has 4391? What happened?
Well, because we specified that we want to see the email address in addition to the name, git had to count the commits for each combination of name and email address. By default, it groups only by name, so all commits that have the same author name but potentially different email addresses are added together.
In this case the top contributor used several email addresses, which results in different totals depending on whether you group by username alone or by the combination of username and email address.
So this leads us towards the realization that a contributor is only identified by their configured username and email address at the time of the commit - there is no contributor id or something similar that ties all their commits together.
Does this imply it's better to forgo using the email address to get the correct stats? No, because it doesn't solve the underlying problem and contributors might not only have changed their email addresses, but also their usernames.
Let's double check and go back and take a look at the output of git shortlog -sn again:
6787 Steven Kirk
3378 Dan Walmsley
2761 Max Katz
2093 Nikita Tsukanov
1323 Jumar Macato
1041 danwalmsley
...
We can see that Contributor #2 and #6 are actually the same person, but git can't - it sums them up separately.
Maybe this is an edge case and mostly a theoretical problem?
No, it's very real and you will be hard pressed to find a repo in the wild in which this never happened, which is exactly why we didn't start with git init but used git clone!
Maybe it's only about some 1 or 2 commits and it doesn't really matter in the grand scheme of things?
This will depend on the repo, specifically its age. The older the repo is, the more likely it is that contributors changed their name or email, maybe simply because they set up a new system and chose a different alias.
In a new repo it might not matter, but in our example it does and we need to find a solution.
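To see the effect in isolation, here is a small sketch with a scratch repository in which one person commits under three identities. The person, the addresses and the commit_as helper are all invented for illustration; HEAD is passed explicitly so that git shortlog does not fall back to reading standard input when run from a script.

```shell
# Scratch repository; the contributor and her addresses are hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Hypothetical helper: empty commit authored and committed by one identity.
commit_as() {
    GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$2" \
    GIT_COMMITTER_NAME="$1" GIT_COMMITTER_EMAIL="$2" \
    git commit -q --allow-empty -m "$3"
}

# One person: first the email changes, then the name.
commit_as 'Jane Doe' 'jane@example.com'  'Work from the old laptop'
commit_as 'Jane Doe' 'jane@example.com'  'More work from the old laptop'
commit_as 'Jane Doe' 'jane@work.example' 'Work from the office'
commit_as 'jdoe'     'jane@work.example' 'Work after reinstalling'

by_name=$(git shortlog -sn HEAD)             # grouped by name only
by_name_and_email=$(git shortlog -sne HEAD)  # grouped by name + email

printf 'By name:\n%s\n\nBy name and email:\n%s\n' "$by_name" "$by_name_and_email"
```

Grouping by name yields two rows for the same person, and adding the email splits her into three. Neither view gives the real total of four commits.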
Fixing Usernames via mailmap
Can we update them somehow?
No, I am afraid not. The author and committer are part of the commit itself, and the commit is identified by its SHA-1 hash, so there is no way to change them retroactively without rewriting history.
On the plus side, the combinations of a contributor's username and email are not endless. Most users have only one, but some might have a handful of different combinations in the worst case. This makes it quite feasible to do the mapping ourselves.
We could do that with an external tool like Excel, but once again there is a built-in way, which is better!
This can be achieved via a lesser-known feature of git called mailmap. Mailmap is based on a simple text file, called .mailmap by default. Each line of that file specifies a single mapping rule that translates the name and email associated with a commit before the totals are tallied.
There are a couple of different supported formats; the most comprehensive is:
correct name <correct@email> wrong name <wrong@email>
(incl. the < and > signs!)
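For reference, the sketch below writes one rule in each of the four forms described in the gitmailmap documentation and asks git check-mailmap how a given identity would be translated. All people and addresses are made up, and each rule deliberately targets a different person so that the rules do not interact.

```shell
# Scratch repository; every identity in the .mailmap below is hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

cat > .mailmap <<'EOF'
# 1. Fix only the name for commits that carry this email.
Jane Doe <jane@example.com>
# 2. Fix only the email, keeping whatever name the commit has.
<bob@example.com> <bob@old.example>
# 3. Fix name and email, matching on the old email alone.
Carol Roe <carol@example.com> <carol@old.example>
# 4. Fix name and email, matching on the old name AND the old email.
Dan Poe <dan@example.com> dpoe <dan@old.example>
EOF

# git check-mailmap prints the identity after the mapping has been applied.
git check-mailmap 'jd <jane@example.com>'
git check-mailmap 'bobby <bob@old.example>'
git check-mailmap 'cr <carol@old.example>'
git check-mailmap 'dpoe <dan@old.example>'
# The name does not match rule 4, so this identity should come back unchanged.
git check-mailmap 'someone else <dan@old.example>'
```

The fourth form is the one used in this article. It is the most specific, since it only rewrites commits where both the old name and the old email match.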
Let's create a file called .mailmap with the following content:
Dan Walmsley <dan@walms.co.uk> danwalmsley <dan@walms.co.uk>
and run the command again:
git shortlog -sne
we will now get:
4391 Steven Kirk <grokys@gmail.com>
4267 Dan Walmsley <dan@walms.co.uk>
2761 Max Katz <maxkatz6@outlook.com>
2092 Nikita Tsukanov <keks9n@gmail.com>
2023 Steven Kirk <grokys@users.noreply.github.com>
786 Jumar Macato <hikari.netto23@gmail.com>
766 Benedikt Stebner <Gillibald@users.noreply.github.com>
...
We can see that git has combined the two different names and email combinations into a single contributor! To recap, we previously got:
...
3226 Dan Walmsley <dan@walms.co.uk>
...
1041 danwalmsley <dan@walms.co.uk>
but now, thanks to our .mailmap file, we get:
...
4267 Dan Walmsley <dan@walms.co.uk>
So git has automatically read in the .mailmap file and used it to map the names of the commit authors before tallying them up.
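The same before-and-after effect can be reproduced in a scratch repository. This is a minimal sketch with an invented contributor and a hypothetical commit_as helper; it assumes the .mailmap file sits in the top-level directory of the working tree, and it passes HEAD explicitly so that git shortlog also works when stdin is not a terminal.

```shell
# Scratch repository; the contributor is hypothetical.
repo=$(mktemp -d)
cd "$repo" || exit 1
git init -q

# Hypothetical helper: empty commit authored and committed by one identity.
commit_as() {
    GIT_AUTHOR_NAME="$1" GIT_AUTHOR_EMAIL="$2" \
    GIT_COMMITTER_NAME="$1" GIT_COMMITTER_EMAIL="$2" \
    git commit -q --allow-empty -m "$3"
}

commit_as 'Jane Doe' 'jane@example.com' 'First change'
commit_as 'Jane Doe' 'jane@example.com' 'Second change'
commit_as 'jdoe'     'jane@example.com' 'Third change'

before=$(git shortlog -sne HEAD)

# Correct identity first, then the identity that should be replaced.
printf '%s\n' 'Jane Doe <jane@example.com> jdoe <jane@example.com>' > .mailmap

after=$(git shortlog -sne HEAD)

printf 'Before:\n%s\n\nAfter:\n%s\n' "$before" "$after"
```

Before the mapping there are two rows, and afterwards a single row holding all three commits. No commit was rewritten; only the report changed.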
Looking at the output, we can also spot the next duplicate. Let's add a mapping for it to the .mailmap file as well:
Dan Walmsley <dan@walms.co.uk> danwalmsley <dan@walms.co.uk>
Steven Kirk <grokys@gmail.com> Steven Kirk <grokys@users.noreply.github.com>
and let's try again:
6414 Steven Kirk <grokys@gmail.com>
4267 Dan Walmsley <dan@walms.co.uk>
2761 Max Katz <maxkatz6@outlook.com>
2092 Nikita Tsukanov <keks9n@gmail.com>
786 Jumar Macato <hikari.netto23@gmail.com>
We can see that this approach is working!
It requires a little upfront work to set up the mappings, but once they are done, it's easy to update the statistics as it only requires running git shortlog -sne again.