library(dplyr)
library(ggplot2)
library(gh)
library(readr)
library(reshape2)
library(stringr)
library(tidyr)
library(urltools)
library(zoo)

Overview

Project version control commit histories are the authoritative open history of an open source community. That is not to say that contributions outside of commits are less important or interesting, but version control commit histories are clearly documented points in time associated with a project’s artifacts.

Traditional analysis of commit histories tend to focus on simple summary statistics like numbers of commits or lines of code often to create a leaderboard ranking of authors. This report takes history analysis a step further by looking at author trends from a macro level, rather than at an indvidual level. The goal of this analysis is to discover what the commit history can tell us about a project’s overall activity level, trends, and growth potential.

This report serves two purposes. The original inspiration came from a request to better understand how Github was determining the contributor metrics it was reporting on it’s website for the Tensorflow project. The second purpose is to test and refine participation metrics from other ongoing research efforts that could identify a high activity, growing project.

This notebook lives in Github and is the first in a series: https://github.com/missaugustina/tensorflow_reports

Commit History

Use the “git log” command to extract commit history. The “pretty” option allows the log output to be formatted.

git_log_cmd <- paste('cd ', params$tf_path, 
                     '; git log ', params$tf_sha, 
                     ' --merges ',
                     ' --date=short --pretty=tformat:"%cd|%cn|%ce|%h" > ', params$tf_out, sep='')
system(git_log_cmd)

git_log_cmd

## [1] "cd tensorflow; git log  --merges  --date=short --pretty=tformat:\"%cd|%cn|%ce|%h\" > ../data/tensorflow_gitlog_committers.txt"

tf_raw <- read.csv("data/tensorflow_gitlog_committers.txt", header = FALSE, sep = "|", quote="",
                     col.names=c("git_log_date", "name", "email", "sha"))

# add date formatting so these sort properly
tf_parsed_dates <- tf_raw %>% 
  mutate(commit_date=as.Date(git_log_date, tz="UTC"),
         commit_month=as.yearmon(commit_date),
         name=str_to_lower(name),
         email=str_to_lower(email)) %>% 
  select(commit_date, commit_month, name, email, sha) %>%
  separate(email, c("username", "host"), sep="@", remove=FALSE) %>%
  mutate(domain=suffix_extract(host)$domain,
         suffix=suffix_extract(host)$suffix,
         is_edu=str_detect(suffix, ".edu"))

top_committers <- tf_parsed_dates %>%
  group_by(email) %>%
  summarise(name=first(name), domain=first(domain), num_commits=n())

top_committers_domain_summary <- top_committers %>%
  group_by(domain) %>%
  summarise(num_commits=sum(num_commits), num_authors=n())

write_csv(top_committers, "data/tf_top_committers.csv")
write_csv(top_committers_domain_summary, "data/tf_top_committers_domain_summary.csv")

Normalize Authors

Authors use different names and email addresses, so to get an accurate count of unique authors, we need to consolidate them. This method uses the first email address found in the commit log for an author as the authoritative one. While other methods should be explored and compared for better accuracy, this is sufficient to identify unique authors.

Previous consolidation attempts found the following authors having deviations. Because they were a minority, they are simply updated manually.

# manual updates for known authors that make things complicated
tf_parsed_emails <- tf_parsed_dates %>% 
  mutate(email=ifelse(email=="danmane@gmail.com"| email=="dandelion@google.com",
                         "danmane@google.com", email),
         name=ifelse(name=="yuan (terry) tang", "terry tang", name))

# known email provider domains
email_provider_domains <- c("gmail.com", 
                            "users.noreply.github.com",
                            "hotmail.com",
                            "googlemail.com",
                            "qq.com",
                            "126.com",
                            "163.com",
                            "outlook.com",
                            "me.com",
                            "live.com",
                            "yahoo.com", # manual verification showed these are not yahoo employees
                            "yahoo.fr")

rm(tf_parsed_dates)

# create an authoritative email address
# all matching emails should have the same sha
tf_gh_authors_by_email <- tf_parsed_emails %>%
  arrange(desc(commit_date)) %>%
  group_by(email, name) %>%
  summarise(num_commits = n(), 
            last_commit=max(commit_date))

# TOOD: update to use igraph network
source("normalize_authors.R")
tf_gh_authors_lookup <- build_authors_lookup(tf_gh_authors_by_email, email_provider_domains)

tf_parsed_authors <- merge(tf_parsed_emails, tf_gh_authors_lookup, by=c("email"), all=TRUE)
tf_parsed_authors <- unique(tf_parsed_authors)

# check that each SHA only has one author
tf_parsed_authors_check <- tf_parsed_authors %>% group_by(sha) %>% mutate(n=n()) %>% filter(n>1)

# TODO use an assert
paste("Commits with more than one author (should be zero):", nrow(tf_parsed_authors_check))

## [1] "Commits with more than one author (should be zero): 0"

# exclude July for summaries
tf_parsed <- tf_parsed_authors

Email Domain

# Use host to identify author affiliation
# commits per host
tf_authors_hosts <- tf_parsed %>%
  group_by(author, email_id_host) %>%
  summarise(num_commits = n()) %>%
  group_by(email_id_host) %>%
  summarise(total_authors=n(), total_commits=sum(num_commits))

# Commits and authors per host per month
tf_authors_hosts_month <- tf_parsed %>%
  group_by(commit_month, author, email_id_host) %>%
  summarise(num_commits = n()) %>%
  group_by(commit_month, email_id_host) %>%
  summarise(num_authors=n(), num_commits=sum(num_commits))

tf_authors_hosts_month_merge <- merge(tf_authors_hosts_month, 
                                      tf_authors_hosts %>% 
                                        select(email_id_host, total_authors, total_commits),
                                      by="email_id_host")

Data Summaries

By Month

Commits are grouped by month and author to determine frequencies. Commit frequency bins are created using the rounded natural log of the total commits for the author. The output is a list of total commits and authors per month.

# commits per author
tf_commits_by_author <- tf_parsed %>% 
  group_by(commit_month, author) %>% 
  summarise(num_author_commits=n())

# authors per month
tf_authors <- tf_commits_by_author %>% 
  group_by(commit_month) %>% 
  summarise(num_authors=n())

# commits per month
tf_commits_by_month <- tf_parsed %>% group_by(commit_month) %>% 
  summarise(num_commits=n())

# merge authors and commits per month
tf_authors_commits = merge(tf_authors, tf_commits_by_month, by=c("commit_month"))

# merge authors + commits per month with commits per author
tf_commits_by_author <- merge(tf_commits_by_author, tf_authors_commits, by=c("commit_month"))

# Calculate commits per author percent and log
# the rounded log is used for frequency buckets 
tf_commits_by_author <- tf_commits_by_author %>% 
  mutate(commits_pct = num_author_commits/num_commits,
         author_commits_log = round(log(num_author_commits)))

# Calculate number of months the author shows up in
tf_commits_by_author <- tf_commits_by_author %>%
  group_by(author) %>%
  mutate(commits_months=n())

# determine min and max for each log bucket
tf_commits_by_author <- tf_commits_by_author %>% 
  group_by(author_commits_log) %>%
  mutate(commits_min = min(num_author_commits), 
         commits_max = max(num_author_commits))

# Authors with just one commit month
tf_commits_months_single <- tf_commits_by_author %>% 
  filter(commits_months == 1) %>%
  group_by(commit_month) %>%
  summarise(num_authors_single=n())

tf_authors_commits <- merge(tf_authors_commits, tf_commits_months_single, by=c("commit_month"))

# clean up
rm(tf_authors, tf_commits_by_month, tf_commits_months_single)

By Author

Commits are grouped by month and author and commit frequency bin.

# remove the "tensorflow bot"
tf_commits_by_author_nobot <- tf_commits_by_author %>% filter(author != "a. unique tensorflower")

tf_commits_by_author_summary <- tf_commits_by_author_nobot %>%
  group_by(author, author_commits_log) %>%
  summarise(
    commit_freq = n(),# Months Pattern was Observed
    commits_months = first(commits_months), 
    commit_freq_pct = round(commit_freq/commits_months, 2), # proportion of months pattern was observed
    author_commits_min = min(num_author_commits),
    author_commits_max = max(num_author_commits),
    commits_min = first(commits_min), 
    commits_max = first(commits_max)) %>%
  group_by(author) %>%
  mutate(
    freq_bins = n(), # number of months with different patterns
    freq_bins_pct = round(freq_bins/commits_months, 2)) # proportion of months w/ same patterns
# ex: 5 months -> commits log = 1, 1, 2, 3, 2 -> frequency = 1:2, 2:2, 3:1
# commit_freq_pct: 1:2/5, 2:2/5, 3:1/5 <- out of 5 months how often that commit log was observed
# freq_bins_pct: 3/5 months share the same commit log

Authors

Unique Authors per Month

The number of unique authors has grown steadily per month. With a huge spike in March 2017.

ggplot(data = tf_authors_commits, aes(x = factor(commit_month))) +
  geom_point(aes(y = num_authors, color="Authors"), group=1) +
  geom_line(aes(y = num_authors, color="Authors"), group=1) +
  ylab("Count") +
  xlab("Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Proportion of Authors by Email Domain

How many Google authors do we see? Tensorflow was originally a closed-source Google project that Google open sourced. It does not have an open governance model. Identifying the proportion of contributors that come from Google indicates the level of involvement from non-Google affliates.

tf_authors_hosts_month_google <- tf_authors_hosts_month %>% 
  mutate(is_google=ifelse(email_id_host=="google.com" | email_id_host=="tensorflow.org", 
                                 "Google",
                                 ifelse(email_id_host=="gmail.com", "Gmail", 
                                        ifelse(email_id_host=="github.com", "Github", "Not Google"))))

# All emails vs google.com
ggplot(tf_authors_hosts_month_google,
       aes(x=factor(commit_month), y=num_authors, fill=is_google)) + 
  geom_bar(stat="identity", position="dodge") +
  xlab("Months") +
  ylab("Authors") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Exclude Google + email-provider addresses
# more than one overall author (otherwise there are too many to plot)
tf_authors_hosts_merge_not_google <- tf_authors_hosts_month_merge %>% 
  filter( !email_id_host %in% email_provider_domains 
          & !(email_id_host %in% c("google.com",
                                     "tensorflow.org",
                                     "github.com",
                                     "petewarden.com",
                                     "naml.us",
                                     "andyselle.com",
                                   "gunan-glaptop2.roam.corp.google.com")))

ggplot(tf_authors_hosts_merge_not_google,
       aes(x=factor(commit_month), y=num_authors, fill=reorder(email_id_host, -total_authors))) + 
  geom_bar(stat="identity", position="dodge") +
  xlab("Months") +
  ylab("Authors") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  guides(fill=guide_legend(title="Hosts", ncol = 2))

tf_committers_not_google <- tf_authors_hosts_merge_not_google %>%
  select(email_id_host) %>%
  inner_join(tf_parsed_emails, by=c("email_id_host" = "host"))

write_csv(tf_committers_not_google, "data/tf_committers_not_google.csv")

Authors vs Commits

Unique Authors vs Total Commits per Month

Overall in the Tensorflow project, the number of authors and commits are not only increasing but appear to be correlated. Logically one can assume that more authors means more commits.

ggplot(data = tf_authors_commits, aes(x = factor(commit_month))) +
  geom_point(aes(y = num_authors*5, color="Authors (scaled)"), group=1) +
  geom_line(aes(y = num_authors*5, color="Authors (scaled)"), group=1) +
  geom_point(aes(y = num_commits, color="Commits"), group=2) +
  geom_line(aes(y = num_commits, color="Commits"), group=2) +
  ylab("Count") +
  xlab("Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data = tf_authors_commits, aes(x = num_commits, y = num_authors)) +
  geom_point() +
  geom_smooth(method="lm", se=FALSE) +
  xlab("Commits") +
  ylab("Authors")

Unique Authors vs Commits per Author per Month

How are commits per author distributed as the number of authors per month increases?

As the number of authors increases the proportion of commits becomes more evenly distributed, so we see a downward trend in percent of commits per author. Early on in the project, more authors had a higher proportion of the commits.

The second plot filters out the “A. Unique TensorFlower” bot which is clearly responsible for the largest proportion of commits as the number of authors increases. When internal Google commits are pushed upstream, authors that can’t be matched to Github accounts get this account instead.

ggplot(data = tf_commits_by_author, aes(x = num_authors)) +
  geom_jitter(aes(y=num_author_commits/2, color="Commits")) +
  geom_jitter(aes(y=commits_pct * 100, color="Commits %")) + # scaled for comparison
  geom_smooth(aes(y=commits_pct*100), method="lm", se=FALSE) +
  xlab("Authors") +
  ylab("Commits per Author (Scaled")

ggplot(data = tf_commits_by_author %>% filter(author != "a. unique tensorflower"), 
       aes(x = num_authors)) +
  geom_jitter(aes(y=num_author_commits/2, color="Commits")) +
  geom_jitter(aes(y=commits_pct*100, color="Commits %")) + # scaled for comparison
  geom_smooth(aes(y=commits_pct*100), method="lm", se=FALSE) +
  xlab("Authors") +
  ylab("Commits per Author (Scaled)")

Commits per Author

Commits Per Author Frequency

What is the overall distribution for commits per author? How many commits do most authors make per month?

Commits per author were grouped into bins based on the rounded natural log. The maximum value of that set is reported in the density plots below.

The highest per month commit frequency seen is 1-4 per month.

ggplot(tf_commits_by_author_nobot, aes(x=factor(commits_max))) + 
  geom_density() +
  xlab("Commits per Author")

How many months have authors been contributing? Earlier analysis indicated that there were less authors in 2015 when the project started than there are in 2017.

The majority of Authors only have commits for 6 or less months.

ggplot(tf_commits_by_author_nobot, aes(x=factor(commits_months))) + 
  geom_bar() +
  xlab("Author Months")

Commits per Author Deviation

Commit deviation is a possible indicator of new committer proportion. As commit activity per author increases, so does the variability of commit activity from month to month. A project with a higher proportion of contributors with fewer months of commit history could be an indication of high developer interest. “Drive bys” are contributions from people do not continue to stay involved with the project, although they may continue to consume it. A high number of “drive bys” could indicate a strong developer user base.

Do authors make roughly the same number of commits each month?

The plot below shows what percentage of months the author comitted on have the same commit activity. In other words, for what percentage of months was the commit pattern observed?

This density plot would indicate that authors are fairly consistent in terms of their monthly commit activity. This consistency is likely due to the fact that the majority of authors have less than 6 months of commits.

ggplot(tf_commits_by_author_summary, 
       aes(x=commit_freq_pct*100)) + 
  geom_density() +
  xlab("% of Months with Observed Commit Pattern")

The next plot looks at how much variation there was amongst commit patterns for authors for each month. What percentage of months share an observed commit pattern? If an author has the same commit pattern for 5 out of their 10 months, then 50% of that author’s months share a commit pattern. The higher this percentage, the lower the overall deviation.

This density plot would indicate that a majority of authors show 100% shared commit pattern. This consistency is likely due to the fact that the majority of authors have less than 6 months of commits.

ggplot(tf_commits_by_author_summary, 
       aes(x=freq_bins_pct*100)) + 
  geom_density() +
  xlab("% of Months with Shared Commit Pattern")

Do authors who have contributed for more than 6 months show more deviation in their commit patterns?

Authors with greater than 6 months of commit history represent about 7% of all authors (71/1002). The majority of authors only have one month of commit activity.

ggplot(tf_commits_by_author_summary %>% filter(commits_months > 6), 
       aes(x=commit_freq_pct*100)) + 
  geom_bar() +
  xlab("% of Months with Observed Commit Pattern") +
  xlim(0,100)

ggplot(tf_commits_by_author_summary %>% filter(commits_months > 6), 
       aes(x=freq_bins_pct*100)) + 
  geom_bar() +
  xlab("% of Months with Shared Commit Pattern") +
  xlim(0,100)

When did authors with only one commit month make that commit?

ggplot(data = tf_authors_commits, aes(x = factor(commit_month))) +
  geom_point(aes(y = num_authors_single, color="Singles"), group=2) +
  geom_line(aes(y = num_authors_single, color="Singles"), group=2) +
  ylab("Count") +
  xlab("Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

What proportion of authors each month do single authors represent?

Around 25-40% of authors each month will only ever commit for that month. When there are less authors in a particular month, the proportion of single month committers is significantly lower.

tf_authors_commits <- tf_authors_commits %>% mutate(authors_single_pct=num_authors_single/num_authors)

ggplot(data = tf_authors_commits, aes(x = factor(commit_month))) +
  geom_point(aes(y = num_authors, color="Authors"), group=1) +
  geom_line(aes(y = num_authors, color="Authors"), group=1) +
  geom_point(aes(y = num_authors_single, color="Singles"), group=2) +
  geom_line(aes(y = num_authors_single, color="Singles"), group=2) +
  ylab("Count") +
  xlab("Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggplot(data = tf_authors_commits, aes(x = factor(commit_month), y = authors_single_pct*100)) +
  geom_point(group=1) +
  geom_line(group=1) +
  ylab("% of Authors w/ 1 Month") +
  xlab("Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Commit Author Frequency vs Deviation

Another indicator of project participation levels is the overall number of commits per author compared with commit frequency. Do authors that commit more frequently tend to have significantly more commits as well? In the Tensorflow project we see a small increase.

ggplot(tf_commits_by_author_nobot %>% filter(commit_month > "Nov 2015"), 
       aes(x=factor(commits_months), y=round(commits_pct,2)*100, colour=factor(commits_max))) + 
  geom_jitter() +
  xlab("# Months w/ Commits") +
  ylab("% Commits per Author") +
  guides(colour=guide_legend(title="Commits"))

Commits per Author Percent Change

Early on in the project there were less authors as is indicated by the higher proportion of commits. As the project has progressed however, the number of commits is more distributed.

The “tensor-gardener” is the only “author” consistently responsible for a high proportion of the commit activity. This comes from Google internally when employees make commits and they can’t be linked to an existing Github account. The proportion of these “Google” commits appears to be gradually decreasing which could indicate an increase in non-Google authors.

ggplot(data = tf_commits_by_author, 
       aes(x = factor(commit_month), y = commits_pct, fill=author)) +
  geom_bar(stat="identity", position="stack") +
  ylab("Commits") +
  xlab("Month") +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

# Excluded the "bot" account
ggplot(data = tf_commits_by_author %>% filter(author != "a. unique tensorflower"), 
       aes(x = factor(commit_month), y = commits_pct, fill=author)) +
  geom_bar(stat="identity", position="stack") +
  ylab("Commits") +
  xlab("Month") +
  theme(legend.position="none") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Does the commit percentage change increase or decrease over time? That is, does an author need to make more or less commits to achieve the same commit proportion?

If project activity is gradually increasing, then we should see a decrease. That is, as more authors engage with a project and make more commits, an author needs to make more commits to acheive the same commit proportion. If project activity is becoming more distributed, authors who started out with a higher percentage of commits should gradually start representing a lower percentage. If, however, project activity is not well distributed, then we would contiue to see a high percentage of commits from the original authors, thus indicating a decreasing or neutral percent change.

ggplot(tf_commits_by_author_nobot %>% 
         filter(commit_month > "Nov 2015" & author_commits_log > 1 & author_commits_log < 5), 
       aes(x=factor(commit_month), y=commits_pct*100, colour=factor(commits_max))) + 
  geom_jitter() +
  xlab("Month") +
  ylab("Commits per Author (%)") +
  guides(colour=guide_legend(title="Commits")) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Is any one group responsible for most of the single-month contributors?

Authors commiting with google.com email domains are only responsible for a small proportion of the single commit month activity. While it’s the largest proportion of any non-email-provider host, it’s not likely that Google is dominating the single-month committer population.

tf_commits_hosts <- merge(tf_commits_by_author %>% filter(commits_months == 1), 
                          tf_parsed %>% select(author, commit_month, email_id_host, email_id),
                          by=c("author", "commit_month"))

tf_commits_hosts <- tf_commits_hosts %>% 
  group_by(commit_month, author, email_id_host) %>%
  summarise(commits=n(), email_id=first(email_id)) %>%
  group_by(commit_month, author) %>%
  mutate(num_hosts=n_distinct(email_id_host)) %>%
  group_by(commit_month, email_id_host) %>%
  mutate(host_authors=n_distinct(author)) %>%
  group_by(email_id_host) %>%
  mutate(total_authors=n_distinct(author))

ggplot(tf_commits_hosts %>% filter(host_authors > 1),
       aes(x=factor(commit_month), y=num_hosts, fill=reorder(email_id_host, -total_authors))) + 
  geom_bar(stat="identity") +
  xlab("Months") +
  ylab("Authors w/ 1 Commit Month") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  guides(fill=guide_legend(title="Hosts"))

Conclusions

The key advantage of git log analysis is that all of the information is available locally (once the repository has been cloned) and does not take up much space. This is particularly advantageous for older projects that may have a large number of commits.

The key disadvantage is the inability to easily consolidate authors. Company affiliations identified through email address also need to be normalized and a change in company needs to be properly indicated for the new months. This was not fully addressed in this analysis. Tensorflow is a fairly young project so it isn’t as significant here as it would be with an older project where more authors may have changed companies.

The analysis of Tensorflow’s commit history suggests that a growing project will trend towards a smaller proportion of overall commits per month per author. The overall number of commits per author on its own does not provide much insight into overall community health or growth. The proportion of commits per author and whether the same number of commits per author goes up or down in proportion, however, is an indicator of activity distribution and is a potentially useful metric.

Proposed Metrics

Commit Months

Frequency of months with at least one commit. Shows consistency in involvement over time.

Commit Delta

Rate of change in commit frequency. An overall low rate of change suggests a high proportion of new authors.

Commit Deviation

Frequency of change in commit frequency. A high frequency suggests a high proportion of new authors.

Commit Percent Change

Commit frequency percent change. A high decrease suggests an increase in overall commit distribution per author.

Next Steps

Apply the proposed metrics to a larger sample of projects to see how they compare.
Improve author normalization
Improve company-domain identification
Compare domain identification with Github profile company

References

“Why Google Stores Billions of Lines of Code in a Single Repository” Communications of the ACM, Vol. 59 No. 7, Pages 78-87. https://cacm.acm.org/magazines/2016/7/204032-why-google-stores-billions-of-lines-of-code-in-a-single-repository/fulltext
“How the TensorFlow team handles open source support” O’Reilly Ideas. May 4, 2017. https://www.oreilly.com/ideas/how-the-tensorflow-team-handles-open-source-support

Tensorflow Committers

Augustina Ragwitz

October 9, 2017