1 Introduction

This notebook presents the functionality of the following functions from metric.R that measure developer engagement: engagement_communication(), productivity_author_commits(), and productivity_author_churn().

The engagement_sentiment() metric is covered separately in sentiment_analysis.Rmd, as it requires an additional pipeline to obtain the data.

rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
require(data.table)
require(yaml)
require(stringi)
require(knitr)
require(magrittr)
require(jsonlite)
require(ggplot2)

2 Project Configuration File

Analyzing open source projects often requires some manual work to find where the project hosts its codebase and mailing list. Instead of hard-coding this information in Notebooks, we keep it in a project configuration file. Here’s the minimal information this Notebook requires in a project configuration file:

data_path:
  project_website: https://apr.apache.org/
  git_url: https://github.com/apache/apr
  git: ../rawdata/git_repo/APR/.git
  
filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc
  remove_filepaths_containing:
    - test

As you can see, the project configuration file is a simple bullet list. This is by design: we want the files to be human readable, so you can share them by e-mail, include them as an appendix, or even attach them as supplemental material in a conference. To facilitate formatting and commenting on files, we use YAML instead of plain .txt or markdown.

This file tells the R Notebook where to find the git log on your computer, and where you got it from in the first place. This Notebook does not currently use project_website or git_url, but recording them is strongly encouraged: we have encountered projects with multiple mirrors whose git logs contained discrepancies, which made reproducing prior analyses much harder.

In the project configuration file we also specify filters. They tell this notebook which file extensions to keep (i.e. what a filename ends with after the last .), and which files to discard based on any word appearing anywhere in the filepath. For example, in APR unit tests are prefixed with the word _test. In trying to reproduce related work, we found that neglecting file filters led to metrics such as churn being blown out of proportion, because they then include many nonsensical file changes.
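The effect of these two filters can be sketched in base R on a handful of made-up file paths. This is only an illustration of the idea, assuming that an extension is whatever follows the last dot and that a substring match anywhere in the path excludes a file. It is not Kaiaulu’s implementation, and example_paths, keep_extensions, and remove_substrings are hypothetical names.

```r
# Hypothetical file paths, for illustration only
example_paths <- c(
  "src/main.c",
  "include/api.h",
  "test/test_main.c",
  "docs/README.md",
  "tools/build.py"
)

# Values mirroring the `filter` section of the configuration file above
keep_extensions <- c("cpp", "c", "h", "java", "js", "py", "cc")
remove_substrings <- c("test")

# Keep a path only if the text after its last "." is a listed extension
has_kept_extension <- tools::file_ext(example_paths) %in% keep_extensions

# Flag a path if any listed substring occurs anywhere in it
contains_removed_substring <- Reduce(
  `|`,
  lapply(remove_substrings, function(s) grepl(s, example_paths, fixed = TRUE))
)

filtered_paths <- example_paths[has_kept_extension & !contains_removed_substring]
print(filtered_paths)
```

In this toy input, docs/README.md is dropped by the extension filter and test/test_main.c by the substring filter, even though the latter ends in an accepted extension.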

The file makes all assumptions explicit to you when using the code. Note that these assumptions are not universal: they are particular to this project alone. This is why they live in a project configuration file instead of the codebase. The /conf folder of the Kaiaulu git repository includes configuration files for a few existing projects. The idea is to save time in the long run by not having to look up the project website manually again.

The following code block reads the information explained just now:

tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/kaiaulu.yml")
perceval_path <- get_tool_project("perceval", tool)
git_repo_path <- get_git_repo_path(conf)
git_branch <- get_git_branches(conf)[1]
nvdfeed_folder_path <- get_nvdfeed_folder_path(conf)

# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)

This is all the project configuration files are used for. If you inspect the variables above, you will see they are just strings. As a reminder, the tools.yml is where you store the file paths to third party software on your computer. Please see Kaiaulu’s README.md for details. As a rule of thumb, any R Notebooks in Kaiaulu load the project configuration file at the start, much like you would normally initialize variables at the start of your source code.
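To make the “just strings” point concrete, here is a minimal sketch that parses a fragment of the example configuration with the yaml package directly. It deliberately does not use parse_config() or the getter functions above, whose internals are not shown in this notebook. The names example_yaml, example_conf, example_git_path, and example_extensions are hypothetical.

```r
library(yaml)

# Inline copy of part of the example configuration (illustration only)
example_yaml <- "
data_path:
  project_website: https://apr.apache.org/
  git_url: https://github.com/apache/apr
  git: ../rawdata/git_repo/APR/.git
filter:
  keep_filepaths_ending_with:
    - c
    - h
  remove_filepaths_containing:
    - test
"

example_conf <- yaml::yaml.load(example_yaml)

# The parsed configuration is a nested list whose leaves are plain strings
example_git_path <- example_conf[["data_path"]][["git"]]
example_extensions <- example_conf[["filter"]][["keep_filepaths_ending_with"]]

str(example_git_path)
str(example_extensions)
```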

3 Engagement Metric

3.1 Obtaining the Input Data

We can obtain communication data, such as issue and pull request (PR) comments from GitHub, by following download_github_issue_comments.Rmd. Here we focus on “Issues and PR Comments by Date Range” for the Kaiaulu repository. The updated_lower_bound_comment variable was set to “2026-01-23” so that we retrieve issue and PR comments from Kaiaulu since that date; change its value to obtain comments from a different date range. We then read the downloaded JSON files as data tables and bind them into a single data table, which serves as the input to the engagement_communication() function. Set save_path_issue_or_pr_comments to the folder where you saved the JSON files for issue and PR comments on your computer.

# Set the path to the folder where you saved the json files for issue and PR comments
save_path_issue_or_pr_comments <- "../../rawdata/kaiaulu/github/sailuh_kaiaulu/issue_or_pr_comment/"

comment_files <- list.files(
  save_path_issue_or_pr_comments,
  full.names = TRUE,
  pattern = "\\.json$"
)

comment_json_list <- lapply(comment_files, jsonlite::read_json)
comment_dt_list <- lapply(comment_json_list, kaiaulu::github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- data.table::rbindlist(comment_dt_list, fill = TRUE)
# Ensure UTC timestamp type
all_issue_or_pr_comments[, created_at := as.POSIXct(created_at, tz = "UTC")]
# If you want to see the head of the table for our comments data below
# kable(head(all_issue_or_pr_comments, 2))

3.2 Obtaining the Output Table

Here, we define the rolling window as 90 days (about 3 months), which is the default value of the quit_lag parameter of the engagement_communication() function. To use a different window, pass a different value for quit_lag in the function call. The output table contains the number of messages each developer sent in the past 90 days at each point in time, which is one way to measure their engagement with the project. Note that engagement_communication() can be used with any type of communication data, such as issue comments or pull request comments.

quit_lag <- 90
engagement_communication_output <- engagement_communication(
  timestamp = all_issue_or_pr_comments$created_at,
  author_login = all_issue_or_pr_comments$comment_user_login,
  quit_lag = quit_lag
)
kable(head(engagement_communication_output))
|author_login  |timestamp  | message_count|
|:-------------|:----------|-------------:|
|BenjyNStrauss |2024-08-02 |             3|
|BenjyNStrauss |2024-08-03 |             4|
|CorneJB       |2021-05-06 |             1|
|CorneJB       |2021-05-11 |             2|
|CorneJB       |2021-05-15 |             4|
|CorneJB       |2021-05-16 |             5|
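The idea of a trailing-window message count can be sketched on toy data in base R. The sketch assumes that each row’s count is the number of messages by the same author in the window of days up to and including that row’s date. Whether engagement_communication() defines its window boundaries exactly this way is not stated here, so treat this as an illustration rather than the function’s implementation. All names and values (toy_messages, toy_window_days, the authors and dates) are hypothetical.

```r
# Toy message log with hypothetical authors and dates
toy_messages <- data.frame(
  author_login = c("alice", "alice", "alice", "bob", "alice"),
  timestamp = as.Date(c("2021-01-01", "2021-02-01", "2021-03-15",
                        "2021-03-20", "2021-06-01")),
  stringsAsFactors = FALSE
)
toy_window_days <- 90

# For each message, count that author's messages in the trailing window
# (timestamp - toy_window_days, timestamp]
toy_messages$message_count <- vapply(
  seq_len(nrow(toy_messages)),
  function(i) {
    same_author <- toy_messages$author_login == toy_messages$author_login[i]
    age_days <- as.numeric(toy_messages$timestamp[i] - toy_messages$timestamp)
    sum(same_author & age_days >= 0 & age_days < toy_window_days)
  },
  integer(1)
)

print(toy_messages)
```

In this toy log, the count for the last message drops back down because the two earliest messages from the same author have aged out of the 90-day window.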

3.3 Visualizing the Output Tables as a Time Series

To visualize the output table as a time series, we can use the ggplot2 package. Plotting the engagement_communication() metric over time shows how an author’s engagement changes through the number of messages they send. Taken together with the productivity metrics, the engagement metrics can help us understand whether an author is becoming more or less engaged over time, and whether their sentiment scores reflect that change. As a reminder, the time series below covers the GitHub issue and PR comments obtained since “2026-01-23”, as explained above.

ggplot(engagement_communication_output, aes(x = timestamp, y = message_count, color = author_login)) +
  geom_line() +
  labs(title = paste0("Author Engagement on ", quit_lag, "-day Time Window"),
       x = "Date",
       y = "Number of Messages",
       color = "Author"
       ) +
       theme_minimal() +
       theme(plot.title = element_text(hjust = 0.5))

4 Productivity Metrics

4.1 Obtaining the Git Log

Our first step is to parse the git log. Many of the variables in Kaiaulu are tables, which makes it easier to export and manually inspect the data at any step of the analysis.

git_checkout(git_branch,git_repo_path)
## [1] "Your branch is up to date with 'origin/master'."
project_git <- parse_gitlog(perceval_path,git_repo_path)
#project_git <- parse_gitlog(perceval_path,git_repo_path,save_path)
#project_git <- readRDS(save_path)

4.2 Filter files

We may also want to filter files, to include only source code and exclude test files for example.

project_git <- project_git %>%
  filter_by_file_extension(file_extensions, "file_pathname") %>%
  filter_by_filepath_substring(substring_filepath, "file_pathname")

4.3 Obtaining the Output Tables

We must first convert all timestamps to the same timezone. Here, we use UTC.

project_git$author_datetimetz <- as.POSIXct(
  project_git$author_datetimetz,
  format = "%a %b %d %H:%M:%S %Y %z",
  tz = "UTC"
)
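As a small sketch of what this conversion does, the example below parses two made-up git-style timestamps that describe the same instant from different UTC offsets. It assumes the raw strings follow the same format as above and that the session uses English day and month names, since %a and %b are locale-dependent. The names example_raw and example_utc are hypothetical.

```r
# Two hypothetical git-style timestamps that describe the same instant
# from different UTC offsets
example_raw <- c(
  "Sat May 23 05:12:20 2020 -1000",
  "Sat May 23 17:12:20 2020 +0200"
)

# %z reads the offset, and tz = "UTC" expresses the result in UTC.
# %a and %b are locale-dependent; English day and month names are assumed.
example_utc <- as.POSIXct(
  example_raw,
  format = "%a %b %d %H:%M:%S %Y %z",
  tz = "UTC"
)

print(example_utc)
# Both elements should compare equal once expressed in UTC
print(example_utc[1] == example_utc[2])
```

Without a common timezone, commits made at the same moment by authors in different offsets would be ordered by their local clock times, which would distort any rolling window.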

Now we can obtain the output tables from productivity_author_commits() and productivity_author_churn(). We save them to the variables productivity_author_commits_output and productivity_author_churn_output, respectively, and keep the default lag of 90 days (about 3 months) to define the rolling window. To use a different window, pass a different value for lag in the function call; for example, lag = 120 gives a 120-day (about 4 months) rolling window.

lag <- 90
productivity_author_commits_output <- productivity_author_commits(project_git, lag = lag)
productivity_author_churn_output <- productivity_author_churn(project_git, lag = lag)

We’ll show the head of the output tables for the two functions below.

kable(head(productivity_author_commits_output))
|author_name_email |author_datetimetz   | author_total_commits|
|:-----------------|:-------------------|--------------------:|
|Carlos Paradis    |2020-05-23 15:12:20 |                    1|
|Carlos Paradis    |2020-05-24 02:47:52 |                    2|
|Carlos Paradis    |2020-05-24 05:50:05 |                    3|
|Carlos Paradis    |2020-05-24 06:26:53 |                    4|
|Carlos Paradis    |2020-05-24 10:33:18 |                    5|
|Carlos Paradis    |2020-05-25 06:14:46 |                    6|
kable(head(productivity_author_churn_output))
|author_name_email |author_datetimetz   | lines_added| lines_removed| author_churn|
|:-----------------|:-------------------|-----------:|-------------:|------------:|
|Carlos Paradis    |2020-05-23 15:12:20 |          35|             0|           35|
|Carlos Paradis    |2020-05-24 02:47:52 |          59|             2|           61|
|Carlos Paradis    |2020-05-24 05:50:05 |         103|             6|          109|
|Carlos Paradis    |2020-05-24 06:26:53 |         105|             6|          111|
|Carlos Paradis    |2020-05-24 10:33:18 |         132|            15|          147|
|Carlos Paradis    |2020-05-25 06:14:46 |         149|            30|          179|
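In every row of the churn table above, author_churn equals lines_added plus lines_removed (for example, 59 + 2 = 61). The sketch below shows one way such rolling totals could be computed on toy data. It assumes each total sums per-commit line counts over the trailing window ending at that commit; it is not the code of productivity_author_churn(). The per-commit values and the names toy_commits, toy_lag_days, and rolling_sum are hypothetical.

```r
# Toy per-commit line counts for one hypothetical author
toy_commits <- data.frame(
  author_datetimetz = as.POSIXct(
    c("2020-01-01 10:00:00", "2020-01-15 09:30:00", "2020-05-01 12:00:00"),
    tz = "UTC"
  ),
  lines_added = c(35, 24, 10),
  lines_removed = c(0, 2, 5)
)
toy_lag_days <- 90

# Sum a per-commit column over the trailing window ending at each commit
rolling_sum <- function(times, values, lag_days) {
  vapply(seq_along(times), function(i) {
    age_days <- as.numeric(difftime(times[i], times, units = "days"))
    sum(values[age_days >= 0 & age_days < lag_days])
  }, numeric(1))
}

toy_commits$rolling_added <- rolling_sum(
  toy_commits$author_datetimetz, toy_commits$lines_added, toy_lag_days
)
toy_commits$rolling_removed <- rolling_sum(
  toy_commits$author_datetimetz, toy_commits$lines_removed, toy_lag_days
)
# Churn is taken here as added plus removed lines within the window
toy_commits$rolling_churn <- toy_commits$rolling_added + toy_commits$rolling_removed

print(toy_commits)
```

The third toy commit falls more than 90 days after the first two, so its rolling totals contain only its own lines.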

4.4 Visualizing the Output Tables as a Time Series

To visualize the output tables as a time series, we can use the ggplot2 package. Plotting the productivity metrics over time shows how an author’s productivity changes through the number of commits they make and their churn. Taken together with the engagement metrics, the productivity metrics can help us understand whether an author is becoming more or less engaged over time, and whether their sentiment scores reflect that change.

4.4.1 Plotting the Time Series for the Top Authors

We’ll select the top 10 authors based on their number of commits and churn, and then plot the time series for these authors.

top_n_authors <- 10

The time series below shows the rolling window for author productivity by unique commits.

# Compute the top authors based on unique commits
top_authors_commits <- productivity_author_commits_output[
    ,.(max_commits = max(author_total_commits, na.rm = TRUE)),
    by = author_name_email
    # Sort max commits in descending order and select the top n authors
][order(-max_commits)][1:top_n_authors, author_name_email]

# Select only the rows corresponding to the top authors based on unique commits
plot_commits_dt <- productivity_author_commits_output[
    author_name_email %in% top_authors_commits
    ][order(author_name_email, author_datetimetz)]

ggplot(plot_commits_dt, aes(x = author_datetimetz, y = author_total_commits, color = author_name_email)) +
    geom_line() +
    labs(title = paste0("Author Productivity on ", lag, "-day Time Window"),
         x = "Date",
         y = "Number of Commits",
         color = "Author"
         ) +
         theme_minimal() +
         theme(plot.title = element_text(hjust = 0.5))

The time series below shows the rolling window for author productivity by churn.

# Compute the top authors based on churn
top_authors_churn <- productivity_author_churn_output[
    ,.(max_churn = max(author_churn, na.rm = TRUE)),
    by = author_name_email
    # Sort max churn in descending order and select the top n authors
][order(-max_churn)][1:top_n_authors, author_name_email]

# Select only the rows corresponding to the top authors based on churn
plot_churn_dt <- productivity_author_churn_output[
   author_name_email %in% top_authors_churn
   ][order(author_name_email, author_datetimetz)]


ggplot(plot_churn_dt, aes(x = author_datetimetz, y = author_churn, color = author_name_email)) +
    geom_line() +
    labs(title = paste0("Author Productivity on ", lag, "-day Time Window"),
         x = "Date",
         y = "Churn",
         color = "Author"
         ) +
         theme_minimal() +
         theme(plot.title = element_text(hjust = 0.5))

5 Comparing the Time Series of Different Engagement Metrics

We can compare the time series of an author for all of the metrics across the rolling window. To do this, we will first select one author. Then, we will plot the metrics on the same graph to compare their trends over time.

one_author <- "Carlos Paradis <carlosviansi@gmail.com>"
one_author_login <- "carlosparadis"

# Filter to one author for engagement communication
engagement_dt <- engagement_communication_output[
  author_login == one_author_login,
  .(author_datetimetz = timestamp, author_total_messages = message_count)
][order(author_datetimetz)]

# Filter to one author for commits
commits_dt <- productivity_author_commits_output[
  author_name_email == one_author,
  .(author_datetimetz, author_total_commits)
][order(author_datetimetz)]

# Filter to one author for churn
churn_dt <- productivity_author_churn_output[
  author_name_email == one_author,
  .(author_datetimetz, author_churn)
][order(author_datetimetz)]

# Join on timestamp so all metrics share the same x-axis
plot_dt <- merge(commits_dt, churn_dt, by = "author_datetimetz", all = TRUE)
plot_dt <- merge(plot_dt, engagement_dt, by = "author_datetimetz", all = TRUE)

# Make long table for plotting all metrics
plot_long <- rbindlist(list(
  plot_dt[, .(author_datetimetz, metric = "Communication - Messages", value = author_total_messages)],
  plot_dt[, .(author_datetimetz, metric = "Commits", value = author_total_commits)],
  plot_dt[, .(author_datetimetz, metric = "Churn",   value = author_churn)]
))

ggplot(plot_long, aes(x = author_datetimetz, y = value, color = metric)) +
  geom_line() +
  labs(
    title = paste0("Rolling Window Metrics on ", lag, "-day Time Window: ", one_author),
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 302 rows containing missing values (`geom_line()`).
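The warning above most likely stems from the outer merge: with all = TRUE, every timestamp from any of the tables is kept, and a metric with no observation at a given timestamp is filled with NA, which geom_line() then drops. The sketch below reproduces this on toy data and shows melt() with na.rm = TRUE as one possible way to build the long table without those rows. All table names and values are hypothetical.

```r
library(data.table)

# Toy per-metric tables whose timestamps only partly overlap
toy_commits_dt <- data.table(
  author_datetimetz = as.POSIXct(
    c("2020-05-23 15:00:00", "2020-05-24 02:00:00"), tz = "UTC"
  ),
  author_total_commits = c(1L, 2L)
)
toy_messages_dt <- data.table(
  author_datetimetz = as.POSIXct(
    c("2020-05-24 02:00:00", "2020-05-25 00:00:00"), tz = "UTC"
  ),
  author_total_messages = c(4L, 5L)
)

# all = TRUE keeps timestamps found in either table and fills the gaps with NA
toy_plot_dt <- merge(
  toy_commits_dt, toy_messages_dt,
  by = "author_datetimetz", all = TRUE
)

# melt() is an alternative to binding one sub-table per metric;
# na.rm = TRUE drops the NA rows before they reach the plot
toy_plot_long <- melt(
  toy_plot_dt,
  id.vars = "author_datetimetz",
  variable.name = "metric",
  value.name = "value",
  na.rm = TRUE
)

print(toy_plot_dt)
print(toy_plot_long)
```

Dropping the NA rows before plotting also lets geom_line() connect consecutive observations of the same metric instead of breaking the line at each gap.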