This notebook demonstrates the following functions from metric.R,
which measure developer engagement and productivity:
engagement_communication(),
productivity_author_commits(), and
productivity_author_churn().
The engagement_sentiment() metric is covered separately
in sentiment_analysis.Rmd, as it requires an additional
pipeline to obtain the data.
rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
require(data.table)
require(yaml)
require(stringi)
require(knitr)
require(magrittr)
require(jsonlite)
require(ggplot2)
Analyzing open source projects often requires some manual work on your part to find where the project hosts its codebase and mailing list. Instead of hard-coding this information in notebooks, we keep it in a project configuration file. Here is the minimal information this notebook requires in a project configuration file:
data_path:
  project_website: https://apr.apache.org/
  git_url: https://github.com/apache/apr
  git: ../rawdata/git_repo/APR/.git
filter:
  keep_filepaths_ending_with:
    - cpp
    - c
    - h
    - java
    - js
    - py
    - cc
  remove_filepaths_containing:
    - test
As you can see, the project configuration file is a simple bullet list. This is by design: we want the files to be human readable, so you can share them by e-mail, include them as an appendix, or even attach them as supplemental material at a conference. To make the files easier to format and comment, we use YAML instead of plain .txt or Markdown.
This file tells the R Notebook where to find the git log on your
computer, and where you got it from in the first place. Currently we do
not use project_website or git_url, but recording them is strongly
encouraged: in the past we have encountered projects with multiple
mirrors whose git logs contained discrepancies, which made reproducing
prior analyses much harder.
In the project configuration file we also specify filters. They tell
this notebook which files to keep based on what follows the . in a
filename (i.e. what it ends with), and which files to discard based on
any word within the entire filepath. For example, in APR unit tests are
prefixed with the word _test. In trying to reproduce related work, we
found that neglecting file filters blew metrics such as churn out of
proportion, because they then included many nonsensical file changes.
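As a rough illustration of the two filters described above, the sketch below applies an extension whitelist and a substring blacklist to a handful of file paths using base R only. The helpers keep_by_extension() and remove_by_substring() are hypothetical names introduced here, not Kaiaulu's filter functions, and the file paths are made up for the example.

```r
# Hypothetical helpers for illustration; these are NOT Kaiaulu's filter_by_* functions.
keep_by_extension <- function(filepaths, extensions) {
  # tools::file_ext() returns the text after the last "." in each path
  filepaths[tolower(tools::file_ext(filepaths)) %in% extensions]
}

remove_by_substring <- function(filepaths, substrings) {
  # Flag a path if it contains any of the substrings (literal match, no regex)
  flagged <- vapply(
    filepaths,
    function(p) any(vapply(substrings, grepl, logical(1), x = p, fixed = TRUE)),
    logical(1)
  )
  filepaths[!flagged]
}

# Made-up file paths, for illustration only
filepaths <- c(
  "file_io/unix/open.c",
  "include/apr_file_io.h",
  "test/testfile.c",
  "docs/APRDesign.html",
  "build/gen-build.py"
)

kept <- remove_by_substring(
  keep_by_extension(filepaths, c("cpp", "c", "h", "java", "js", "py", "cc")),
  "test"
)
print(kept)
```

Here the .html file is dropped by the extension filter and the path containing "test" is dropped by the substring filter, so only source files outside the test folder remain.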
The file makes all assumptions explicit to you when using the code.
Note these assumptions are not universal: they are particular to this
project alone. This is why they live in a project configuration file
instead of the codebase. The /conf folder of the Kaiaulu git repository
includes a few existing projects with that information. The idea is
that we save time in the long run by not having to look up the project
website manually again.
The following code block reads the information explained just now:
tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/kaiaulu.yml")
perceval_path <- get_tool_project("perceval", tool)
git_repo_path <- get_git_repo_path(conf)
git_branch <- get_git_branches(conf)[1]
nvdfeed_folder_path <- get_nvdfeed_folder_path(conf)
# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)
This is all the project configuration file is used for. If you
inspect the variables above, you will see they are just strings. As a
reminder, tools.yml is where you store the file paths to third party
software on your computer. Please see Kaiaulu’s README.md for details.
As a rule of thumb, every R Notebook in Kaiaulu loads the project
configuration file at the start, much like you would normally
initialize variables at the start of your source code.
We can obtain communication data, such as issue and pull request (PR)
comments from GitHub, by following download_github_issue_comments.Rmd.
We will focus on “Issues and PR Comments by Date Range” for the Kaiaulu
repository. The updated_lower_bound_comment was set to “2026-01-23” so
that we grab issue and PR comments from Kaiaulu since that date. You
can change the value of updated_lower_bound_comment to obtain comments
from a different date range. We then read the downloaded JSON files in
as data tables and bind them into a single data table, which is the
input data for the engagement_communication() function. You can change
the save_path_issue_or_pr_comments variable to the path where you saved
the JSON files for issue and PR comments on your computer.
# Set the path to the folder where you saved the json files for issue and PR comments
save_path_issue_or_pr_comments <- "../../rawdata/kaiaulu/github/sailuh_kaiaulu/issue_or_pr_comment/"
comment_files <- list.files(
  save_path_issue_or_pr_comments,
  full.names = TRUE,
  pattern = "\\.json$"
)
comment_json_list <- lapply(comment_files, jsonlite::read_json)
comment_dt_list <- lapply(comment_json_list, kaiaulu::github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- data.table::rbindlist(comment_dt_list, fill = TRUE)
# Ensure UTC timestamp type
all_issue_or_pr_comments[, created_at := as.POSIXct(created_at, tz = "UTC")]
# If you want to see the head of the table for our comments data below
# kable(head(all_issue_or_pr_comments, 2))
Here, we define the rolling window as 90 days (around 3 months), which
is the default value of the quit_lag parameter of the
engagement_communication() function. To use a different window, pass a
different value for quit_lag in the function call. The output table
contains the number of messages each developer sent in the preceding 90
days at each point in time, which is one way to measure their
engagement with the project. Note that the engagement_communication()
function can be used with any type of communication data, such as issue
comments or pull request comments.
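To make the idea of a rolling message count concrete, here is a minimal base R sketch. It is an illustration under stated assumptions, not the source of engagement_communication(): the helper rolling_message_count() is a hypothetical name, the data is made up, and the choice to count the current message and to exclude messages exactly window_days old is an assumption of this sketch.

```r
# Hypothetical helper for illustration; NOT the engagement_communication() implementation.
# For each message, count the same author's messages in the window_days
# leading up to (and including) that message.
rolling_message_count <- function(timestamp, author_login, window_days = 90) {
  window_secs <- window_days * 24 * 60 * 60
  counts <- integer(length(timestamp))
  for (i in seq_along(timestamp)) {
    # Seconds elapsed between every message and message i (negative = later message)
    elapsed <- as.numeric(difftime(timestamp[i], timestamp, units = "secs"))
    counts[i] <- sum(
      author_login == author_login[i] & elapsed >= 0 & elapsed < window_secs
    )
  }
  data.frame(author_login, timestamp, message_count = counts)
}

# Made-up example data
timestamp <- as.POSIXct(
  c("2021-01-01", "2021-01-15", "2021-03-01", "2021-06-01", "2021-01-10"),
  tz = "UTC"
)
author_login <- c("alice", "alice", "alice", "alice", "bob")

out <- rolling_message_count(timestamp, author_login, window_days = 90)
print(out)
```

In this toy data the count for alice grows to 3 by 2021-03-01 and falls back to 1 on 2021-06-01, because her earlier messages are by then more than 90 days old.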
quit_lag <- 90
engagement_communication_output <- engagement_communication(
  timestamp = all_issue_or_pr_comments$created_at,
  author_login = all_issue_or_pr_comments$comment_user_login,
  quit_lag = quit_lag
)
kable(head(engagement_communication_output))
| author_login | timestamp | message_count |
|---|---|---|
| BenjyNStrauss | 2024-08-02 | 3 |
| BenjyNStrauss | 2024-08-03 | 4 |
| CorneJB | 2021-05-06 | 1 |
| CorneJB | 2021-05-11 | 2 |
| CorneJB | 2021-05-15 | 4 |
| CorneJB | 2021-05-16 | 5 |
To visualize the output table as a time series, we can use the ggplot2
package. Visualizing the engagement_communication() metric as a time
series is useful for observing the engagement of an author over time
through the number of messages they send. In conjunction with the
productivity metrics, the engagement metrics can help us understand
whether an author is becoming more or less engaged over time, and
whether their sentiment scores reflect that increase or decrease in
engagement. As a reminder, the time series below covers the
communication data we obtained from GitHub issue and PR comments since
“2026-01-23”, as explained in the previous section.
ggplot(engagement_communication_output, aes(x = timestamp, y = message_count, color = author_login)) +
  geom_line() +
  labs(
    title = paste0("Author Engagement on ", quit_lag, "-day Time Window"),
    x = "Date",
    y = "Number of Messages",
    color = "Author"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
Our first step is to parse the git log. Many of the variables in Kaiaulu are tables, which makes it easier to export and manually inspect the data at any step of the analysis.
git_checkout(git_branch,git_repo_path)
## [1] "Your branch is up to date with 'origin/master'."
project_git <- parse_gitlog(perceval_path,git_repo_path)
#project_git <- parse_gitlog(perceval_path,git_repo_path,save_path)
#project_git <- readRDS(save_path)
We may also want to filter files, to include only source code and exclude test files for example.
project_git <- project_git %>%
  filter_by_file_extension(file_extensions,"file_pathname") %>%
  filter_by_filepath_substring(substring_filepath,"file_pathname")
We must first convert all timestamps to the same timezone. Here, we use UTC.
project_git$author_datetimetz <- as.POSIXct(
  project_git$author_datetimetz,
  format = "%a %b %d %H:%M:%S %Y %z",
  tz = "UTC"
)
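The following sketch shows what this conversion does to timestamps recorded with different UTC offsets. The two input strings are made up for illustration and simply follow the format string used above. Because %a and %b depend on the locale, the sketch switches LC_TIME to "C" first; that step is an assumption of this example and may be unnecessary on your machine.

```r
# %a (weekday) and %b (month) names are locale dependent; the "C" locale
# uses English names. This is an assumption of the sketch.
invisible(Sys.setlocale("LC_TIME", "C"))

# Made-up timestamps in the same layout as the format string above.
# Both describe the same instant, written with different UTC offsets.
raw <- c(
  "Sat May 23 05:12:20 2020 -1000",
  "Sat May 23 17:12:20 2020 +0200"
)

# %z reads the offset, and tz = "UTC" expresses the result in UTC
utc <- as.POSIXct(raw, format = "%a %b %d %H:%M:%S %Y %z", tz = "UTC")

print(format(utc, "%Y-%m-%d %H:%M:%S", tz = "UTC"))
print(utc[1] == utc[2])
```

Once both values are expressed in UTC they compare as equal, which is what allows commits from authors in different timezones to be ordered and windowed consistently.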
Now, we can obtain the output tables from productivity_author_commits
and productivity_author_churn. We save them to the variables
productivity_author_commits_output and productivity_author_churn_output
respectively, and keep lag at its default value of 90 days (around 3
months) to define the rolling window. To use a different window, pass a
different value in the function call. For example, for a 120-day
(around 4 months) rolling window, set lag = 120 in the function call.
lag <- 90
productivity_author_commits_output <- productivity_author_commits(project_git, lag = lag)
productivity_author_churn_output <- productivity_author_churn(project_git, lag = lag)
We’ll show the head of the output tables for the two functions below.
kable(head(productivity_author_commits_output))
| author_name_email | author_datetimetz | author_total_commits |
|---|---|---|
| Carlos Paradis carlosviansi@gmail.com | 2020-05-23 15:12:20 | 1 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 02:47:52 | 2 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 05:50:05 | 3 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 06:26:53 | 4 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 10:33:18 | 5 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-25 06:14:46 | 6 |
kable(head(productivity_author_churn_output))
| author_name_email | author_datetimetz | lines_added | lines_removed | author_churn |
|---|---|---|---|---|
| Carlos Paradis carlosviansi@gmail.com | 2020-05-23 15:12:20 | 35 | 0 | 35 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 02:47:52 | 59 | 2 | 61 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 05:50:05 | 103 | 6 | 109 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 06:26:53 | 105 | 6 | 111 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-24 10:33:18 | 132 | 15 | 147 |
| Carlos Paradis carlosviansi@gmail.com | 2020-05-25 06:14:46 | 149 | 30 | 179 |
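In the rows printed above, author_churn equals lines_added plus lines_removed. The short check below copies those six rows by hand and recomputes the sum. It is a sanity check on the printed output only, not a statement about how productivity_author_churn() is implemented.

```r
# Values copied by hand from the six rows of the churn table shown above
churn_head <- data.frame(
  lines_added   = c(35, 59, 103, 105, 132, 149),
  lines_removed = c(0, 2, 6, 6, 15, 30),
  author_churn  = c(35, 61, 109, 111, 147, 179)
)

# Recompute churn as added + removed and compare with the reported column
churn_head$recomputed <- churn_head$lines_added + churn_head$lines_removed
churn_matches <- all(churn_head$recomputed == churn_head$author_churn)
print(churn_matches)
```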
To visualize the output tables as a time series, we can use the ggplot2
package. Visualizing the productivity metrics as a time series is
useful for observing the engagement of an author over time through the
number of commits they make and their churn. In conjunction with the
engagement metrics, the productivity metrics can help us understand
whether an author is becoming more or less engaged over time, and
whether their sentiment scores reflect that increase or decrease in
engagement.
We can compare the time series of an author for all of the metrics across the rolling window. To do this, we will first select one author. Then, we will plot the metrics on the same graph to compare their trends over time.
one_author <- "Carlos Paradis <carlosviansi@gmail.com>"
one_author_login <- "carlosparadis"
# Filter to one author for engagement communication
engagement_dt <- engagement_communication_output[
  author_login == one_author_login,
  .(author_datetimetz = timestamp, author_total_messages = message_count)
][order(author_datetimetz)]
# Filter to one author for commits
commits_dt <- productivity_author_commits_output[
  author_name_email == one_author,
  .(author_datetimetz, author_total_commits)
][order(author_datetimetz)]
# Filter to one author for churn
churn_dt <- productivity_author_churn_output[
  author_name_email == one_author,
  .(author_datetimetz, author_churn)
][order(author_datetimetz)]
# Join on timestamp so all metrics share the same x-axis
plot_dt <- merge(commits_dt, churn_dt, by = "author_datetimetz", all = TRUE)
plot_dt <- merge(plot_dt, engagement_dt, by = "author_datetimetz", all = TRUE)
# Make long table for plotting all metrics
plot_long <- rbindlist(list(
  plot_dt[, .(author_datetimetz, metric = "Communication - Messages", value = author_total_messages)],
  plot_dt[, .(author_datetimetz, metric = "Commits", value = author_total_commits)],
  plot_dt[, .(author_datetimetz, metric = "Churn", value = author_churn)]
))
ggplot(plot_long, aes(x = author_datetimetz, y = value, color = metric)) +
  geom_line() +
  labs(
    title = paste0("Rolling Window Metrics on ", lag, "-day Time Window: ", one_author),
    x = "Date",
    y = "Count",
    color = "Metric"
  ) +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5))
## Warning: Removed 302 rows containing missing values (`geom_line()`).
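A warning of this kind is consistent with missing values introduced by the outer merges (all = TRUE): a timestamp that exists for one metric but not for another gets NA in the other metric's column. The base R sketch below reproduces the effect on made-up data. The column names and values are hypothetical, and dropping the NA rows before plotting is only one possible way to handle them.

```r
# Made-up data: commit and message observations at different timestamps
commits <- data.frame(
  when = as.POSIXct(c("2020-05-23 15:12:20", "2020-05-24 02:47:52"), tz = "UTC"),
  commits = c(1, 2)
)
messages <- data.frame(
  when = as.POSIXct("2020-05-24 00:00:00", tz = "UTC"),
  messages = 4
)

# Outer join: every timestamp from either table is kept, gaps become NA
merged <- merge(commits, messages, by = "when", all = TRUE)

# Long format, one row per (timestamp, metric) pair
long <- rbind(
  data.frame(when = merged$when, metric = "Commits", value = merged$commits),
  data.frame(when = merged$when, metric = "Messages", value = merged$messages)
)
print(sum(is.na(long$value)))

# One option (an assumption of this sketch): drop the NA rows before plotting
long_complete <- long[!is.na(long$value), ]
print(nrow(long_complete))
```

None of the three timestamps is shared between the two tables, so each merged row has exactly one NA, and the long table carries those NA values into the plot unless they are filtered out.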