This notebook presents the functionality of the following functions
from metric.R that measure developer engagement:
engagement_communication(),
productivity_author_commits(), and
productivity_author_churn().
The engagement_sentiment() metric is covered separately
in sentiment_analysis.Rmd, as it requires an additional
pipeline to obtain the data.
rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
require(data.table)
require(yaml)
require(stringi)
require(knitr)
require(magrittr)
require(jsonlite)
require(ggplot2)
require(gt)
Analyzing open source projects often requires some manual work on your part to find where the project hosts its codebase and mailing list. Instead of hard-coding this in each Notebook, we keep this information in a project configuration file. Here’s the minimal information this Notebook requires in a project configuration file:
data_path:
project_website: https://apr.apache.org/
git_url: https://github.com/apache/apr
git: ../rawdata/git_repo/APR/.git
filter:
keep_filepaths_ending_with:
- cpp
- c
- h
- java
- js
- py
- cc
remove_filepaths_containing:
- test
As you can see, the project configuration file is a simple bullet list. This is by design: we want the files to be human-readable, so you can share them by e-mail, include them as an appendix, or even attach them as supplemental material in a conference. To make the files easy to format and comment, we use YAML instead of plain .txt or Markdown.
What this file tells the R Notebook is where to find the git log on
your computer, and where you got it from in the first place. Currently
we do not use project_website or git_url, but recording them is
strongly encouraged: in the past we have encountered projects with
multiple mirrors whose git logs contained discrepancies, which made
reproducing prior analyses much harder.
In the project configuration file we also specify filters. They tell
this notebook which files to keep based on what follows the . in a
filename (i.e. what it ends with), and which files to discard based on
any word appearing anywhere in the filepath. For example, in APR unit
tests are prefixed with the word _test. When trying to reproduce
related work, we found that neglecting file filters blew metrics such
as churn out of proportion, because they then included many
nonsensical file changes.
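The effect of the two filters can be sketched with base R alone. This is a minimal illustration on made-up file paths, not Kaiaulu’s implementation: the paths are hypothetical, and the variable names simply mirror the ones used later in this notebook.

```r
# Hypothetical file paths, for illustration only
file_pathname <- c("src/apr_pools.c", "include/apr.h", "test/testpools.c",
                   "docs/README.md", "build/gen_test_char.c")

# Values mirroring the filter section of the configuration file
file_extensions <- c("cpp", "c", "h", "java", "js", "py", "cc")
substring_filepath <- c("test")

# Keep only paths whose extension (text after the last ".") is listed
keep_extension <- tolower(tools::file_ext(file_pathname)) %in% file_extensions

# Flag paths containing any unwanted substring anywhere in the path
has_substring <- Reduce(
  `|`,
  lapply(substring_filepath, function(s) grepl(s, file_pathname, fixed = TRUE)),
  rep(FALSE, length(file_pathname))
)

filtered <- file_pathname[keep_extension & !has_substring]
print(filtered)
```

Note that build/gen_test_char.c is dropped as well, because the substring filter looks at the whole path and not only at the folder name.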
The file makes all assumptions explicit to you when using the code.
Note these assumptions are not universal: they are particular to this
project alone. This is why they live in a project configuration file
instead of the codebase. The /conf folder of Kaiaulu’s git repository
includes a few existing projects with that information. The idea is
that we save time in the long run by not having to look up the
information on the project website again.
The following code block reads the information explained just now:
tool <- parse_config("../tools.yml")
conf <- parse_config("../conf/kaiaulu.yml")
perceval_path <- get_tool_project("perceval", tool)
git_repo_path <- get_git_repo_path(conf)
git_branch <- get_git_branches(conf)[1]
# Filters
file_extensions <- get_file_extensions(conf)
substring_filepath <- get_substring_filepath(conf)
lag <- 30
This is all the project configuration files are used for. If you
inspect the variables above, you will see they are just strings. As a
reminder, the tools.yml is where you store the file paths
to third party software on your computer. Please see Kaiaulu’s
README.md for details. As a rule of thumb, every R Notebook in Kaiaulu
loads the project configuration file at the start, much like you would
normally initialize variables at the start of your source code.
We can obtain comments, such as issue and pull request (PR) comments
from GitHub, by following download_github_issue_comments.Rmd. We will
focus on “Issues and PR Comments by Date Range” for the Kaiaulu repository.
The updated_lower_bound_comment was set to the default of
2024-04-25 so that we can grab issue and PR comments from
Kaiaulu since that date. You can change the value of
updated_lower_bound_comment to obtain comments from a
different date range. We’ll read in the json files we downloaded as data
tables and bind them to one data table as our input data for the
engagement_communication() function. You can change the
save_path_issue_or_pr_comments variable to the path where
you saved the json files for issue and PR comments on your computer.
# Set the path to the folder where you saved the json files for issue and PR comments
save_path_issue_or_pr_comments <- "../../rawdata/kaiaulu/github/sailuh_kaiaulu/issue_or_pr_comment/"
comment_files <- list.files(
save_path_issue_or_pr_comments,
full.names = TRUE,
pattern = "\\.json$"
)
comment_json_list <- lapply(comment_files, jsonlite::read_json)
comment_dt_list <- lapply(comment_json_list, kaiaulu::github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- data.table::rbindlist(comment_dt_list, fill = TRUE)
# Ensure UTC timestamp type
all_issue_or_pr_comments[, created_at := as.POSIXct(created_at, tz = "UTC")]
head(all_issue_or_pr_comments) %>%
gt(auto_align = FALSE)
| comment_id | html_url | issue_url | created_at | updated_at | comment_user_login | author_association | body |
|---|---|---|---|---|---|---|---|
| 615692483 | https://github.com/sailuh/kaiaulu/issues/2#issuecomment-615692483 | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | 2020-04-18 | 2020-04-18T08:48:14Z | carlosparadis | MEMBER | Quick search for Ctags on Codeface to see where it is used, in hoping to find how functions can be extracted efficiently across an entire git log: https://github.com/siemens/codeface/search?q=ctags&unscoped_q=ctags |
| 615702209 | https://github.com/sailuh/kaiaulu/issues/2#issuecomment-615702209 | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | 2020-04-18 | 2020-04-18T07:49:34Z | carlosparadis | MEMBER | Code logic on how Codeface parse Exuberant Ctags to identify functions on source code using the `python ctags` package linked above: https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/VCS.py#L1370-L1421 The limitation of Java and C# seems to be a consequence of how the tags are written |
| 615775861 | https://github.com/sailuh/kaiaulu/issues/2#issuecomment-615775861 | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | 2020-04-18 | 2020-04-18T08:44:16Z | carlosparadis | MEMBER | # Code logic for Exuberant Ctags In the original Mitchel paper, it is noted the following: <img width="479" alt="Screen Shot 2020-04-17 at 10 35 39 PM" src="https://user-images.githubusercontent.com/17270563/79632562-d95b3000-80fb-11ea-927e-ceb865d844fb.png"> <img width="468" alt="Screen Shot 2020-04-17 at 10 35 48 PM" src="https://user-images.githubusercontent.com/17270563/79632564-dc562080-80fb-11ea-9a6d-cc946297fa0a.png"> That is all the explanation that Codeface paper will provide on how functions are extracted from a git log. Which is not sufficient to reproduce. Looking for "ctags" on codeface repo, the reference only occurs in 4 spots: https://github.com/siemens/codeface/search?q=import+ctags&unscoped_q=import+ctags # Location where ctags occur From the link above, I verified codeface/utils.py only contains minimal code to check that ctags was installed properly. cluster_py appears to use the code from VCS.py and the file name is not very suggestive either of that responsibility. ## VCS.py Ctags This test gives a minimal example of the parser in action leveraging python_ctags library: https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/VCS.py#L1548-L1570 |
| 615784946 | https://github.com/sailuh/kaiaulu/issues/2#issuecomment-615784946 | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | 2020-04-18 | 2020-04-18T08:51:00Z | rnkazman | NONE | In any case, we need to extract file-based info because everything we do and every metric we collect is file-based. On Fri, Apr 17, 2020 at 10:44 PM Carlos Paradis <notifications@github.com> wrote: > Code logic for Exuberant Ctags > > In the original Mitchel paper, it is noted the following: > > [image: Screen Shot 2020-04-17 at 10 35 39 PM] > <https://user-images.githubusercontent.com/17270563/79632562-d95b3000-80fb-11ea-927e-ceb865d844fb.png> > > [image: Screen Shot 2020-04-17 at 10 35 48 PM] > <https://user-images.githubusercontent.com/17270563/79632564-dc562080-80fb-11ea-9a6d-cc946297fa0a.png> > > That is all the explanation that Codeface paper will provide on how > functions are extracted from a git log. Which is not sufficient to > reproduce. Looking for "ctags" on codeface repo, the reference only occurs > in 4 spots: > > > https://github.com/siemens/codeface/search?q=import+ctags&unscoped_q=import+ctags > Location where ctags occur > > From the link above, I verified codeface/utils.py only contains minimal > code to check that ctags was installed properly. cluster_py appears to use > the code from VCS.py and the file name is not very suggestive either of > that responsibility. > VCS.py Ctags > > This test gives a minimal example of the parser in action leveraging > python_ctags library: > > > https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/VCS.py#L1548-L1570 > > — > You are receiving this because you are subscribed to this thread. > Reply to this email directly, view it on GitHub > <https://github.com/sailuh/social_technical_smells/issues/2#issuecomment-615775861>, > or unsubscribe > <https://github.com/notifications/unsubscribe-auth/ACUNTW3LIAUQNSMEFURWNI3RNFR6XANCNFSM4MLHX4PA> > . > |
| 615816471 | https://github.com/sailuh/kaiaulu/issues/2#issuecomment-615816471 | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | 2020-04-18 | 2020-04-18T09:20:45Z | carlosparadis | MEMBER | oops, I forgot you get spammed with notifications on anything I create here. Don't worry about this ;) It's a minor digression of curiosity on the Exuberant Ctags, which I never heard of. Somehow, this is scalable to use at every single commit of a project, and there are rare few things that can scale like that. The reason why Scitools Understand, Depends and Titan are snapshot oriented in regards to SDSM. So I am just curious how they scalated this to every single commit for large project histories. If anything, In the last comment I actually found a small enough example that I can make sense of, using some of the OO definitions from Codeface, which I will likely need to understand to extract Damian's Metrics. So it's not a total time waste. |
| 616135873 | https://github.com/sailuh/kaiaulu/issues/3#issuecomment-616135873 | https://api.github.com/repos/sailuh/kaiaulu/issues/3 | 2020-04-19 | 2020-04-19T13:33:30Z | carlosparadis | MEMBER | # Wolfgang MBOX Downloader BASE_URL=http://mail-archives.apache.org/mod_mbox PROJECT=hbase-dev FROM=2015 TO=2020 for year in cat *.mbox > ${PROJECT}.mbox rm ${PROJECT}_*.mbox 4 Perceval ApproachMbox: https://chaoss.github.io/grimoirelab-tutorial/perceval/mail.html |
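Note that created_at in the table above shows only a date, while updated_at keeps the full ISO 8601 string. Depending on the R version, as.POSIXct() called without a format argument may parse only the date part of such strings. If the time of day matters for your analysis, one option is to state the format explicitly. A minimal sketch, assuming the raw strings look like those in updated_at (the two values below are copied from that column for illustration):

```r
# ISO 8601 strings in the style of the updated_at column
raw_timestamps <- c("2020-04-18T08:48:14Z", "2020-04-19T13:33:30Z")

# The literal "T" and "Z" are matched as-is; tz = "UTC" matches the "Z" suffix
parsed <- as.POSIXct(raw_timestamps, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Show the parsed values including the time of day
parsed_text <- format(parsed, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(parsed_text)
```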
## Obtaining the Output Table
Here, we define the rolling window through the `lag` variable set earlier (30 days, or around 1 month) and pass it to `engagement_communication()`. If `lag` is omitted, the function falls back to its default of 90 days (around 3 months). To use a different window, pass a different value in the function call. The output table contains the number of messages sent by each developer in the past `lag` days at each point in time, which is one way to measure their engagement with a project. Note that `engagement_communication()` can be used with any type of communication data, such as issue comments or pull request comments.
```r
engagement_communication_output <- engagement_communication(
  timestamp = all_issue_or_pr_comments$created_at,
  author_login = all_issue_or_pr_comments$comment_user_login,
  lag = lag
)
head(engagement_communication_output) %>%
  gt(auto_align = FALSE)
```
| author_login | timestamp | message_count |
|---|---|---|
| CorneJB | 2021-06-05 | 20 |
| CorneJB | 2021-07-05 | 0 |
| CorneJB | 2021-08-04 | 2 |
| CorneJB | 2021-09-03 | 0 |
| Michelle4929 | 2026-02-26 | 19 |
| Michelle4929 | 2026-03-28 | 14 |
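The windowing idea behind this table can be sketched in base R. This is only an illustration under stated assumptions, not the implementation of engagement_communication(): the data are made up, count_windows() is a hypothetical helper, and we assume each author’s windows start at their first message and are labeled by the window start.

```r
# Hypothetical message timestamps for two authors
messages <- data.frame(
  author_login = c("alice", "alice", "alice", "bob", "bob"),
  timestamp = as.POSIXct(c("2024-01-01", "2024-01-10", "2024-03-05",
                           "2024-01-15", "2024-01-20"), tz = "UTC")
)
lag <- 30  # window length in days

# Hypothetical helper: count messages in consecutive lag-day windows
count_windows <- function(ts, lag) {
  start <- min(ts)
  # 0-based index of the window each message falls into
  window_index <- floor(as.numeric(difftime(ts, start, units = "days")) / lag)
  # tabulate() also reports empty windows as 0
  counts <- tabulate(window_index + 1, nbins = max(window_index) + 1)
  data.frame(
    timestamp = start + (seq_along(counts) - 1) * lag * 86400,
    message_count = counts
  )
}

# Apply the helper per author and stack the results
per_author <- lapply(split(messages, messages$author_login), function(d) {
  cbind(author_login = d$author_login[1], count_windows(d$timestamp, lag))
})
window_counts <- do.call(rbind, per_author)
rownames(window_counts) <- NULL
print(window_counts)
```

Windows without any message appear with a count of 0, which matches the zero rows visible in the output table above.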
To visualize the output table as a time series, we can use the
ggplot2 package. Plotting the engagement_communication() metric as a
time series is useful for observing the engagement of an author over
time through the number of messages they send. In conjunction with the
productivity metrics, we can use the engagement metrics to understand
whether an author is becoming more or less engaged over time, with
their sentiment scores reflecting such an increase or decrease in
engagement. As a reminder, the time series below covers the
communication data we obtained from GitHub issue and PR comments
since 2024-04-25, as explained in the previous section.
ggplot(engagement_communication_output, aes(x = timestamp, y = message_count, color = author_login)) +
geom_point(alpha = 0.7) +
geom_line() +
labs(title = paste0("Author Engagement on ", lag, "-day Rolling Window"),
x = "Date",
y = "Number of Messages",
color = "Author"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))
Our first step is to parse the git log. Many of the variables in Kaiaulu are tables, which makes it easier to export and manually inspect the data at any step of the analysis.
git_checkout(git_branch,git_repo_path)
## [1] "Your branch is up to date with 'origin/master'."
project_git <- parse_gitlog(perceval_path,git_repo_path)
#project_git <- parse_gitlog(perceval_path,git_repo_path,save_path)
#project_git <- readRDS(save_path)
We may also want to filter files, for example to include only source code and exclude test files.
project_git <- project_git %>%
filter_by_file_extension(file_extensions,"file_pathname") %>%
filter_by_filepath_substring(substring_filepath,"file_pathname")
We must first convert all timestamps to the same timezone. Here, we use UTC.
project_git$author_datetimetz <- as.POSIXct(
project_git$author_datetimetz,
format = "%a %b %d %H:%M:%S %Y %z",
tz = "UTC"
)
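To see what this conversion does, here is a small sketch on a single made-up timestamp written in the same layout as the format string above (the value itself is hypothetical). The %z offset is used to shift the local time to UTC. Note that %a and %b expect English day and month names, so this assumes an English or C locale.

```r
# Hypothetical git-style timestamp with a -10:00 offset
raw_time <- "Mon Jun 22 15:12:20 2020 -1000"

# Parse with the same format string used above and normalize to UTC
utc_time <- as.POSIXct(
  raw_time,
  format = "%a %b %d %H:%M:%S %Y %z",
  tz = "UTC"
)

# 15:12:20 at UTC-10 corresponds to 01:12:20 on the next day in UTC
utc_text <- format(utc_time, "%Y-%m-%d %H:%M:%S", tz = "UTC")
print(utc_text)
```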
Now, we can obtain the output tables from
productivity_author_commits() and
productivity_author_churn(). We save them to the variables
productivity_author_commits_output and
productivity_author_churn_output, respectively, and pass the lag
variable set earlier (30 days) to define the rolling window. If lag is
omitted, the functions fall back to their default of 90 days (around 3
months). To use a different window, pass a different value in the
function call. For example, for a 120-day (around 4 months) rolling
window, set lag = 120 in the function call.
productivity_author_commits_output <- productivity_author_commits(project_git, lag = lag)
productivity_author_churn_output <- productivity_author_churn(project_git, lag = lag)
We’ll show the head of the output tables for the two functions below.
head(productivity_author_commits_output) %>%
gt(auto_align = FALSE)
| author_name_email | author_datetimetz | author_total_commits |
|---|---|---|
| Carlos Paradis <carlosviansi@gmail.com> | 2020-06-22 15:12:20 | 35 |
| Carlos Paradis <carlosviansi@gmail.com> | 2020-07-22 15:12:20 | 10 |
| Carlos Paradis <carlosviansi@gmail.com> | 2020-08-21 15:12:20 | 11 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-05-18 15:12:20 | 3 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-06-17 15:12:20 | 1 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-07-17 15:12:20 | 5 |
head(productivity_author_churn_output) %>%
gt(auto_align = FALSE)
| author_name_email | author_datetimetz | lines_added | lines_removed | author_churn |
|---|---|---|---|---|
| Carlos Paradis <carlosviansi@gmail.com> | 2020-06-22 15:12:20 | 1027 | 152 | 1179 |
| Carlos Paradis <carlosviansi@gmail.com> | 2020-07-22 15:12:20 | 570 | 300 | 870 |
| Carlos Paradis <carlosviansi@gmail.com> | 2020-08-21 15:12:20 | 1043 | 294 | 1337 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-05-18 15:12:20 | 320 | 155 | 475 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-06-17 15:12:20 | 1 | 1 | 2 |
| Carlos Paradis <carlosviansi@gmail.com> | 2021-07-17 15:12:20 | 157 | 22 | 179 |
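In the table above, author_churn equals lines_added plus lines_removed (for example, 1027 + 152 = 1179). A minimal sketch of that arithmetic on made-up per-file records follows. It is not Kaiaulu’s implementation: the data and the to_count() helper are hypothetical, and we assume binary files are recorded as "-", as git does in its numstat output.

```r
# Hypothetical per-file change records; "-" marks a binary file
changes <- data.frame(
  author = c("alice", "alice", "bob"),
  lines_added = c("10", "-", "3"),
  lines_removed = c("2", "-", "7"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: convert to integer, treating non-numeric entries as 0
to_count <- function(x) {
  n <- suppressWarnings(as.integer(x))
  n[is.na(n)] <- 0L
  n
}

# Churn per record is added plus removed lines
changes$churn <- to_count(changes$lines_added) + to_count(changes$lines_removed)

# Total churn per author
churn_by_author <- aggregate(churn ~ author, data = changes, FUN = sum)
print(churn_by_author)
```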
To visualize the output tables as a time series, we can use the
ggplot2 package. Plotting the productivity metrics as a time series is
useful for observing the activity of an author over time through the
number of commits they have pushed and their churn. In conjunction
with the engagement metrics, we can use the productivity metrics to
understand whether an author is becoming more or less engaged over
time, with their sentiment scores reflecting such an increase or
decrease in engagement.
We can compare the time series of an author for all of the metrics across the rolling window. To do this, we will first select one author. Then, we will plot the metrics on the same graph to compare their trends over time.
one_author <- "Carlos Paradis <carlosviansi@gmail.com>"
one_author_login <- "carlosparadis"
# Filter to one author for engagement communication
engagement_dt <- engagement_communication_output[
author_login == one_author_login,
.(author_datetimetz = timestamp, author_total_messages = message_count)
][order(author_datetimetz)]
# Filter to one author for commits
commits_dt <- productivity_author_commits_output[
author_name_email == one_author,
.(author_datetimetz, author_total_commits)
][order(author_datetimetz)]
# Filter to one author for churn
churn_dt <- productivity_author_churn_output[
author_name_email == one_author,
.(author_datetimetz, author_churn)
][order(author_datetimetz)]
# Join on timestamp so all metrics share the same x-axis
plot_dt <- merge(commits_dt, churn_dt, by = "author_datetimetz", all = TRUE)
plot_dt <- merge(plot_dt, engagement_dt, by = "author_datetimetz", all = TRUE)
# Make long table with panel grouping: (Commits + Communication) vs Churn
plot_long <- rbindlist(list(
plot_dt[, .(author_datetimetz, metric = "Communication - Messages", value = author_total_messages, panel = "Commits + Communication")],
plot_dt[, .(author_datetimetz, metric = "Commits", value = author_total_commits, panel = "Commits + Communication")],
plot_dt[, .(author_datetimetz, metric = "Churn", value = author_churn, panel = "Churn")]
))
ggplot(plot_long, aes(x = author_datetimetz, y = value, color = metric)) +
geom_point(alpha = 0.7) +
geom_line() +
facet_wrap(~ panel, ncol = 1, scales = "free_y") +
scale_x_datetime(date_breaks = "3 months", date_labels = "%b %Y") +
labs(
title = paste0("Rolling Window Metrics on ", lag, "-day Time Window: ", one_author),
x = "Date",
y = "Count",
color = "Metric"
) +
theme_minimal() +
theme(
plot.title = element_text(hjust = 0.5),
strip.text = element_text(face = "bold", size = 12),
panel.spacing = unit(2, "lines"),
axis.text.x = element_text(angle = 45, hjust = 1)
)
## Warning: Removed 162 rows containing missing values (`geom_point()`).
## Warning: Removed 66 rows containing missing values (`geom_line()`).
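The warnings above are likely a side effect of the outer joins (all = TRUE): whenever a timestamp exists for one metric but not for another, the merged table holds NA for the missing metric, and ggplot2 drops those rows. A minimal sketch of the effect on made-up data, with hypothetical column names, and one possible way to drop the NA rows before plotting:

```r
# Hypothetical series whose timestamps only partly overlap
commits <- data.frame(t = as.Date(c("2024-01-01", "2024-01-31")),
                      commits = c(5L, 2L))
messages <- data.frame(t = as.Date(c("2024-01-01", "2024-02-10")),
                       messages = c(7L, 1L))

# Outer join: unmatched timestamps get NA in the other metric
wide <- merge(commits, messages, by = "t", all = TRUE)

# Long format, one row per timestamp and metric
long <- rbind(
  data.frame(t = wide$t, metric = "Commits", value = wide$commits),
  data.frame(t = wide$t, metric = "Messages", value = wide$messages)
)

# Dropping the NA rows up front avoids the ggplot2 warnings
long_complete <- long[!is.na(long$value), ]
print(long_complete)
```

Whether to drop these rows or to align the metrics on a shared set of window boundaries first is a design choice that depends on how you want to compare the series.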