This notebook demonstrates the full workflow for conducting sentiment analysis on developer communication. Sentiment analysis classifies developer comments as positive, negative, or neutral in order to detect their emotional tone, which gives insight into developer collaboration and how sentiment may influence project progress. The notebook applies this approach to sources such as GitHub issues, pull requests, comments, commits, JIRA discussions, and mailing lists. It guides the user through preparing data, training a sentiment-classification model, and generating predictions on new datasets to study collaboration patterns and developer interactions.
Please ensure the following R packages are installed on your computer.
rm(list = ls())
seed <- 1
set.seed(seed)
require(kaiaulu)
## Loading required package: kaiaulu
require(data.table)
## Loading required package: data.table
require(jsonlite)
## Loading required package: jsonlite
require(magrittr)
## Loading required package: magrittr
require(gt)
## Loading required package: gt
require(stringi)
## Loading required package: stringi
require(markdown)
## Loading required package: markdown
require(XML)
## Loading required package: XML
require(ggplot2)
## Loading required package: ggplot2
# Paths to Sentiment Classifier -- see sailuh/sentiment_classifier
tool <- parse_config("../tools.yml")
pysenti_path <- get_pysenti_path(tool)
pysenti_conda_env_name <- get_pysenti_conda_env_name(tool)
conda_path <- get_conda_path(tool)
# GitHub Dataset for Prediction
conf <- parse_config("../conf/kaiaulu.yml")
authors <- get_filter_by_reply_author_substring(conf)
subjects <- get_filter_by_reply_subject_substring(conf)
body <- get_filter_by_reply_body_substring(conf)
token_replacements <- get_replace_token_regex_with(conf)
# Sentiment Classifier Params
training_dataset_filepath <- get_sentiment_training_dataset_filepath(conf)
model_save_path <- get_trained_model_path(conf)
prediction_save_path <- get_prediction_dataset_filepath(conf)
We grab project data from GitHub, JIRA, and mailing lists because
these sources capture the core developer communication, including
issues, pull requests, and comments. Within these sources, the
body field generally contains the actual developer
interactions and discussions within the project. Parsing this data into
a unified reply data.table standardizes its structure, enabling
consistent pre-processing and ensuring our sentiment model can
accurately analyze human-generated content.
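As a rough illustration of what "parsing into a unified reply data.table" means, the sketch below maps two toy sources onto a shared set of reply columns and stacks them. The tables, values, and the `to_reply()` helper are hypothetical and exist only for this example; the real parsers appear later in the notebook.

```r
library(data.table)

# Toy stand-ins for two parsed sources (all values are made up)
toy_issues <- data.table(
  issue_id = c("1", "2"),
  created_at = c("2024-01-01T10:00:00Z", "2024-01-02T11:00:00Z"),
  issue_user_login = c("alice", "bob"),
  body = c("Build fails on main.", "Thanks, this works!")
)
toy_comments <- data.table(
  comment_id = "10",
  created_at = "2024-01-03T12:00:00Z",
  comment_user_login = "carol",
  body = "I can reproduce this."
)

# Hypothetical helper: rename source-specific columns to the reply schema
to_reply <- function(dt, id_col, from_col) {
  data.table(
    reply_id = dt[[id_col]],
    reply_datetimetz = dt[["created_at"]],
    reply_from = dt[[from_col]],
    reply_body = dt[["body"]]
  )
}

# Stack both sources into one table with identical columns
toy_replies <- rbindlist(list(
  to_reply(toy_issues, "issue_id", "issue_user_login"),
  to_reply(toy_comments, "comment_id", "comment_user_login")
))
print(toy_replies)
```

Because every source ends up with the same column names, later steps only need to know about `reply_body` and its companions, not about where a message came from.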
To download GitHub data, you need a GitHub API token. For details, see Here. The functions in Kaiaulu assume you have a token available, which is passed as a parameter.
conf <- parse_config("../conf/kaiaulu.yml")
save_path_issue_refresh <- get_github_issue_search_path(conf, "project_key_1")
save_path_pull_request <- get_github_pull_request_path(conf, "project_key_1")
save_path_issue_or_pr_comments <- get_github_issue_or_pr_comment_path(conf, "project_key_1")
# Make sure the above save_path_* folders exist using dir.create()
dir.create(save_path_issue_refresh)
## Warning in dir.create(save_path_issue_refresh):
## '../../rawdata/kaiaulu/github/sailuh_kaiaulu/issue_search' already exists
dir.create(save_path_pull_request)
## Warning in dir.create(save_path_pull_request):
## '../../rawdata/kaiaulu/github/sailuh_kaiaulu/pull_request' already exists
dir.create(save_path_issue_or_pr_comments)
## Warning in dir.create(save_path_issue_or_pr_comments):
## '../../rawdata/kaiaulu/github/sailuh_kaiaulu/issue_or_pr_comment' already
## exists
owner <- get_github_owner(conf, "project_key_1") # Has to match github organization (e.g. github.com/sailuh)
repo <- get_github_repo(conf, "project_key_1") # Has to match github repository (e.g. github.com/sailuh/perceive)
# The text file github_token contains your GitHub API token
token <- scan("~/.ssh/github_token",what="character",quiet=TRUE)
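If the token file is missing or empty, `scan()` either stops with a less obvious error or returns an empty vector. A small wrapper can fail early with a clearer message. The `read_token_file()` helper below is hypothetical (it is not part of Kaiaulu), and the temporary file merely stands in for `~/.ssh/github_token`.

```r
# Hypothetical helper: read a token from a text file and stop with a
# clear message if the file is missing or empty.
read_token_file <- function(path) {
  path <- path.expand(path)
  if (!file.exists(path)) {
    stop("Token file not found: ", path)
  }
  token <- scan(path, what = "character", quiet = TRUE)
  if (length(token) == 0L || !nzchar(token[1])) {
    stop("Token file is empty: ", path)
  }
  token[1]
}

# Usage with a temporary file holding a placeholder value
tmp <- tempfile()
writeLines("example-token-not-real", tmp)
demo_token <- read_token_file(tmp)
```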
Here, we download and parse GitHub issue data by date range
using the /issue_search/ API endpoint. Fetching this endpoint
captures the initial message of each issue in the specified
repository. Once parsed into a structured data table (referred to as
the reply data.table for the rest of the notebook), the data contains
the column ‘reply_body’. This is the column we look for in every
source parsed in this notebook, as it typically holds developer
comments. For more information on API endpoints, see the
download_github_comments.Rmd notebook.
created_lower_bound_issue <- "1990-01-01"
created_upper_bound_issue <- "2021-01-01"
gh_response <- github_api_project_issue_by_date(owner,
repo,
token,
created_lower_bound_issue,
created_upper_bound_issue,
"is:issue",
verbose=TRUE)
github_api_iterate_pages(token,gh_response,
save_path_issue_refresh,
prefix="issue",
verbose=TRUE)
The downloaded data will then be parsed and formatted into a reply data.table.
# Read all JSON files from the directory
all_issue_files <- lapply(list.files(save_path_issue_refresh, full.names = TRUE), read_json)
# Parse each JSON file using the refresh parser
all_issues <- lapply(all_issue_files, github_parse_search_issues_refresh)
# Combine all the data tables
all_issues <- rbindlist(all_issues, fill = TRUE)
all_issues <- all_issues[,.(reply_id=issue_id,
in_reply_to_id=NA_character_,
reply_datetimetz=created_at,
reply_from=issue_user_login,
reply_to=NA_character_,
reply_cc=NA_character_,
reply_subject=issue_number,
reply_body=body)]
head(all_issues,3) %>%
gt(auto_align = FALSE)
| reply_id | in_reply_to_id | reply_datetimetz | reply_from | reply_to | reply_cc | reply_subject | reply_body |
|---|---|---|---|---|---|---|---|
| 3577911474 | NA | 2025-11-01T11:48:43Z | carlosparadis | NA | NA | 351 | For a while, Kaiaulu repo GitHub Actions stopped working altogether. However, this makes it harder to assess contributions. Should check the scripts and try to fix the remaining PRs to go through the checks and tests. |
| 3389397723 | NA | 2025-09-06T05:36:16Z | geraldmjhuff | NA | NA | 350 | In pages 15 and 16 of Haotian's thesis, the following capabilities are listed as present: * Removal of Automatically Generated Content:[46][1] We identified and removed system-generated content, such as system reports and auto-replies. This step was crucial to filter out irrelevant information that does not contribute to understanding the developers’ sentiments. * Elimination of Quoted Original Text:[7] In email replies, original messages are often quoted, leading to redundancy. We employed regular expressions to remove quoted text, focusing only on the new content added by the sender. * Exclusion of Large Code Segments:[7][6] To maintain the focus on communication content, large segments of code were excluded from the dataset. We identified and removed code blockers, ensuring the dataset contained primarily conversational text. The goal of this task is to look through 3 artifacts: - [ ] Haotian's thesis appendix - [ ] The diff repo containing all the changes relative to Kaiaulu (private repo) - [ ] The existing PR by Haotian and code included there To identify these and any other relevant capabilities among these 3 artifacts. We should confirm both that the 3 capabilities above are available in code form, and if anything else is available. The code is to be commited to #347 |
| 3389391289 | NA | 2025-09-06T05:31:27Z | geraldmjhuff | NA | NA | 349 | ## Relation to issue #346 The sentiment analysis notebooks (#346) depend on the Kaiaulu R scripts (#349) because those scripts handle downloading and parsing GitHub issues, PRs, and comments into structured CSVs. Without downloading and parsing steps, the classifier has no text data to analyze. A pipeline also eventually has to be established so that the sentiment analysis takes in those data. ## Project Description The goal of this issue is to review open Pull Requests from prior contributors associated to GitHub API. - [ ] In order to do so, the existing Kaiaulu master branch and GitHub functionality should be used to see if it is in working order. Namely, I should test the Docs GitHub Section and associated Notebook. - [ ] Subsequently, I should test what is pending for the added functionality for GitHub to be merged into the master branch (a code review request for @carlosparadis should be issued once I believe it is ready). - [ ] Evaluate PR #301 which provides an `/exec` interface to Kaiaulu's GitHub. Execs are used to call Kaiaulu functionality via command line. Exec scripts are the primary way to interface to other tools, such as the Kaiaulu React Austin is working on. - [ ] List here other PRs or issues associated to GitHub that are open. |
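The `reply_datetimetz` values shown above are ISO 8601 strings in UTC. If you need them as date-time objects (for example, to compute gaps between replies), a minimal base R sketch is shown below; the two timestamps are copied from the table above.

```r
# Timestamps in the same textual form as reply_datetimetz
ts_chr <- c("2025-11-01T11:48:43Z", "2025-09-06T05:36:16Z")

# Parse as UTC date-times; the trailing "Z" is matched literally
ts <- as.POSIXct(ts_chr, format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC")

# Elapsed time between the two replies, in days
gap_days <- as.numeric(difftime(ts[1], ts[2], units = "days"))
print(gap_days)
```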
Using the /issue_or_pr_comment/ endpoint, we capture
subsequent messages that are after the initial message in both issues
and pull requests.
updated_lower_bound_comment <- "2024-04-25"
gh_response <- github_api_project_issue_or_pr_comments_by_date(owner = owner,
repo = repo,
token = token,
since = updated_lower_bound_comment,
verbose=TRUE)
github_api_iterate_pages(token,gh_response,
save_path_issue_or_pr_comments,
prefix="issue",
verbose=TRUE)
all_issue_or_pr_comments <- lapply(list.files(save_path_issue_or_pr_comments,
full.names = TRUE),read_json)
all_issue_or_pr_comments <- lapply(all_issue_or_pr_comments,
github_parse_project_issue_or_pr_comments)
all_issue_or_pr_comments <- rbindlist(all_issue_or_pr_comments,fill=TRUE)
all_issue_or_pr_comments <- all_issue_or_pr_comments[,.(reply_id=comment_id,
in_reply_to_id=NA_character_,
reply_datetimetz=created_at,
reply_from=comment_user_login,
reply_to=NA_character_,
reply_cc=NA_character_,
reply_subject=issue_url,
reply_body=body)]
head(all_issue_or_pr_comments,3) %>%
gt(auto_align = FALSE)
| reply_id | in_reply_to_id | reply_datetimetz | reply_from | reply_to | reply_cc | reply_subject | reply_body |
|---|---|---|---|---|---|---|---|
| 615692483 | NA | 2020-04-18T07:42:11Z | carlosparadis | NA | NA | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | Quick search for Ctags on Codeface to see where it is used, in hoping to find how functions can be extracted efficiently across an entire git log: https://github.com/siemens/codeface/search?q=ctags&unscoped_q=ctags |
| 615702209 | NA | 2020-04-18T07:49:34Z | carlosparadis | NA | NA | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | Code logic on how Codeface parse Exuberant Ctags to identify functions on source code using the `python ctags` package linked above: https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/VCS.py#L1370-L1421 The limitation of Java and C# seems to be a consequence of how the tags are written |
| 615775861 | NA | 2020-04-18T08:44:16Z | carlosparadis | NA | NA | https://api.github.com/repos/sailuh/kaiaulu/issues/2 | # Code logic for Exuberant Ctags In the original Mitchel paper, it is noted the following: <img width="479" alt="Screen Shot 2020-04-17 at 10 35 39 PM" src="https://user-images.githubusercontent.com/17270563/79632562-d95b3000-80fb-11ea-927e-ceb865d844fb.png"> <img width="468" alt="Screen Shot 2020-04-17 at 10 35 48 PM" src="https://user-images.githubusercontent.com/17270563/79632564-dc562080-80fb-11ea-9a6d-cc946297fa0a.png"> That is all the explanation that Codeface paper will provide on how functions are extracted from a git log. Which is not sufficient to reproduce. Looking for "ctags" on codeface repo, the reference only occurs in 4 spots: https://github.com/siemens/codeface/search?q=import+ctags&unscoped_q=import+ctags # Location where ctags occur From the link above, I verified codeface/utils.py only contains minimal code to check that ctags was installed properly. cluster_py appears to use the code from VCS.py and the file name is not very suggestive either of that responsibility. ## VCS.py Ctags This test gives a minimal example of the parser in action leveraging python_ctags library: https://github.com/siemens/codeface/blob/e6640c931f76e82719982318a5cd6facf1f3df48/codeface/VCS.py#L1548-L1570 |
Using the /pull_request/ endpoint, we capture the
initial message of each pull request in a specified repository.
gh_response <- github_api_project_pull_request_refresh(owner,repo,token, save_path_pull_request)
dir.create(save_path_pull_request)
github_api_iterate_pages(token,gh_response,
save_path_pull_request,
prefix="pull_request",
verbose=TRUE)
all_pr <- lapply(list.files(save_path_pull_request,
full.names = TRUE),read_json)
all_pr <- lapply(all_pr,
github_parse_project_pull_request)
all_pr <- rbindlist(all_pr,fill=TRUE)
all_pr <- all_pr[,.(reply_id=issue_id,
in_reply_to_id=NA_character_,
reply_datetimetz=created_at,
reply_from=issue_user_login,
reply_to=NA_character_,
reply_cc=NA_character_,
reply_subject=issue_number,
reply_body=body)]
head(all_pr, 3) %>%
gt(auto_align = FALSE)
| reply_id | in_reply_to_id | reply_datetimetz | reply_from | reply_to | reply_cc | reply_subject | reply_body |
|---|---|---|---|---|---|---|---|
| 2952847933 | NA | 2025-03-27T12:37:46Z | haotian1028 | NA | NA | 347 | NA |
| 2725636583 | NA | 2024-12-09T00:01:19Z | crepesAlot | NA | NA | 333 | NA |
| 2725564024 | NA | 2024-12-08T22:23:28Z | crepesAlot | NA | NA | 332 | NA |
We then combine all parsed GitHub replies into one unified reply data.table that will eventually be combined with the Mbox and JIRA reply data.tables.
github_replies <- rbindlist(
list(all_issues, all_issue_or_pr_comments, all_pr),
fill = TRUE
)
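Some rows in the combined table can have no body text; the pull request rows printed above, for instance, show `NA` in `reply_body`. Whether the classifier tolerates such rows is not stated here, so the sketch below simply shows one way to keep only rows with non-blank text. It uses a toy table with made-up values rather than `github_replies`.

```r
library(data.table)

# Toy replies, including a missing body and a whitespace-only body
toy_replies <- data.table(
  reply_id = c("1", "2", "3", "4"),
  reply_body = c("Looks good to me.", NA_character_, "   ", "This breaks the build.")
)

# TRUE only for bodies that are present and contain non-whitespace characters
has_text <- !is.na(toy_replies$reply_body) & nzchar(trimws(toy_replies$reply_body))
toy_with_text <- toy_replies[has_text]
print(toy_with_text)
```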
Create directories for saving the trained model and prediction outputs. These directories ensure that the training function can store the best model and the prediction function can save generated results.
dir.create(model_save_path, recursive = TRUE, showWarnings = FALSE)
dir.create(prediction_save_path, recursive = TRUE, showWarnings = FALSE)
We will now train the sentiment model. The train function expects a
data.table with text (the comment body) and
polarity (0 = neutral, 1 =
positive, 2 = negative) columns.
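A quick check of the training table before a long training run can save time. The sketch below builds a toy table in the described shape and validates it; `check_training_dt()` is a hypothetical helper written for this example, not a Kaiaulu function.

```r
library(data.table)

# Hypothetical helper: verify the two expected columns exist and that
# polarity only uses the codes 0, 1, and 2.
check_training_dt <- function(dt) {
  stopifnot(all(c("text", "polarity") %in% names(dt)))
  stopifnot(is.character(dt$text))
  stopifnot(all(dt$polarity %in% c(0, 1, 2)))
  invisible(TRUE)
}

# Toy training rows (made-up text)
toy_train <- data.table(
  text = c("Please see the docs.", "Great fix, thank you!", "This is badly broken."),
  polarity = c(0L, 1L, 2L)  # 0 = neutral, 1 = positive, 2 = negative
)
check_training_dt(toy_train)
```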
Users must specify a model to train, such as “bert-base-cased”. Other
text models, but not all, are available on the Hugging Face
website Here.
For a text model to work, it must be compatible with the
AutoTokenizer and AutoModelForSequenceClassification classes (e.g., for
the bert-base-cased model, BertTokenizer and
BertForSequenceClassification must appear in the model's
documentation). A text model that is not compatible with these
classes will not work, and not every compatible text model
can be used. (More research is necessary.)
The train function fine-tunes the model on the labeled
dataset. During training, the model makes predictions on the same
training data and compares them to the ground-truth labels. It evaluates
its accuracy and keeps track of the best-performing version. The
version with the highest accuracy on this dataset is saved to the
model_save_path folder, in a file named after the model you
specified (e.g., bert-base-cased).
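The "keep the best-performing version" idea can be illustrated with a toy loop. This is only a sketch of the bookkeeping, assuming training proceeds in successive rounds that each yield a set of predictions; it is not the classifier's actual code, and all labels and predictions below are made up.

```r
# Made-up ground-truth labels and predictions from three training rounds
truth <- c(0L, 1L, 2L, 1L, 0L)
round_preds <- list(
  c(0L, 0L, 0L, 0L, 0L),
  c(0L, 1L, 0L, 1L, 0L),
  c(0L, 1L, 2L, 1L, 1L)
)

# Accuracy is the share of predictions that match the labels
accuracy <- function(pred, truth) mean(pred == truth)

best_acc <- -Inf
best_round <- NA_integer_
for (i in seq_along(round_preds)) {
  acc <- accuracy(round_preds[[i]], truth)
  # Only a strictly better round replaces the current best;
  # a real trainer would save the model at this point.
  if (acc > best_acc) {
    best_acc <- acc
    best_round <- i
  }
}
```

Note that accuracy measured on the training data itself tends to be optimistic, so it says little about how the model behaves on unseen comments.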
Training data comes from sailuh/sentiment_github_dataset,
a labeled GitHub dataset of developer comments. The repository provides
several CSV options depending on your goal. Change the
fread() URL below to select the setup you want:
train_df <- fread("https://github.com/user-attachments/files/27577408/combined_all_comments.csv")
fwrite(train_df[1:20],training_dataset_filepath)
pysenti_train_model(
conda_path = conda_path,
pysenti_conda_env_name = pysenti_conda_env_name,
pysenti_path = pysenti_path,
train_data_path = training_dataset_filepath,
model_save_path = model_save_path,
model_name = "bert-base-cased"
)
Once a model is trained, it can be applied to any filtered reply
data.table that has a text column to generate sentiment
predictions. The generated prediction reply data.table is then saved to
our prediction_save_path folder.
We use the github_replies table built earlier in this
notebook. Before calling the predict function, we rename
reply_body to text and add a polarity placeholder column,
which is what the predict function expects.
reply_dt <- data.table::copy(github_replies)
reply_dt <- setnames(reply_dt, "reply_body", "text")
reply_dt[, polarity := 0]
pred_dt <- pysenti_predict(
conda_path = conda_path,
pysenti_conda_env_name = pysenti_conda_env_name,
pysenti_path = pysenti_path,
reply_dt = reply_dt[1:20],
model_save_path = model_save_path,
model_name = "bert-base-cased",
prediction_save_path = prediction_save_path
)
# Save the predictions returned by pysenti_predict (not the input table)
data.table::fwrite(pred_dt, prediction_save_path)
head(pred_dt,3) %>%
gt(auto_align = FALSE)
| reply_id | in_reply_to_id | reply_datetimetz | reply_from | reply_to | reply_cc | reply_subject | text | polarity |
|---|---|---|---|---|---|---|---|---|
| 3577911474 | NA | 2025-11-01 11:48:43 | carlosparadis | NA | NA | 351 | For a while, Kaiaulu repo GitHub Actions stopped working altogether. However, this makes it harder to assess contributions. Should check the scripts and try to fix the remaining PRs to go through the checks and tests. | 1 |
| 3389397723 | NA | 2025-09-06 05:36:16 | geraldmjhuff | NA | NA | 350 | In pages 15 and 16 of Haotian's thesis, the following capabilities are listed as present: * Removal of Automatically Generated Content:[46][1] We identified and removed system-generated content, such as system reports and auto-replies. This step was crucial to filter out irrelevant information that does not contribute to understanding the developers’ sentiments. * Elimination of Quoted Original Text:[7] In email replies, original messages are often quoted, leading to redundancy. We employed regular expressions to remove quoted text, focusing only on the new content added by the sender. * Exclusion of Large Code Segments:[7][6] To maintain the focus on communication content, large segments of code were excluded from the dataset. We identified and removed code blockers, ensuring the dataset contained primarily conversational text. The goal of this task is to look through 3 artifacts: - [ ] Haotian's thesis appendix - [ ] The diff repo containing all the changes relative to Kaiaulu (private repo) - [ ] The existing PR by Haotian and code included there To identify these and any other relevant capabilities among these 3 artifacts. We should confirm both that the 3 capabilities above are available in code form, and if anything else is available. The code is to be commited to #347 | 1 |
| 3389391289 | NA | 2025-09-06 05:31:27 | geraldmjhuff | NA | NA | 349 | ## Relation to issue #346 The sentiment analysis notebooks (#346) depend on the Kaiaulu R scripts (#349) because those scripts handle downloading and parsing GitHub issues, PRs, and comments into structured CSVs. Without downloading and parsing steps, the classifier has no text data to analyze. A pipeline also eventually has to be established so that the sentiment analysis takes in those data. ## Project Description The goal of this issue is to review open Pull Requests from prior contributors associated to GitHub API. - [ ] In order to do so, the existing Kaiaulu master branch and GitHub functionality should be used to see if it is in working order. Namely, I should test the Docs GitHub Section and associated Notebook. - [ ] Subsequently, I should test what is pending for the added functionality for GitHub to be merged into the master branch (a code review request for @carlosparadis should be issued once I believe it is ready). - [ ] Evaluate PR #301 which provides an `/exec` interface to Kaiaulu's GitHub. Execs are used to call Kaiaulu functionality via command line. Exec scripts are the primary way to interface to other tools, such as the Kaiaulu React Austin is working on. - [ ] List here other PRs or issues associated to GitHub that are open. | 0 |
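A simple first look at the predictions is to count how many replies fall into each polarity code. The sketch below uses a toy table with made-up values in place of `pred_dt`.

```r
library(data.table)

# Toy predictions (made-up authors and polarity codes)
toy_pred <- data.table(
  reply_from = c("alice", "bob", "alice", "carol"),
  polarity = c(1L, 0L, 1L, 2L)
)

# Number of replies per polarity code, sorted by code
counts <- toy_pred[, .N, by = polarity][order(polarity)]
print(counts)
```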
We can use engagement_sentiment() to analyze how
developers, in aggregate, feel within consecutive rolling windows. This
function takes polarity labels (positive, negative, neutral), converts
them to numeric values (1, -1, 0) for each message, and aggregates them
over non-overlapping rolling windows. Each polarity value is
associated with a particular message by a developer within a project.
The function implements a true rolling window approach: it creates consecutive non-overlapping windows of the specified size and pools all sentiment values within each window. Importantly, only complete windows are included in the output—any incomplete final window is discarded.
By default, the window size is set to 90 days, and the aggregate function is set to mean, which calculates the average sentiment within that window. The result includes one row per author per complete window, with the timestamp representing the end of that window. This can help us understand how sentiment fluctuates over time and identify periods of high or low engagement sentiment.
You can adjust the window parameters based on your needs. For example, a smaller window size may capture more immediate sentiment changes, while a larger window size may provide a broader view of sentiment trends.
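To make the windowing idea concrete, here is a minimal sketch of non-overlapping window aggregation on toy data. It is not the implementation of `engagement_sentiment()`: it assumes windows are counted from the earliest message across all authors, uses a 30-day window to keep the example small, and all names and values are made up.

```r
library(data.table)

toy <- data.table(
  author = c("alice", "alice", "alice", "bob", "bob"),
  datetimetz = as.POSIXct(c("2024-01-01", "2024-01-20", "2024-03-15",
                            "2024-01-05", "2024-02-25"), tz = "UTC"),
  polarity = c("positive", "negative", "positive", "neutral", "positive")
)

window_days <- 30
score_map <- c(positive = 1, neutral = 0, negative = -1)
toy[, score := unname(score_map[polarity])]

# Assign each message to a consecutive, non-overlapping window
window_origin <- min(toy$datetimetz)
toy[, days_since := as.numeric(difftime(datetimetz, window_origin, units = "days"))]
toy[, window_id := floor(days_since / window_days)]

# The window containing the latest message is treated as incomplete and dropped
last_complete <- floor(max(toy$days_since) / window_days) - 1

windowed <- toy[window_id <= last_complete,
                .(aggregate_polarity = mean(score)),
                by = .(author, window_id)]

# Timestamp each row with the end of its window
windowed[, window_end := window_origin + (window_id + 1) * window_days * 86400]
print(windowed)
```

Swapping `mean` for another function (such as `median` or `sum`) changes how messages inside a window are pooled, which mirrors the role of the aggregate function parameter described above.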
lag <- 90
aggregate_func <- mean
scored_dt <- data.table::copy(reply_dt[1:20])
scored_dt[, polarity := as.integer(pred_dt$polarity)]
# Map numeric codes to label strings, following the encoding stated for
# the training data (0 = neutral, 1 = positive, 2 = negative)
scored_dt[, polarity_label := fcase(
polarity == 0L, "neutral",
polarity == 1L, "positive",
polarity == 2L, "negative",
default = NA_character_
)]
# Alternative: load previously saved predictions from a local CSV (example path).
# A separate variable name (saved_dt) keeps scored_dt from above intact.
saved_dt <- fread("~/Downloads/kaiaulu_predictions.csv")
saved_dt <- saved_dt[,.(created_at,comment_user_login,polarity)]
engagement_sentiment_dt <- engagement_sentiment(
datetimetz = saved_dt$created_at,
user_name_email = saved_dt$comment_user_login,
polarity = saved_dt$polarity,
lag = lag,
aggregate_func = aggregate_func
)
engagement_sentiment_dt <- engagement_sentiment(
datetimetz = scored_dt$reply_datetimetz,
user_name_email = scored_dt$reply_from,
polarity = scored_dt$polarity_label,
lag = lag,
aggregate_func = aggregate_func
)
head(engagement_sentiment_dt,3) %>%
gt(auto_align = FALSE)
| user_name_email | datetimetz | aggregate_polarity |
|---|---|---|
| CorneJB | 2021-08-04 23:11:41 | 0 |
| Ssunoo2 | 2024-04-25 08:06:15 | 0 |
| anthonyjlau | 2024-04-25 08:36:50 | 0 |
This plot visualizes each author’s aggregate sentiment polarity over the engagement_sentiment time window.
ggplot(engagement_sentiment_dt, aes(x = datetimetz, y = aggregate_polarity, group = user_name_email, color = user_name_email)) +
geom_line() + scale_y_continuous(limits = c(-1, 1)) +
labs(title = paste0("Author Engagement on ", lag, "-day Time Window"),
x = "Date",
y = "Aggregate Sentiment",
color = "Author"
) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5))