This notebook showcases how to use Kaiaulu’s OpenHub API interface functions to facilitate selecting open-source projects for studies. These functions return data tables that contain project information such as total contributor count, total lines of code, percentage breakdown of code languages, number of contributors who made at least one commit in the past year, the number of commits in the past year, etc. The current search criteria are the organization to which the projects belong and the primary programming language used by the projects.
This notebook differs from its peers in that it is concerned not with a single project but with project selection, so it needs neither the project configuration file architecture nor the project initialization.
This notebook explains how to acquire information on a set of projects that reside in OpenHub’s open-source project collection, based on search parameters under an organization, using the Ohloh API. The information includes LOC on the current date, number of contributors who made at least one commit in the past 12 months, number of commits in the past 12 months, total commit count on the current date, and total number of contributors on the current date.
Kaiaulu’s interface to Ohloh’s API, an API for OpenHub’s open-source project collection, relies on httr to create HTTP GET requests. The Ohloh API responds to these requests by returning an XML response file with nested tags.
Kaiaulu only defines a few API endpoints of interest (organization, portfolio_projects, project, and analysis) where the tool is currently used, and parses the returned XML output into a table keeping only fields of interest. More endpoints and/or fields of interest per endpoint can be added in the future.
OpenHub limits the number of API calls per token per day (the maximum is 1000). The current rate limit for your personal API token can be found on the OpenHub website, and an account is required to acquire a token. The process of creating an account and registering for an API key is documented in the API section of the Ohloh API GitHub page and should not take more than two minutes.
The functions in Kaiaulu assume you have an OpenHub API token available, which can be passed as a parameter.
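As a rough planning aid, the daily limit can be compared against the number of requests a run is likely to make. The sketch below assumes one organization request, one request per portfolio page, and one project request plus one analysis request per selected project, which approximates the steps this notebook walks through. The helper name estimate_api_calls and the example counts are hypothetical.

```r
# Hypothetical helper: rough estimate of the API calls made by one run.
estimate_api_calls <- function(n_portfolio_pages, n_projects) {
  organization_calls <- 1               # one organization request
  portfolio_calls <- n_portfolio_pages  # one request per portfolio page
  project_calls <- n_projects           # one project request per project
  analysis_calls <- n_projects          # one analysis request per project
  organization_calls + portfolio_calls + project_calls + analysis_calls
}

daily_limit <- 1000  # maximum stated above; your own token's limit may differ
calls_needed <- estimate_api_calls(n_portfolio_pages = 16, n_projects = 200)
calls_needed <= daily_limit
```

If the estimate exceeds the limit, the run can be split across days or restricted with a stricter filter before the per-project requests.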
Please ensure the following R packages are installed on your computer.
rm(list = ls())
require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(httr)
require(gt)
To start the project selection process, we may want to restrict ourselves to projects under a specific organization. The list of organizations is found on the OpenHub website. For this notebook, we will focus on searching for projects whose primary language is Java under the “Apache Software Foundation”.
Below is a set of required variables for the openhub_* functions.
study_name <- "placeholder"
html_url_or_name <- "https://openhub.net/orgs/apache"
language <- "java"
token <- "dI_r4aalY6_6466CehmV4ErcY2ZEAkPOw-dY8cZZFnQ"
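Explanation: study_name names the folder under which the API responses for this study are stored, html_url_or_name identifies the organization (by its OpenHub URL or its name), language is the primary programming language used to filter the projects, and token is your OpenHub API token.
The following hardcoded block creates the folders that store the API responses from each endpoint: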
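Rather than writing the token into the notebook, it can be read from the environment at run time. This is a minimal sketch, assuming the token was stored beforehand in an environment variable named OPENHUB_TOKEN (a hypothetical name, not something Kaiaulu requires). The variable token_from_env is also illustrative.

```r
# Assumption: OPENHUB_TOKEN is a hypothetical environment variable that you
# set yourself, for example in your ~/.Renviron file.
token_from_env <- Sys.getenv("OPENHUB_TOKEN", unset = NA)

if (is.na(token_from_env) || !nzchar(token_from_env)) {
  # No token found: warn early instead of failing inside an API call.
  message("OPENHUB_TOKEN is not set; provide a token before calling the API.")
}
```

The value can then be passed wherever the functions in this notebook expect token.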
# 1. Anchor to your current working directory
github_folder <- getwd()
# 2. Define folder paths
rawdata_folder <- file.path(github_folder, "rawdata")
openhub_folder <- file.path(rawdata_folder, "openhub")
study_folder <- file.path(openhub_folder, study_name)
organization_folder_path <- file.path(study_folder, "organization")
portfolio_project_folder_path <- file.path(study_folder, "portfolio")
project_folder_path <- file.path(study_folder, "project")
analysis_folder_path <- file.path(study_folder, "analysis")
# 3. Create all folders safely
dir.create(organization_folder_path, recursive = TRUE, showWarnings = FALSE)
dir.create(portfolio_project_folder_path, recursive = TRUE, showWarnings = FALSE)
dir.create(project_folder_path, recursive = TRUE, showWarnings = FALSE)
dir.create(analysis_folder_path, recursive = TRUE, showWarnings = FALSE)
In this section, for each endpoint, we collect the data through a series of Ohloh API requests and parse the API responses with the corresponding parser function. The parsed API responses are data tables, which are displayed in each subsection. Values from one endpoint may be used to build the path to the next endpoint, and merging the data tables gives a holistic view of the data for the list of projects.
We call openhub_api_organizations to acquire the
organization API response of our selected organization specified by
html_url_or_name. This function downloads the API response
file into the folder specified by
organization_folder_path.
openhub_api_organizations(token, organization_folder_path, html_url_or_name)
With the organization API response (only one page), we may parse this
response with its corresponding parser function,
openhub_parse_organizations, to acquire a data table with
columns representing the tags for the organization listed:
openhub_organization_api_requests <- openhub_retrieve(organization_folder_path)
openhub_organizations <- openhub_parse_organizations(openhub_organization_api_requests)
head(openhub_organizations) %>%
gt()
| Table has no data |
We then select the first organization’s “html_url” column value (e.g. “https://openhub.net/orgs/apache”) and extract the short, unique handle for the organization (e.g. “apache”). Here the handle is assigned directly.
org_name <- "apache"
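The handle can also be derived from the URL instead of being typed by hand. The sketch below uses base R on the example URL from the text; the variable names html_url and org_handle are illustrative.

```r
# Example value of the "html_url" column, taken from the text above.
html_url <- "https://openhub.net/orgs/apache"

# basename() returns the last path component of the URL.
org_handle <- basename(html_url)
org_handle
```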
Following a similar process as the Organization section, we acquire
the portfolio projects of the organization, “Apache Software
Foundation”, whose code language matches language, in this case Java,
by downloading the portfolio projects API response files and parsing
these API responses into a data table. This endpoint, portfolio
projects, is iterable by pages, so we may use
openhub_api_iterate_pages on openhub_api_portfolio_projects to acquire
each page, storing these API responses in the folder path
portfolio_project_folder_path. Currently, each page of the
portfolio_projects collection returns a maximum of 20 portfolio
project items. To extract as many matches as possible, remove
max_pages from the openhub_api_iterate_pages call. To extract up to a
given number of pages, set max_pages to the desired value
(e.g. “max_pages=1” returns up to 20 portfolio project items and
“max_pages=16” returns up to 320). If max_pages exceeds the total
number of pages available from the API, all available pages are
extracted.
openhub_api_iterate_pages(token, openhub_api_portfolio_projects, portfolio_project_folder_path, org_name, max_pages=NULL)
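The relationship between max_pages and the number of items returned is simple arithmetic. The helpers below are hypothetical and only restate the page size of 20 mentioned above.

```r
items_per_page <- 20  # current page size stated in the text above

# Upper bound on the number of items returned for a given max_pages.
max_items <- function(max_pages) max_pages * items_per_page

# Number of pages needed to cover a given number of matching items.
pages_needed <- function(n_items) ceiling(n_items / items_per_page)

max_items(1)
max_items(16)
pages_needed(305)
```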
With the portfolio project API responses saved, we may parse these
responses with the corresponding parser function,
openhub_parse_portfolio_projects, to acquire a data table
with columns representing the tags for each portfolio project listed. To
narrow down the list of projects that the next endpoint will require (and
so save API requests), we filter the data table of portfolio projects to
keep only projects that have Java, specified by language,
as their primary code language.
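# NOTE: machine-specific absolute path; point this at the folder that holds
# your own portfolio response files.
portfolio_project_folder_path <- "/home/umar_mazhar/rawdata/openhub/placeholder/"
portfolio_projects_api_requests <- list.files(
  portfolio_project_folder_path,
  pattern = "^portfolioportfolio.*\\.xml$",
  full.names = TRUE
)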
openhub_portfolio_projects <- openhub_parse_portfolio_projects(portfolio_projects_api_requests)
openhub_portfolio_projects <- openhub_portfolio_projects[primary_language == language]
head(openhub_portfolio_projects) %>%
gt()
| name | primary_language | activity |
|---|---|---|
| Apache Tomcat | java | High |
| Apache Ant | java | Moderate |
| Apache Maven 2 | java | Not Available |
| Log4j | java | Moderate |
| Apache Xerces2 J | java | Not Available |
| Apache Commons Collections | java | Moderate |
To acquire more specific information about a portfolio project, we
need to access it in the project collection. The link between the
portfolio_projects endpoint and the project endpoint is the “name” tag
(e.g. “Apache Tomcat”); project names on OpenHub are unique. As
before, we acquire the project API responses and parse them with the
corresponding parser function. We loop through each value of the
“name” column in the portfolio projects’ data table
openhub_portfolio_projects. For each project name, project_name, we
gather its API responses by using openhub_api_iterate_pages on
openhub_api_projects, storing these API response files in the folder
path project_folder_path, and setting max_pages to 1. With the API’s
collection request query command, the first page returned will contain
a project with a matching “name” tag, so there is no need to spend API
calls searching the other pages. A single project API response page
may still contain multiple project items. This is due to a limitation
of OpenHub’s API query command, which performs a kind of “ctrl+F”
search across all tags instead of allowing users to query a specific
tag for an exact match.
# seq_len() avoids iterating over 1:0 when the table has no rows
for (i in seq_len(nrow(openhub_portfolio_projects))) {
  project_name <- openhub_portfolio_projects[["name"]][[i]]
  # Fetch the first page of search results for this project name
  openhub_api_iterate_pages(token, openhub_api_projects, project_folder_path, project_name, max_pages=1)
  # Pause for 2 seconds to avoid rate-limiting
  Sys.sleep(2)
}
With the project API responses saved, we may parse these responses
with the corresponding parser function,
openhub_parse_projects, to acquire two data tables with
columns representing the tags for each project listed. The first data
table, project_data, has the following columns:
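# NOTE: machine-specific absolute path; point this at the folder that holds
# your own project response files.
project_folder_path <- "/home/umar_mazhar/rawdata/openhub/placeholder/"
projects_api_requests <- list.files(
  project_folder_path,
  pattern = "^projectproject.*\\.xml$",
  full.names = TRUE
)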
openhub_projects <- openhub_parse_projects(projects_api_requests)
project_data <- openhub_projects[[1]]
project_links <- openhub_projects[[2]]
head(project_data) %>% gt()
| name | id | html_url |
|---|---|---|
| Apache Tomcat | 3562 | https://openhub.net/p/tomcat |
| Apache Tomcat Resources Manager Module | 130685 | https://openhub.net/p/tomcat-res-mgr |
| psi-probe | 328248 | https://openhub.net/p/psi-probe |
| Apache TomEE | 632761 | https://openhub.net/p/tomee |
| puppetlabs-tomcat | 532962 | https://openhub.net/p/puppetlabs-tomcat |
| JBoss Application Server | 480 | https://openhub.net/p/jboss |
The second data table, project_links, has the following columns for
each project listed (if a project doesn’t have any external links listed
in its database, it will not appear in this data table):
project_links <- openhub_projects[[2]]
head(project_links) %>%
gt()
| name | id | project_link_title | project_link_category | project_link_url |
|---|---|---|---|---|
| Apache Tomcat | 3562 | Tomcat Mailing Lists | Forums | http://tomcat.apache.org/lists.html |
| Apache Tomcat | 3562 | Tomcat FAQ | Documentation | http://tomcat.apache.org/faq/ |
| Apache Tomcat | 3562 | Tomcat Bug Database / Issue Tracker | Documentation | http://tomcat.apache.org/bugreport.html |
| Apache Tomcat | 3562 | Book: Apache Tomcat Bible | Other | http://www.amazon.com/Apache-Tomcat-Bible-Jon-Eaves/dp/0764526065 |
| Apache Tomcat | 3562 | Book: Tomcat: The Definitive Guide | Other | http://www.amazon.com/Tomcat-Definitive-Guide-Jason-Brittain/dp/0596101066 |
| JBoss Application Server | 480 | JBoss Forums | Forums | http://community.jboss.org/en/jbossas |
We combine the portfolio_projects and project data tables into one
data table, openhub_combined_projects, by performing an
inner-join by “name” column. We add an additional filter, unique by “id”
column, to ensure no duplication due to the projects endpoint possibly
returning duplicated project items from the OpenHub API’s collection
request query command.
openhub_combined_projects <- unique(merge(project_data, openhub_portfolio_projects, by = "name", all = FALSE, allow.cartesian = TRUE), by = "id")
head(openhub_combined_projects) %>%
gt()
| name | id | html_url | primary_language | activity |
|---|---|---|---|---|
| Agila | 13636 | https://openhub.net/p/agila | java | Not Available |
| AntLib DotNet | 3609 | https://openhub.net/p/p_3609 | java | Very Low |
| AntLib SVN | 3610 | https://openhub.net/p/p_3610 | java | Inactive |
| Apache ACE | 347981 | https://openhub.net/p/apache_ace | java | Not Available |
| Apache Abdera | 4718 | https://openhub.net/p/p_d_4718 | java | Inactive |
| Apache Accumulo | 588272 | https://openhub.net/p/accumulo | java | High |
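The effect of the inner join followed by the uniqueness filter can be seen on a small example. The tables below are toy stand-ins for the parsed data; the id given to “Apache Ant” is made up for illustration, and only the data.table calls mirror the notebook.

```r
library(data.table)

# Toy stand-in for the filtered portfolio projects table.
portfolio <- data.table(
  name = c("Apache Tomcat", "Apache Ant"),
  primary_language = c("java", "java")
)

# Toy stand-in for the parsed project table: the search for "Apache Tomcat"
# also returned "Apache TomEE", and "Apache Ant" was returned twice.
projects <- data.table(
  name = c("Apache Tomcat", "Apache TomEE", "Apache Ant", "Apache Ant"),
  id = c("3562", "632761", "100", "100")  # "100" is an illustrative id
)

# The inner join keeps only names present in both tables.
joined <- merge(projects, portfolio, by = "name", all = FALSE, allow.cartesian = TRUE)

# unique(by = "id") then drops repeated project ids.
combined <- unique(joined, by = "id")
combined
```

The extra search hit “Apache TomEE” is removed by the join because it is not in the portfolio table, and the repeated “Apache Ant” row is removed by the uniqueness filter.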
The previously acquired “id” tag (represented as a column) for each
project allows us to acquire the latest analysis collection for a
project, which contains a multitude of important metrics. Following the
same logic as the Projects section, we loop through each project in
openhub_combined_projects and call openhub_api_analyses to acquire the
analysis API response of the project specified by project_id. This
function downloads the API response file into the folder specified by
analysis_folder_path. In addition, we pass the name of the project to
openhub_api_analyses for use in the file name of the stored API
response file. The timestamp attached to the file name of each
analysis API response file is NOT the time at which the API response
was requested. It is the timestamp in OpenHub’s database for when the
analysis collection was last calculated for that specific project.
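# seq_len() avoids iterating over 1:0 when the table has no rows
for (i in seq_len(nrow(openhub_combined_projects))) {
  project_name <- openhub_combined_projects[["name"]][[i]]
  project_id <- openhub_combined_projects[["id"]][[i]]
  # Download the latest analysis for this project
  openhub_api_analyses(token, analysis_folder_path, project_id, project_name)
  # Pause for 2 seconds to avoid rate-limiting
  Sys.sleep(2)
}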
With the analysis API response (only one page), we may parse this
response with its corresponding parser function,
openhub_parse_analyses, to acquire a data table with
columns representing the tags for each analysis listed.
(openhub_parse_analyses also reads each code language and uses stringi
to combine the percentages of code languages for the project.)
# NOTE: machine-specific absolute path; point this at the folder that holds
# your own analysis response files.
analyses_api_requests <- list.files(
  "/home/umar_mazhar/rawdata/openhub/placeholder/",
  pattern = "^analysis.*\\.xml$",
  full.names = TRUE
)
openhub_analyses <- openhub_parse_analyses(analyses_api_requests)
head(openhub_analyses) %>%
gt()
| id | min_month | twelve_month_contributor_count | total_contributor_count | twelve_month_commit_count | total_commit_count | total_code_lines | code_languages |
|---|---|---|---|---|---|---|---|
| 13636 | 2004-10-01 | 0 | 3 | 0 | 65 | 46754 | 73% Java, 15% XML, 6% XML Schema, 6% 5 Other |
| 3609 | 2005-04-01 | 2 | 4 | 3 | 88 | 7487 | 30% HTML, 52% Java, 17% XML, 1% 2 Other |
| 3610 | 2005-04-01 | 0 | 8 | 0 | 45 | 2600 | 66% Java, 17% HTML, 11% XML, 6% XSL Transformation |
| 4718 | 2008-12-01 | 0 | 12 | 0 | 181 | 74058 | 80% Java, 19% XML, 1% 5 Other |
| 588272 | 2011-10-01 | 17 | 215 | 807 | 14940 | 525579 | 91% Java, 5% JavaScript, 4% 7 Other |
| 347981 | 2009-05-01 | 0 | 25 | 0 | 2352 | 155233 | 82% Java, 11% XML, 5% CSS, 2% 2 Other |
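To illustrate how a code_languages string like the ones above can be assembled, here is a minimal sketch using stringi. It is not the parser’s actual implementation; the input vectors are illustrative and reuse values from the first row of the table.

```r
library(stringi)

# Illustrative inputs: language names and their percentage shares.
languages <- c("Java", "XML", "XML Schema")
percentages <- c(73, 15, 6)

# Build "<pct>% <language>" pieces and collapse them into one string.
code_languages <- stri_join(percentages, "% ", languages, collapse = ", ")
code_languages
```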
Lastly, we combine the combined portfolio_projects and project data
table, openhub_combined_projects, with the analysis data
table, openhub_analyses, into one data table,
openhub_combined_data, by performing an inner-join by “id”
column.
openhub_combined_data <- merge(openhub_combined_projects, openhub_analyses, by = "id", all = FALSE)
head(openhub_combined_data) %>%
gt()
| id | name | html_url | primary_language | activity | min_month | twelve_month_contributor_count | total_contributor_count | twelve_month_commit_count | total_commit_count | total_code_lines | code_languages |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 10483 | Apache FtpServer | https://openhub.net/p/ftpserver | java | Not Available | 2003-03-01 | 1 | 29 | 1 | 1521 | 45466 | 93% Java, 7% 6 Other |
| 10942 | Apache Shindig | https://openhub.net/p/shindig | java | Inactive | 2007-12-01 | 0 | 74 | 0 | 6788 | 348487 | 57% Java, 35% JavaScript, 6% XML, 2% 8 Other |
| 11082 | Tiles | https://openhub.net/p/tiles | java | Not Available | 2007-01-01 | 0 | 9 | 0 | 785 | 32314 | 60% Java, 26% HTML, 14% XML, <1% 2 Other |
| 11242 | Apache Click | https://openhub.net/p/apache-click | java | Not Available | 2004-12-01 | 0 | 18 | 0 | 5749 | 139054 | 72% Java, 11% JavaScript, 9% HTML, 8% 3 Other |
| 11590 | Apache XML-RPC | https://openhub.net/p/p_11590 | java | Not Available | 2014-01-01 | 0 | 5 | 0 | 1 | 11168 | 91% Java, 9% XML |
| 12001 | Apache Avalon | https://openhub.net/p/p_12001 | java | Inactive | 2000-11-01 | 0 | 41 | 0 | 9651 | 156787 | 60% Java, 27% XML, 9% C#, 4% 10 Other |
If you’re interested in verifying whether a project labels its commits with issue IDs and whether it has distinct issue types (e.g. “bug”, “feature”, “security bug”, “refactoring”, etc.), which is outside the scope of this notebook, please review this short explanation:
Checking the issue IDs requires you to parse the project’s git log. Then you can use this function on the resulting table. See this notebook for example usage. That notebook uses the regex written in the project configuration file. You will need to work out manually from the git log whether issue IDs are present and what form they take, specify the matching regex in the Kaiaulu configuration file, and then have Kaiaulu calculate the metric. There is no way to automate this step, since the conventions vary across projects, if they are used at all.
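The manual check described above can be sketched with stringi on a handful of commit messages. The messages and the JIRA-style pattern below are hypothetical; a real project may use a different convention or none at all, and this is not the Kaiaulu function referred to above.

```r
library(stringi)

# Hypothetical commit messages, as they might appear in a parsed git log.
commit_messages <- c(
  "PROJ-123 Fix connector timeout",
  "Update documentation",
  "Refactor parser (see PROJ-456)"
)

# Hypothetical JIRA-style pattern: uppercase project key, dash, number.
issue_id_regex <- "[A-Z]+-[0-9]+"

# Which messages mention an issue ID, and which ID is mentioned first.
has_issue_id <- stri_detect_regex(commit_messages, issue_id_regex)
issue_ids <- stri_extract_first_regex(commit_messages, issue_id_regex)

# Share of commits that reference an issue ID.
mean(has_issue_id)
```

Once a pattern like this matches a reasonable share of the log, it can be written into the project configuration file.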
# Load the Excel export tool (run install.packages("writexl") first if it is not installed)
library(writexl)
checkpoint_data <- openhub_combined_data[, c(
"name", "total_code_lines", "min_month",
"total_commit_count", "total_contributor_count",
"twelve_month_commit_count", "twelve_month_contributor_count",
"activity", "html_url"
)]
# Convert numeric columns from text to numbers
checkpoint_data$total_code_lines <- as.numeric(checkpoint_data$total_code_lines)
checkpoint_data$total_commit_count <- as.numeric(checkpoint_data$total_commit_count)
checkpoint_data$total_contributor_count <- as.numeric(checkpoint_data$total_contributor_count)
checkpoint_data$twelve_month_commit_count <- as.numeric(checkpoint_data$twelve_month_commit_count)
checkpoint_data$twelve_month_contributor_count <- as.numeric(checkpoint_data$twelve_month_contributor_count)
# min_month is already a full date string (e.g. "2004-10-01"), so no day suffix is needed
checkpoint_data$min_month <- as.Date(checkpoint_data$min_month)
checkpoint_data$kloc <- round(checkpoint_data$total_code_lines / 1000, 1)
checkpoint_data$age_years <- round(as.numeric(Sys.Date() - checkpoint_data$min_month) / 365.25, 1)
checkpoint_data$jira_confirmed <- "Manual check needed"
five_years_ago <- Sys.Date() - (5 * 365.25)
filtered_data <- subset(checkpoint_data,
total_code_lines >= 250000 &
min_month <= five_years_ago &
total_commit_count >= 5000
)
write_xlsx(filtered_data, "apache_strict_criteria_candidates.xlsx")
print(paste("Projects found:", nrow(filtered_data)))
## [1] "Projects found: 68"