1 Introduction

1.1 Motivation

This notebook shows how to use Kaiaulu’s OpenHub API interface functions to help select open-source projects for studies. These functions return data tables of project information, such as the total contributor count, total lines of code, percentage breakdown of code languages, number of contributors who made at least one commit in the past year, and number of commits in the past year. Two search criteria are currently supported: the organization to which the projects belong and the primary programming language used by the projects.

Unlike the other notebooks, this one is concerned with selecting projects rather than analyzing a single project, so it uses neither the project configuration file architecture nor the project initialization.

1.2 Purpose

This notebook explains how to use the Ohloh API to acquire information on a set of projects in OpenHub’s open-source project collection, restricted by search parameters to a single organization. The information includes, for example, the lines of code (LOC) and the total commit and contributor counts as of the current date, and the number of commits and of contributors with at least one commit in the past 12 months.

Kaiaulu’s interface to the Ohloh API, the API for OpenHub’s open-source project collection, relies on httr to issue HTTP GET requests. The Ohloh API answers each request with an XML response containing nested tags.
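As a rough illustration of turning nested XML tags into a table, the sketch below parses a small hand-written XML string. The tag layout is a simplified assumption rather than a captured API response, and the sketch uses the xml2 package instead of Kaiaulu’s own parser functions.

```r
library(xml2)

# Hand-written stand-in for an API response (assumed, simplified layout).
response_xml <- '<response>
  <status>success</status>
  <result>
    <org>
      <name>Apache Software Foundation</name>
      <html_url>https://openhub.net/orgs/apache</html_url>
    </org>
  </result>
</response>'

doc <- read_xml(response_xml)

# Walk the nested tags with XPath and keep only the fields of interest.
org_fields <- data.frame(
  name     = xml_text(xml_find_first(doc, "/response/result/org/name")),
  html_url = xml_text(xml_find_first(doc, "/response/result/org/html_url")),
  stringsAsFactors = FALSE
)
print(org_fields)
```

Keeping only a handful of XPath lookups per endpoint mirrors the idea of parsing each response into a table of selected fields.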

Kaiaulu currently defines only the API endpoints it needs (organization, portfolio_projects, project, and analysis) and parses the returned XML into a table that keeps only the fields of interest. More endpoints, or more fields per endpoint, can be added in the future.

1.3 Create a Personal API Token

OpenHub limits the number of API calls per token per day (the maximum is set to 1000). The current rate for your personal API token can be found on the OpenHub website, and an account is required to acquire the token. The process of creating an account and registering for an API key is documented in the API section of the Ohloh API GitHub page and should not take more than two minutes.

The functions in Kaiaulu assume you have an OpenHub API token available, which is passed as a parameter.
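Because of the daily limit, it can help to estimate how many calls a run will need before starting. The sketch below is a rough estimate only: the function name and the example counts are hypothetical, and it assumes one organization call, one call per portfolio page of 20 items, and one project call plus one analysis call per selected project, which is how the later sections of this notebook use the endpoints.

```r
# Rough count of API calls for one pass through the four endpoints.
# page_size and daily_limit default to the values stated in this notebook.
estimate_api_calls <- function(n_portfolio_projects, n_selected_projects,
                               page_size = 20, daily_limit = 1000) {
  organization_calls <- 1                                       # one organization page
  portfolio_calls <- ceiling(n_portfolio_projects / page_size)  # one call per page
  project_calls <- n_selected_projects                          # one page per project name
  analysis_calls <- n_selected_projects                         # one analysis per project id
  total <- organization_calls + portfolio_calls + project_calls + analysis_calls
  list(total = total, within_daily_limit = total <= daily_limit)
}

# Hypothetical counts, chosen only to illustrate the arithmetic.
budget <- estimate_api_calls(n_portfolio_projects = 320, n_selected_projects = 200)
print(budget)
```

With these made-up counts the estimate is 1 + 16 + 200 + 200 = 417 calls, which fits within a single day’s allowance.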

1.4 Libraries

Please ensure the following R packages are installed on your computer.

rm(list = ls())
require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(httr)
require(gt)

2 Configuration Section

2.1 Selecting an OpenHub Organization

To start the project selection process, we may want to restrict ourselves to projects under a specific organization. The list of organizations is available on the OpenHub website. For this notebook, we search for projects under the “Apache Software Foundation” whose primary language is Java.

Below is the set of required variables for the openhub_* functions.

study_name <- "placeholder"
html_url_or_name <- "https://openhub.net/orgs/apache"
language <- "java"
token <- scan("openhub_token", what = "character", quiet = TRUE)  # file holding your API token; adjust the path as needed

Explanation:

  • study_name: The name of the study for project selection that will be used to create a folder system to store API responses.
  • html_url_or_name: Either the URL for the organization page on OpenHub (e.g. “https://openhub.net/orgs/apache”) or the short, unique, handle for the organization (e.g. “apache”).
  • language: The primary code language used to filter the projects (e.g. “java”).
  • token: Your OpenHub API token. Store it in a file named “openhub_token” rather than in the notebook itself.

2.2 API Response Folder Storage

The following chunk hardcodes the folder layout used to store the API responses from each endpoint:

# 1. Anchor to your current working directory
github_folder <- getwd()

# 2. Define folder paths
rawdata_folder <- file.path(github_folder, "rawdata")
openhub_folder <- file.path(rawdata_folder, "openhub")
study_folder   <- file.path(openhub_folder, study_name)

organization_folder_path      <- file.path(study_folder, "organization")
portfolio_project_folder_path <- file.path(study_folder, "portfolio")
project_folder_path           <- file.path(study_folder, "project")
analysis_folder_path          <- file.path(study_folder, "analysis")

# 3. Create all folders safely
dir.create(organization_folder_path,      recursive = TRUE, showWarnings = FALSE)
dir.create(portfolio_project_folder_path, recursive = TRUE, showWarnings = FALSE)
dir.create(project_folder_path,           recursive = TRUE, showWarnings = FALSE)
dir.create(analysis_folder_path,          recursive = TRUE, showWarnings = FALSE)

3 Collecting and Parsing Data via Ohloh API

In this section, for each endpoint, we collect the data through a series of Ohloh API requests and parse the API responses with the corresponding parser function. Each subsection displays the parsed responses as a data table. Values from one endpoint may be needed to build the request for the next endpoint, and merging the data tables gives a single, holistic view of the list of projects.

3.1 Organizations

We call openhub_api_organizations to acquire the organization API response of our selected organization specified by html_url_or_name. This function downloads the API response file into the folder specified by organization_folder_path.

openhub_api_organizations(token, organization_folder_path, html_url_or_name)

With the organization API response (only one page), we may parse this response with its corresponding parser function, openhub_parse_organizations, to acquire a data table with columns representing the tags for the organization listed:

  • name: The name of the organization.
  • html_url_projects: The URL to the organization on OpenHub’s website corresponding to a list of portfolio projects for the organization.
  • portfolio_projects: The number of portfolio projects under the organization.

openhub_organization_api_requests <- openhub_retrieve(organization_folder_path)
openhub_organizations <- openhub_parse_organizations(openhub_organization_api_requests)
head(openhub_organizations) %>%
  gt()
Table has no data

We then take the first organization’s “html_url” column value (e.g. “https://openhub.net/orgs/apache”) and extract the short, unique handle for the organization (e.g. “apache”). In this run the organization table above came back empty, so the handle is assigned directly.

org_name <- "apache"
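When the organization is given as a URL, the handle can also be derived rather than typed by hand. The helper below is a minimal sketch with a hypothetical name, assuming the handle is always the last path segment of the organization URL.

```r
# Accept either a full organization URL or an already-short handle.
extract_org_handle <- function(html_url_or_name) {
  # Drop any trailing slash, then everything up to and including the last "/".
  trimmed <- sub("/+$", "", html_url_or_name)
  sub("^.*/", "", trimmed)
}

handle_from_url  <- extract_org_handle("https://openhub.net/orgs/apache")
handle_from_name <- extract_org_handle("apache")
print(c(handle_from_url, handle_from_name))
```

Both calls yield "apache", so the same variable can feed the later endpoints regardless of which form html_url_or_name takes.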

3.2 Portfolio Projects

As in the Organizations section, we acquire the portfolio projects of the organization, “Apache Software Foundation”, by downloading the portfolio projects API response files and parsing them into a data table; the filter on the code language specified by language (here Java) is applied after parsing. The portfolio_projects endpoint is paginated, so we pass openhub_api_portfolio_projects to openhub_api_iterate_pages to acquire each page, storing the API responses in portfolio_project_folder_path. Currently, each page of the portfolio_projects collection returns at most 20 portfolio projects. To extract every available page, set max_pages to NULL (or omit it). To cap the download, set max_pages to the desired number of pages (e.g. “max_pages=1” returns up to 20 portfolio project items and “max_pages=16” returns up to 320). If max_pages exceeds the number of pages available, all available pages are extracted.

openhub_api_iterate_pages(token, openhub_api_portfolio_projects, portfolio_project_folder_path, org_name, max_pages=NULL)
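The page arithmetic described above can be sketched in a few lines. The function name is hypothetical and is not part of Kaiaulu; it only restates the rule that max_pages is capped by the number of pages actually available at 20 items per page.

```r
# Number of pages that would be requested for a collection of total_items.
pages_to_fetch <- function(total_items, max_pages = NULL, page_size = 20) {
  available_pages <- ceiling(total_items / page_size)
  if (is.null(max_pages)) {
    available_pages                   # no cap: take every page
  } else {
    min(max_pages, available_pages)   # cap, but never beyond what exists
  }
}

all_pages    <- pages_to_fetch(320)                  # 16 pages
single_page  <- pages_to_fetch(320, max_pages = 1)   # 1 page
capped_pages <- pages_to_fetch(45, max_pages = 16)   # only 3 pages exist
print(c(all_pages, single_page, capped_pages))
```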

With the portfolio project API responses saved, we parse them with the corresponding parser function, openhub_parse_portfolio_projects, to acquire a data table with columns representing the tags for each portfolio project, listed below. To narrow down the list of projects that the next endpoint will require (and so save API requests), we keep only the portfolio projects whose primary code language is the one specified by language, here Java.

  • name: The name of the portfolio project.
  • primary_language: The primary code language used by the portfolio project.
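  • activity: The portfolio project’s activity level (Very Low, Low, Moderate, High, and Very High; values such as Inactive and Not Available also appear in the output below).

# The response files for this run sit directly under the study folder.
portfolio_project_folder_path <- study_folder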

portfolio_projects_api_requests <- list.files(
  portfolio_project_folder_path,
  pattern = "^portfolioportfolio.*\\.xml$",
  full.names = TRUE
)

openhub_portfolio_projects <- openhub_parse_portfolio_projects(portfolio_projects_api_requests)
openhub_portfolio_projects <- openhub_portfolio_projects[primary_language == language]
head(openhub_portfolio_projects) %>%
  gt()
name | primary_language | activity
Apache Tomcat | java | High
Apache Ant | java | Moderate
Apache Maven 2 | java | Not Available
Log4j | java | Moderate
Apache Xerces2 J | java | Not Available
Apache Commons Collections | java | Moderate

3.3 Projects

To acquire more specific information about a portfolio project, we need to access it in the project collection. The link between the portfolio_projects endpoint and the project endpoint is the “name” tag (e.g. “Apache Tomcat”); project names on OpenHub are unique. As before, we acquire the project API responses and parse them with the corresponding parser function, this time looping over the “name” column of the portfolio projects data table, openhub_portfolio_projects. For each project name, project_name, we gather its API responses by calling openhub_api_iterate_pages on openhub_api_projects, storing the response files in project_folder_path and setting max_pages to 1. With the API’s collection request query command, the first page returned contains a project with a matching “name” tag, so requesting further pages would only waste API calls.

A single project API response page may still contain multiple project items. This is a limitation of OpenHub’s API query command, which performs a kind of “ctrl+F” search across all tags instead of letting users query one tag for an exact match.

# Download the first page of search results for each portfolio project name
for (i in seq_len(nrow(openhub_portfolio_projects))) {
  project_name <- openhub_portfolio_projects[["name"]][[i]]
  
  # Fetch the data
  openhub_api_iterate_pages(token, openhub_api_projects, project_folder_path, project_name, max_pages=1)
  
  # Pause for 2 seconds to avoid rate-limiting
  Sys.sleep(2) 
}

With the project API responses saved, we parse them with the corresponding parser function, openhub_parse_projects, to acquire two data tables with columns representing the tags for each project listed. The first data table, project_data, has the columns:

  • name: The name of the project.
  • id: The project’s unique ID.
  • html_url: The URL of the project’s details page on OpenHub.

# The response files for this run sit directly under the study folder.
project_folder_path <- study_folder

projects_api_requests <- list.files(
  project_folder_path,
  pattern = "^projectproject.*\\.xml$",
  full.names = TRUE
)

openhub_projects <- openhub_parse_projects(projects_api_requests)
project_data <- openhub_projects[[1]]
project_links <- openhub_projects[[2]]
head(project_data) %>% gt()
name | id | html_url
Apache Tomcat | 3562 | https://openhub.net/p/tomcat
Apache Tomcat Resources Manager Module | 130685 | https://openhub.net/p/tomcat-res-mgr
psi-probe | 328248 | https://openhub.net/p/psi-probe
Apache TomEE | 632761 | https://openhub.net/p/tomee
puppetlabs-tomcat | 532962 | https://openhub.net/p/puppetlabs-tomcat
JBoss Application Server | 480 | https://openhub.net/p/jboss

The second data table, project_links, has the following columns (a project with no external links listed in OpenHub’s database does not appear in this table):

  • name: The name of the project.
  • id: The project’s unique ID.
  • project_link_title: The title description of the specific external link associated with the project.
  • project_link_category: The category description of the specific external link associated with the project.
  • project_link_url: The URL of the specific external link associated with the project.

project_links <- openhub_projects[[2]]

head(project_links) %>%
  gt()
name | id | project_link_title | project_link_category | project_link_url
Apache Tomcat | 3562 | Tomcat Mailing Lists | Forums | http://tomcat.apache.org/lists.html
Apache Tomcat | 3562 | Tomcat FAQ | Documentation | http://tomcat.apache.org/faq/
Apache Tomcat | 3562 | Tomcat Bug Database / Issue Tracker | Documentation | http://tomcat.apache.org/bugreport.html
Apache Tomcat | 3562 | Book: Apache Tomcat Bible | Other | http://www.amazon.com/Apache-Tomcat-Bible-Jon-Eaves/dp/0764526065
Apache Tomcat | 3562 | Book: Tomcat: The Definitive Guide | Other | http://www.amazon.com/Tomcat-Definitive-Guide-Jason-Brittain/dp/0596101066
JBoss Application Server | 480 | JBoss Forums | Forums | http://community.jboss.org/en/jbossas

We combine the portfolio_projects and project data tables into one data table, openhub_combined_projects, by performing an inner join on the “name” column. The inner join also drops search results whose name does not exactly match a portfolio project. We then keep rows that are unique by “id”, because the projects endpoint may return duplicated project items from the OpenHub API’s collection request query command.

openhub_combined_projects <- unique(merge(project_data, openhub_portfolio_projects, by = "name", all = FALSE, allow.cartesian = TRUE), by = "id")
head(openhub_combined_projects) %>%
  gt()
name | id | html_url | primary_language | activity
Agila | 13636 | https://openhub.net/p/agila | java | Not Available
AntLib DotNet | 3609 | https://openhub.net/p/p_3609 | java | Very Low
AntLib SVN | 3610 | https://openhub.net/p/p_3610 | java | Inactive
Apache ACE | 347981 | https://openhub.net/p/apache_ace | java | Not Available
Apache Abdera | 4718 | https://openhub.net/p/p_d_4718 | java | Inactive
Apache Accumulo | 588272 | https://openhub.net/p/accumulo | java | High
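The join-and-deduplicate step can be tried on toy data to see which rows survive. The project names and ids below are made up for illustration; only the merge and unique calls mirror the chunk above.

```r
library(data.table)

# Toy portfolio table: two projects selected by primary language.
portfolio <- data.table(
  name = c("Alpha", "Beta"),
  primary_language = c("java", "java")
)

# Toy search results: a near-match ("Alpha Tools") and a duplicated hit for "Alpha".
search_results <- data.table(
  name = c("Alpha", "Alpha Tools", "Alpha", "Beta"),
  id   = c(1L, 2L, 1L, 3L)
)

# The inner join on "name" drops the near-match; unique by "id" drops the duplicate.
joined   <- merge(search_results, portfolio, by = "name", all = FALSE, allow.cartesian = TRUE)
combined <- unique(joined, by = "id")
print(combined)
```

Only “Alpha” (id 1) and “Beta” (id 3) remain, one row each.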

3.4 Analyses

The previously acquired “id” tag (a column in the data table) for each project lets us acquire the project’s latest analysis collection, which contains many important metrics. Following the same logic as the Projects section, we loop through each project in openhub_combined_projects and call openhub_api_analyses to acquire the analysis API response for the project specified by project_id. This function downloads the API response file into the folder specified by analysis_folder_path. We also pass the name of the project to openhub_api_analyses for use in the name of the stored response file. Note that the timestamp in each analysis response file name is NOT the time the response was requested; it is the timestamp in OpenHub’s database for when the analysis collection was last calculated for that project.

for (i in seq_len(nrow(openhub_combined_projects))) {
  project_name <- openhub_combined_projects[["name"]][[i]]
  project_id <- openhub_combined_projects[["id"]][[i]]
  openhub_api_analyses(token, analysis_folder_path, project_id, project_name)
  Sys.sleep(2)
}
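A long download loop stops at the first error unless each call is guarded. The sketch below shows one way to guard a step with tryCatch; the helper name is hypothetical, and the two stand-in steps replace the real call to openhub_api_analyses so that the example runs without a token or network access.

```r
# Run one download step; report a failure instead of stopping the whole loop.
try_download <- function(download_step, label) {
  tryCatch({
    download_step()
    TRUE
  }, error = function(e) {
    message("Skipping ", label, ": ", conditionMessage(e))
    FALSE
  })
}

# Stand-in steps: the first succeeds, the second raises an error.
ok     <- try_download(function() invisible(NULL), "project A")
failed <- try_download(function() stop("request failed"), "project B")
print(c(ok, failed))
```

In the loop above, the step would wrap the openhub_api_analyses call for one project, and the returned flags could be used to list the projects that need a retry.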

With the analysis API response (only one page), we may parse this response with its corresponding parser function, openhub_parse_analyses, to acquire a data table with columns representing the tags for each analysis listed.

  • id: The project’s unique ID.
  • min_month: OpenHub’s first recorded year and month of the project’s data (typically the date of the project’s first commit, YYYY-MM format).
  • twelve_month_contributor_count: The number of contributors who made at least one commit to the project source code in the past twelve months.
  • total_contributor_count: The total number of contributors who made at least one commit to the project source code since the project’s inception.
  • twelve_month_commit_count: The total number of commits to the project source code in the past twelve months.
  • total_commit_count: The total number of commits to the project source code since the project’s inception.
  • total_code_lines: The most recent total count of all source code lines.
  • code_languages: A percentage breakdown of the languages in the project’s source code. OpenHub decides which languages are substantial and groups the smaller contributors under “Other”. The parser, openhub_parse_analyses, reads each language entry and uses stringi to combine the percentages into one string per project.

analyses_api_requests <- list.files(
  study_folder,  # the response files for this run sit directly under the study folder
  pattern = "^analysis.*\\.xml$",
  full.names = TRUE
)

openhub_analyses <- openhub_parse_analyses(analyses_api_requests)
head(openhub_analyses) %>%
  gt()
id | min_month | twelve_month_contributor_count | total_contributor_count | twelve_month_commit_count | total_commit_count | total_code_lines | code_languages
13636 | 2004-10-01 | 0 | 3 | 0 | 65 | 46754 | 73% Java, 15% XML, 6% XML Schema, 6% 5 Other
3609 | 2005-04-01 | 2 | 4 | 3 | 88 | 7487 | 30% HTML, 52% Java, 17% XML, 1% 2 Other
3610 | 2005-04-01 | 0 | 8 | 0 | 45 | 2600 | 66% Java, 17% HTML, 11% XML, 6% XSL Transformation
4718 | 2008-12-01 | 0 | 12 | 0 | 181 | 74058 | 80% Java, 19% XML, 1% 5 Other
588272 | 2011-10-01 | 17 | 215 | 807 | 14940 | 525579 | 91% Java, 5% JavaScript, 4% 7 Other
347981 | 2009-05-01 | 0 | 25 | 0 | 2352 | 155233 | 82% Java, 11% XML, 5% CSS, 2% 2 Other
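The code_languages strings shown above can be split back into per-language percentages when a numeric filter is needed (for example, “at least 80% Java”). The helper below is a minimal sketch with a hypothetical name; it assumes entries are separated by ", " and that each entry has the form "<percent>% <language>", as in the output above.

```r
library(stringi)

# Return the percentage recorded for one language, or NA if it is not listed.
language_share <- function(code_languages, language) {
  entries <- stri_split_fixed(code_languages, ", ")[[1]]
  # Split each entry once on "% " into a percentage and a language name.
  parts <- stri_split_fixed(entries, "% ", n = 2, simplify = TRUE)
  share <- parts[parts[, 2] == language, 1]
  if (length(share) == 0) return(NA_real_)
  # "<1" is treated as 1, i.e. an upper bound.
  as.numeric(stri_replace_first_fixed(share, "<", ""))
}

breakdown  <- "91% Java, 5% JavaScript, 4% 7 Other"
java_share <- language_share(breakdown, "Java")
js_share   <- language_share(breakdown, "JavaScript")
xml_share  <- language_share(breakdown, "XML")
print(c(java_share, js_share, xml_share))
```

Matching on the whole language name keeps “Java” from also matching “JavaScript”.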

3.5 Combining the Data

Lastly, we merge openhub_combined_projects (the joined portfolio_projects and project data table) with the analysis data table, openhub_analyses, into one data table, openhub_combined_data, by performing an inner join on the “id” column.

openhub_combined_data <- merge(openhub_combined_projects, openhub_analyses, by = "id", all = FALSE)
head(openhub_combined_data) %>%
  gt()
id | name | html_url | primary_language | activity | min_month | twelve_month_contributor_count | total_contributor_count | twelve_month_commit_count | total_commit_count | total_code_lines | code_languages
10483 | Apache FtpServer | https://openhub.net/p/ftpserver | java | Not Available | 2003-03-01 | 1 | 29 | 1 | 1521 | 45466 | 93% Java, 7% 6 Other
10942 | Apache Shindig | https://openhub.net/p/shindig | java | Inactive | 2007-12-01 | 0 | 74 | 0 | 6788 | 348487 | 57% Java, 35% JavaScript, 6% XML, 2% 8 Other
11082 | Tiles | https://openhub.net/p/tiles | java | Not Available | 2007-01-01 | 0 | 9 | 0 | 785 | 32314 | 60% Java, 26% HTML, 14% XML, <1% 2 Other
11242 | Apache Click | https://openhub.net/p/apache-click | java | Not Available | 2004-12-01 | 0 | 18 | 0 | 5749 | 139054 | 72% Java, 11% JavaScript, 9% HTML, 8% 3 Other
11590 | Apache XML-RPC | https://openhub.net/p/p_11590 | java | Not Available | 2014-01-01 | 0 | 5 | 0 | 1 | 11168 | 91% Java, 9% XML
12001 | Apache Avalon | https://openhub.net/p/p_12001 | java | Inactive | 2000-11-01 | 0 | 41 | 0 | 9651 | 156787 | 60% Java, 27% XML, 9% C#, 4% 10 Other

4 Relevant Information

4.1 Commit Issue Coverage

If you’re interested in verifying whether a project labels its commits with issue IDs and whether it has distinct issue types (e.g. “bug”, “feature”, “security bug”, “refactoring”), which is outside the scope of this notebook, please review this short explanation:

Checking for issue IDs requires parsing the project’s git log. You can then use this function on the resulting table; see this notebook for example usage. That notebook applies the regex written in the project configuration file. You will need to inspect the git log manually to determine whether issue IDs are present and which convention they follow, specify the matching regex in the Kaiaulu configuration file, and then have Kaiaulu calculate the metric. This step cannot be automated, because the conventions vary across projects, when they are used at all.
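To make the idea concrete, the sketch below computes the share of commit messages that carry an issue ID. The commit messages and the JIRA-style regex are hypothetical; a real project may use a different convention or none, and this is not the Kaiaulu function referred to above.

```r
library(stringi)

# Hypothetical commit messages, as they might appear in a parsed git log.
commit_messages <- c(
  "PROJ-101 Fix null pointer in parser",
  "Update README",
  "PROJ-87: add retry logic",
  "Merge branch 'main'"
)

# Hypothetical JIRA-style pattern: uppercase project key, a dash, and a number.
issue_id_regex <- "[A-Z]+-\\d+"

has_issue_id <- stri_detect_regex(commit_messages, issue_id_regex)
coverage <- mean(has_issue_id)   # fraction of commits that mention an issue ID
print(coverage)
```

Here two of the four messages match, giving a coverage of 0.5.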

# Load the Excel export package (install it once with install.packages("writexl"))
library(writexl)

checkpoint_data <- openhub_combined_data[, c(
  "name", "total_code_lines", "min_month",
  "total_commit_count", "total_contributor_count",
  "twelve_month_commit_count", "twelve_month_contributor_count",
  "activity", "html_url"
)]

# Convert numeric columns from text to numbers
checkpoint_data$total_code_lines <- as.numeric(checkpoint_data$total_code_lines)
checkpoint_data$total_commit_count <- as.numeric(checkpoint_data$total_commit_count)
checkpoint_data$total_contributor_count <- as.numeric(checkpoint_data$total_contributor_count)
checkpoint_data$twelve_month_commit_count <- as.numeric(checkpoint_data$twelve_month_commit_count)
checkpoint_data$twelve_month_contributor_count <- as.numeric(checkpoint_data$twelve_month_contributor_count)

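# Keep the year and month, then anchor each date on the first day of that month
checkpoint_data$min_month <- as.Date(paste0(substr(as.character(checkpoint_data$min_month), 1, 7), "-01"))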
checkpoint_data$kloc <- round(checkpoint_data$total_code_lines / 1000, 1)
checkpoint_data$age_years <- round(as.numeric(Sys.Date() - checkpoint_data$min_month) / 365.25, 1)
checkpoint_data$jira_confirmed <- "Manual check needed"

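# Selection criteria: at least 250,000 lines of code, first recorded activity
# at least five years ago, and at least 5,000 commits
five_years_ago <- Sys.Date() - (5 * 365.25)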

filtered_data <- subset(checkpoint_data,
  total_code_lines >= 250000 &
  min_month <= five_years_ago &
  total_commit_count >= 5000
)

write_xlsx(filtered_data, "apache_strict_criteria_candidates.xlsx")
print(paste("Projects found:", nrow(filtered_data)))
## [1] "Projects found: 68"