1 Introduction
2 Configuration Section
- 2.1 Selecting an OpenHub Organization
- 2.2 API Response Folder Storage
3 Collecting and Parsing Data via Ohloh API
4 Relevant Information
- 4.1 Commit Issue Coverage

1 Introduction

1.1 Motivation

This notebook showcases how to use Kaiaulu’s OpenHub API interface functions to help select open-source projects for studies. These functions return data tables containing project information such as total contributor count, total lines of code, percentage breakdown of code languages, the number of contributors who made at least one commit in the past year, and the number of commits in the past year. The current search criteria are the organization to which the projects belong and the primary programming language used by the projects.

Unlike its peers, this notebook is concerned not with a single project but with project selection, so it needs neither the project configuration file architecture nor the project initialization.

1.2 Purpose

This notebook explains how to use the Ohloh API to acquire information on a set of projects in OpenHub’s open-source project collection, based on search parameters under an organization. The information includes, for example, LOC on the current date, the number of contributors who made at least one commit in the past 12 months, the number of commits in the past 12 months, the total commit count on the current date, and the total number of contributors on the current date.

Kaiaulu’s interface to Ohloh’s API, an API for OpenHub’s open-source project collection, relies on httr to send HTTP GET requests. The Ohloh API responds to each request with an XML response file containing nested tags.
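As a rough illustration of the request step, the sketch below builds an organization request URL and wraps an httr GET call. The URL layout, the api_key query parameter, and the helper names build_org_url and fetch_xml are assumptions made for this example based on the public Ohloh API documentation; they are not Kaiaulu's own functions.

```r
# Sketch only: the URL layout and the `api_key` parameter are assumptions,
# and both helper names are hypothetical (not part of Kaiaulu).
build_org_url <- function(org_handle, token) {
  sprintf(
    "https://www.openhub.net/orgs/%s.xml?api_key=%s",
    utils::URLencode(org_handle, reserved = TRUE),
    utils::URLencode(token, reserved = TRUE)
  )
}

# Send the GET request and return the XML body as text.
fetch_xml <- function(url) {
  response <- httr::GET(url)
  httr::stop_for_status(response)  # raise an R error on HTTP 4xx/5xx
  httr::content(response, as = "text", encoding = "UTF-8")
}

# Building the URL needs no network access; fetch_xml() is not called here.
example_url <- build_org_url("apache", "MY_TOKEN")
```

Keeping URL construction separate from the request makes it possible to inspect the URL without spending an API call.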

Kaiaulu currently defines only the API endpoints it uses (organization, portfolio_projects, project, and analysis) and parses the returned XML output into a table, keeping only the fields of interest. More endpoints and/or fields of interest per endpoint can be added in the future.
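To make the "nested tags to table" idea concrete, here is a minimal sketch that parses a small hand-written XML string with the xml2 package. Both the choice of xml2 and the response shape shown are assumptions for illustration; real responses carry many more tags, and Kaiaulu's parsers may work differently.

```r
library(xml2)

# Assumed, heavily simplified response shape (illustration only).
xml_text_response <- paste0(
  "<response><status>success</status><result><org>",
  "<name>Apache Software Foundation</name>",
  "<html_url>https://openhub.net/orgs/apache</html_url>",
  "</org></result></response>"
)

doc <- read_xml(xml_text_response)
org_node <- xml_find_first(doc, ".//org")

# Keep only the fields of interest, one column per tag.
org_table <- data.frame(
  name = xml_text(xml_find_first(org_node, "./name")),
  html_url = xml_text(xml_find_first(org_node, "./html_url")),
  stringsAsFactors = FALSE
)
```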

1.3 Create a Personal API Token

OpenHub limits the number of API calls per token per day (the maximum is set to 1000). An OpenHub account is required to acquire a personal API token; the current rate for your token and the API key registration can be found on the OpenHub website. The process of creating an account and registering for an API key is documented in the API section of the Ohloh API GitHub page and should not take more than two minutes.

The functions in Kaiaulu assume you have an OpenHub API token available, which is passed as a parameter.

1.4 Libraries

Please ensure the following R packages are installed on your computer.

rm(list = ls())
require(kaiaulu)
require(stringi)
require(data.table)
require(knitr)
require(httr)
require(gt)

2 Configuration Section

2.1 Selecting an OpenHub Organization

To start the project selection process, we may want to restrict ourselves to projects under a specific organization. The list of organizations is found on the OpenHub website. For this notebook, we search for projects under the “Apache Software Foundation” whose primary language is Java.

Below are a set of required variables for the openhub_* functions.

study_name <- "placeholder"
html_url_or_name <- "https://openhub.net/orgs/apache"
language <- "java"
token <- "<your-openhub-api-token>"  # placeholder: supply your own token and keep it out of shared notebooks

Explanation:

study_name: The name of the study for project selection that will be used to create a folder system to store API responses.
html_url_or_name: Either the URL for the organization page on OpenHub (e.g. “https://openhub.net/orgs/apache”) or the short, unique, handle for the organization (e.g. “apache”).
language: The primary code language used to filter the projects (e.g. “java”).
token: Your OpenHub API token as a character string; it can be kept in a local file named “openhub_token” rather than written into the notebook.
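One way to keep the token out of the notebook is to read it from the local token file. The sketch below is a minimal example of that idea; read_openhub_token is a hypothetical helper, and the demonstration writes a throwaway file to a temporary folder rather than assuming where your real token file lives.

```r
# Hypothetical helper: read the first line of a token file and trim whitespace.
read_openhub_token <- function(token_path) {
  if (!file.exists(token_path)) {
    stop("Token file not found: ", token_path)
  }
  trimws(readLines(token_path, n = 1, warn = FALSE))
}

# Demonstration with a throwaway file; in practice point at your own token file.
demo_token_path <- file.path(tempdir(), "openhub_token")
writeLines("not-a-real-token", demo_token_path)
demo_token <- read_openhub_token(demo_token_path)
```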

2.2 API Response Folder Storage

The following code creates the folders that store the API responses from each endpoint:

# 1. Anchor to your current working directory
github_folder <- getwd()

# 2. Define folder paths
rawdata_folder <- file.path(github_folder, "rawdata")
openhub_folder <- file.path(rawdata_folder, "openhub")
study_folder   <- file.path(openhub_folder, study_name)

organization_folder_path      <- file.path(study_folder, "organization")
portfolio_project_folder_path <- file.path(study_folder, "portfolio")
project_folder_path           <- file.path(study_folder, "project")
analysis_folder_path          <- file.path(study_folder, "analysis")

# 3. Create all folders safely
dir.create(organization_folder_path,      recursive = TRUE, showWarnings = FALSE)
dir.create(portfolio_project_folder_path, recursive = TRUE, showWarnings = FALSE)
dir.create(project_folder_path,           recursive = TRUE, showWarnings = FALSE)
dir.create(analysis_folder_path,          recursive = TRUE, showWarnings = FALSE)

3 Collecting and Parsing Data via Ohloh API

In this section, for each endpoint, we collect the data through a series of Ohloh API requests and parse the API responses with the corresponding parser function. Each parsed response is a data table, which is displayed in its subsection. Values from one endpoint may be used to build the request for the next endpoint, and merging the data tables gives a holistic view of the data for the list of projects.

3.1 Organizations

We call openhub_api_organizations to acquire the organization API response of our selected organization specified by html_url_or_name. This function downloads the API response file into the folder specified by organization_folder_path.

openhub_api_organizations(token, organization_folder_path, html_url_or_name)

With the organization API response (only one page), we may parse this response with its corresponding parser function, openhub_parse_organizations, to acquire a data table with columns representing the tags for the organization listed:

name: The name of the organization.
html_url_projects: The URL to the organization on OpenHub’s website corresponding to a list of portfolio projects for the organization.
portfolio_projects: The number of portfolio projects under the organization.

openhub_organization_api_requests <- openhub_retrieve(organization_folder_path)
openhub_organizations <- openhub_parse_organizations(openhub_organization_api_requests)
head(openhub_organizations) %>%
  gt()

Table has no data

We then take the first organization’s “html_url” column value (e.g. “https://openhub.net/orgs/apache”) and extract the short, unique handle for the organization (e.g. “apache”). Here the handle is assigned directly.

org_name <- "apache"
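The extraction described above can be sketched in base R. The helper name org_handle_from is hypothetical; it simply keeps the last path segment, so it accepts either the full organization URL or an already-short handle.

```r
# Hypothetical helper: accepts a full organization URL or a short handle.
org_handle_from <- function(html_url_or_name) {
  trimmed <- sub("/+$", "", html_url_or_name)  # drop any trailing slash
  sub("^.*/", "", trimmed)                     # keep the last path segment
}

handle_from_url  <- org_handle_from("https://openhub.net/orgs/apache")  # "apache"
handle_from_name <- org_handle_from("apache")                           # "apache"
```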

3.2 Portfolio Projects

Following a similar process to the Organizations section, we acquire the portfolio projects of the organization, “Apache Software Foundation”, that use the code language specified by language (in this case Java), by downloading the portfolio projects API response files and parsing these responses into a data table. The portfolio_projects endpoint is paginated, so we use openhub_api_iterate_pages on openhub_api_portfolio_projects to acquire each page, storing the API responses in the folder path portfolio_project_folder_path. Currently, each page of the portfolio_projects collection returns a maximum of 20 portfolio project items. To extract as many matches as possible, remove max_pages from the openhub_api_iterate_pages call; to stop after a given number of pages, set max_pages to the desired value (e.g. “max_pages=1” returns up to 20 portfolio project items and “max_pages=16” returns up to 320). If max_pages exceeds the total number of pages available, all available pages are extracted.

openhub_api_iterate_pages(token, openhub_api_portfolio_projects, portfolio_project_folder_path, org_name, max_pages=NULL)
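The relationship between max_pages and the number of items retrieved is simple arithmetic, sketched below. The function pages_to_fetch is a hypothetical illustration of the behavior described above, not Kaiaulu's implementation.

```r
items_per_page <- 20  # page size stated above for portfolio_projects

# Hypothetical helper: how many pages would be requested for a collection.
pages_to_fetch <- function(total_items, max_pages = NULL) {
  available_pages <- ceiling(total_items / items_per_page)
  if (is.null(max_pages)) available_pages else min(max_pages, available_pages)
}

all_pages    <- pages_to_fetch(320)                 # 16: no cap, everything is fetched
one_page     <- pages_to_fetch(320, max_pages = 1)  # 1: at most 20 items
capped_pages <- pages_to_fetch(45, max_pages = 16)  # 3: the cap exceeds what exists
```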

With the portfolio project API responses saved, we may parse them with the corresponding parser function, openhub_parse_portfolio_projects, to acquire a data table with columns representing the tags for each portfolio project listed. To narrow down the list of projects that the next endpoint will require (and so save API requests), we filter the data table of portfolio projects to show only projects that have Java, specified by language, as their primary code language.

name: The name of the portfolio project.
primary_language: The primary code language used by the portfolio project.
activity: The portfolio project’s activity level (Very Low, Low, Moderate, High, and Very High; the outputs in this notebook also show the values Inactive and Not Available).

# In this run the saved responses sit directly in the study folder,
# with "portfolio" prefixed to each file name, so we list them from there
portfolio_projects_api_requests <- list.files(
  study_folder,
  pattern = "^portfolioportfolio.*\\.xml$",
  full.names = TRUE
)

openhub_portfolio_projects <- openhub_parse_portfolio_projects(portfolio_projects_api_requests)
openhub_portfolio_projects <- openhub_portfolio_projects[primary_language == language]
head(openhub_portfolio_projects) %>%
  gt()

name	primary_language	activity
Apache Tomcat	java	High
Apache Ant	java	Moderate
Apache Maven 2	java	Not Available
Log4j	java	Moderate
Apache Xerces2 J	java	Not Available
Apache Commons Collections	java	Moderate

3.3 Projects

To acquire more specific information about a portfolio project, we need to access it in the project collection. The link between the portfolio_projects endpoint and the project endpoint is the “name” tag (e.g. “Apache Tomcat”); project names on OpenHub are unique. As before, we acquire the project API responses and parse them with the corresponding parser function, this time looping through the “name” column of the portfolio projects data table, openhub_portfolio_projects. For each project name, project_name, we gather its API responses by using openhub_api_iterate_pages on openhub_api_projects, storing the response files in the folder path project_folder_path and setting max_pages to 1. With the API’s collection request query command, the first page returned contains the project with the matching “name” tag, so there is no need to spend API calls searching the other pages. A single project API response page may still contain multiple project items. This is a limitation of OpenHub’s API query command, which performs a kind of “ctrl+F” search across all tags instead of letting users query a specific tag for an exact match.

# Request the first page of project results for each portfolio project
for (i in seq_len(nrow(openhub_portfolio_projects))) {
  project_name <- openhub_portfolio_projects[["name"]][[i]]

  # Fetch the data
  openhub_api_iterate_pages(token, openhub_api_projects, project_folder_path, project_name, max_pages=1)

  # Pause for 2 seconds to avoid rate-limiting
  Sys.sleep(2)
}
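Because each iteration spends an API call, it can help to keep one failed request from aborting the whole loop. The sketch below shows that pattern with tryCatch; fake_request is a hypothetical stand-in for a real API call and fails for one name on purpose.

```r
# Hypothetical stand-in for an API request; it fails for one name on purpose.
fake_request <- function(project_name) {
  if (project_name == "Broken Project") stop("simulated HTTP error")
  invisible(TRUE)
}

project_names <- c("Apache Tomcat", "Broken Project", "Apache Ant")
failed <- character(0)

for (project_name in project_names) {
  ok <- tryCatch({
    fake_request(project_name)
    TRUE
  }, error = function(e) {
    message("Skipping ", project_name, ": ", conditionMessage(e))
    FALSE
  })
  if (!ok) failed <- c(failed, project_name)  # remember what to retry later
  Sys.sleep(0)  # replace 0 with the pause needed between real requests
}
```

The failed vector can then be retried in a later session without repeating the requests that already succeeded.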

With the project API responses saved, we may parse them with the corresponding parser function, openhub_parse_projects, to acquire two data tables with columns representing the tags for each project listed. The first data table, project_data, has the columns:

name: The name of the project.
id: The project’s unique ID.
html_url: The project’s url to the current Project’s details page on OpenHub.

# In this run the saved responses sit directly in the study folder,
# with "project" prefixed to each file name, so we list them from there
projects_api_requests <- list.files(
  study_folder,
  pattern = "^projectproject.*\\.xml$",
  full.names = TRUE
)

openhub_projects <- openhub_parse_projects(projects_api_requests)
project_data <- openhub_projects[[1]]
project_links <- openhub_projects[[2]]
head(project_data) %>% gt()

name	id	html_url
Apache Tomcat	3562	https://openhub.net/p/tomcat
Apache Tomcat Resources Manager Module	130685	https://openhub.net/p/tomcat-res-mgr
psi-probe	328248	https://openhub.net/p/psi-probe
Apache TomEE	632761	https://openhub.net/p/tomee
puppetlabs-tomcat	532962	https://openhub.net/p/puppetlabs-tomcat
JBoss Application Server	480	https://openhub.net/p/jboss

The second data table, project_links has the columns for each project listed (if a project doesn’t have any external links listed in its database, it will not appear in this data table):

name: The name of the project.
id: The project’s unique ID.
project_link_title: The title description of the specific external link associated with the project.
project_link_category: The category description of the specific external link associated with the project.
project_link_url: The URL of the specific external link associated with the project.

project_links <- openhub_projects[[2]]

head(project_links) %>%
  gt()

name	id	project_link_title	project_link_category	project_link_url
Apache Tomcat	3562	Tomcat Mailing Lists	Forums	http://tomcat.apache.org/lists.html
Apache Tomcat	3562	Tomcat FAQ	Documentation	http://tomcat.apache.org/faq/
Apache Tomcat	3562	Tomcat Bug Database / Issue Tracker	Documentation	http://tomcat.apache.org/bugreport.html
Apache Tomcat	3562	Book: Apache Tomcat Bible	Other	http://www.amazon.com/Apache-Tomcat-Bible-Jon-Eaves/dp/0764526065
Apache Tomcat	3562	Book: Tomcat: The Definitive Guide	Other	http://www.amazon.com/Tomcat-Definitive-Guide-Jason-Brittain/dp/0596101066
JBoss Application Server	480	JBoss Forums	Forums	http://community.jboss.org/en/jbossas

We combine the portfolio_projects and project data tables into one data table, openhub_combined_projects, by performing an inner-join by “name” column. We add an additional filter, unique by “id” column, to ensure no duplication due to the projects endpoint possibly returning duplicated project items from the OpenHub API’s collection request query command.

openhub_combined_projects <- unique(merge(project_data, openhub_portfolio_projects, by = "name", all = FALSE, allow.cartesian = TRUE), by = "id")
head(openhub_combined_projects) %>%
  gt()

name	id	html_url	primary_language	activity
Agila	13636	https://openhub.net/p/agila	java	Not Available
AntLib DotNet	3609	https://openhub.net/p/p_3609	java	Very Low
AntLib SVN	3610	https://openhub.net/p/p_3610	java	Inactive
Apache ACE	347981	https://openhub.net/p/apache_ace	java	Not Available
Apache Abdera	4718	https://openhub.net/p/p_d_4718	java	Inactive
Apache Accumulo	588272	https://openhub.net/p/accumulo	java	High
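The effect of the inner join followed by the unique filter can be seen on a toy example. The two small tables below are stand-ins built from names and ids shown in the outputs above; they are not real parser output.

```r
library(data.table)

# Toy stand-ins for the parsed tables (illustration only).
toy_portfolio <- data.table(
  name = c("Apache Tomcat", "Apache Ant"),
  primary_language = "java"
)
toy_projects <- data.table(
  name = c("Apache Tomcat", "psi-probe", "Apache Tomcat"),
  id = c("3562", "328248", "3562")
)

# The inner join keeps only names present in both tables;
# unique() then drops rows that repeat an id.
toy_combined <- unique(
  merge(toy_projects, toy_portfolio, by = "name", all = FALSE, allow.cartesian = TRUE),
  by = "id"
)
```

Here “psi-probe” is dropped because it is not a portfolio project, “Apache Ant” is dropped because it has no project row in the toy data, and the duplicated “Apache Tomcat” row collapses to one.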

3.4 Analyses

The previously acquired “id” tag (represented as a column) for each project allows us to acquire the project’s latest analysis collection, which contains many important metrics. Following the same logic as the Projects section, we loop through each project in openhub_combined_projects and call openhub_api_analyses to acquire the analysis API response for the project specified by project_id. This function downloads the API response file into the folder specified by analysis_folder_path. We also pass the project name to openhub_api_analyses for use in the name of the stored API response file. The timestamp in the name of an analysis API response file is NOT the time the response was requested; it is the time, according to OpenHub’s database, at which the analysis collection was last calculated for that project.

# Request the latest analysis for each selected project
for (i in seq_len(nrow(openhub_combined_projects))) {
  project_name <- openhub_combined_projects[["name"]][[i]]
  project_id <- openhub_combined_projects[["id"]][[i]]
  openhub_api_analyses(token, analysis_folder_path, project_id, project_name)
  # Pause for 2 seconds between requests
  Sys.sleep(2)
}
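Since the token is limited to 1000 calls per day, it can be useful to estimate the number of requests before running the loops. The sketch below is a rough upper bound under stated assumptions: one organization call, one call per portfolio page of 20 items, and one project call plus one analysis call per selected project. The function name and the input sizes are hypothetical.

```r
daily_limit <- 1000    # per-token daily limit stated in section 1.3
items_per_page <- 20   # portfolio_projects page size

# Hypothetical helper: rough upper bound on API calls for one run.
estimate_api_calls <- function(n_portfolio_projects, n_selected_projects) {
  organization_calls <- 1
  portfolio_calls <- ceiling(n_portfolio_projects / items_per_page)
  project_calls <- n_selected_projects   # one page per project (max_pages = 1)
  analysis_calls <- n_selected_projects  # at most one analysis request per project
  organization_calls + portfolio_calls + project_calls + analysis_calls
}

# Hypothetical sizes, for illustration only.
calls_needed <- estimate_api_calls(n_portfolio_projects = 320, n_selected_projects = 200)
fits_in_one_day <- calls_needed <= daily_limit
```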

With the analysis API response (only one page), we may parse this response with its corresponding parser function, openhub_parse_analyses, to acquire a data table with columns representing the tags for each analysis listed.

id: The project’s unique ID.
min_month: OpenHub’s first recorded year and month of the project’s data (typically the month of the project’s first commit; shown in the output below as the first day of that month, in YYYY-MM-DD format).
twelve_month_contributor_count: The number of contributors who made at least one commit to the project source code in the past twelve months.
total_contributor_count: The total number of contributors who made at least one commit to the project source code since the project’s inception.
twelve_month_commit_count: The total number of commits to the project source code in the past twelve months.
total_commit_count: The total number of commits to the project source code since the project’s inception.
total_code_lines: The most recent total count of all source code lines.
code_languages: A percentage breakdown of the languages in the project’s source code. OpenHub determines which languages are substantial and groups the less-contributing ones under “Other”. The parser, openhub_parse_analyses, reads each language entry and uses stringi to combine the percentages into a single string for the project.

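# In this run the saved responses sit directly in the study folder,
# with "analysis" prefixed to each file name, so we list them from there
analyses_api_requests <- list.files(
  study_folder,
  pattern = "^analysis.*\\.xml$",
  full.names = TRUE
)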

openhub_analyses <- openhub_parse_analyses(analyses_api_requests)
head(openhub_analyses) %>%
  gt()

id	min_month	twelve_month_contributor_count	total_contributor_count	twelve_month_commit_count	total_commit_count	total_code_lines	code_languages
13636	2004-10-01	0	3	0	65	46754	73% Java, 15% XML, 6% XML Schema, 6% 5 Other
3609	2005-04-01	2	4	3	88	7487	30% HTML, 52% Java, 17% XML, 1% 2 Other
3610	2005-04-01	0	8	0	45	2600	66% Java, 17% HTML, 11% XML, 6% XSL Transformation
4718	2008-12-01	0	12	0	181	74058	80% Java, 19% XML, 1% 5 Other
588272	2011-10-01	17	215	807	14940	525579	91% Java, 5% JavaScript, 4% 7 Other
347981	2009-05-01	0	25	0	2352	155233	82% Java, 11% XML, 5% CSS, 2% 2 Other
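The code_languages column holds one combined string per project. If the individual percentages are needed for filtering, the string can be split back apart, as in the minimal base R sketch below. The helper parse_code_languages is hypothetical, and it treats an entry such as “<1%” as 1, which is a simplification.

```r
# Hypothetical helper: split one code_languages string into a small table.
parse_code_languages <- function(code_languages) {
  parts <- trimws(strsplit(code_languages, ",", fixed = TRUE)[[1]])
  data.frame(
    # leading percentage, e.g. "73%" or "<1%" ("<1" is kept as 1 here)
    percent = as.numeric(sub("^<?([0-9]+)%.*$", "\\1", parts)),
    # everything after the percentage, e.g. "Java" or "5 Other"
    language = sub("^<?[0-9]+%[[:space:]]*", "", parts),
    stringsAsFactors = FALSE
  )
}

breakdown <- parse_code_languages("73% Java, 15% XML, 6% XML Schema, 6% 5 Other")
```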

3.5 Combining the Data

Lastly, we merge the joined portfolio_projects and project data table, openhub_combined_projects, with the analysis data table, openhub_analyses, into one data table, openhub_combined_data, by performing an inner join on the “id” column.

openhub_combined_data <- merge(openhub_combined_projects, openhub_analyses, by = "id", all = FALSE)
head(openhub_combined_data) %>%
  gt()

id	name	html_url	primary_language	activity	min_month	twelve_month_contributor_count	total_contributor_count	twelve_month_commit_count	total_commit_count	total_code_lines	code_languages
10483	Apache FtpServer	https://openhub.net/p/ftpserver	java	Not Available	2003-03-01	1	29	1	1521	45466	93% Java, 7% 6 Other
10942	Apache Shindig	https://openhub.net/p/shindig	java	Inactive	2007-12-01	0	74	0	6788	348487	57% Java, 35% JavaScript, 6% XML, 2% 8 Other
11082	Tiles	https://openhub.net/p/tiles	java	Not Available	2007-01-01	0	9	0	785	32314	60% Java, 26% HTML, 14% XML, <1% 2 Other
11242	Apache Click	https://openhub.net/p/apache-click	java	Not Available	2004-12-01	0	18	0	5749	139054	72% Java, 11% JavaScript, 9% HTML, 8% 3 Other
11590	Apache XML-RPC	https://openhub.net/p/p_11590	java	Not Available	2014-01-01	0	5	0	1	11168	91% Java, 9% XML
12001	Apache Avalon	https://openhub.net/p/p_12001	java	Inactive	2000-11-01	0	41	0	9651	156787	60% Java, 27% XML, 9% C#, 4% 10 Other

4 Relevant Information

4.1 Commit Issue Coverage

If you’re interested in verifying whether a project labels its commits with issue IDs and whether it has distinct issue types (e.g. “bug”, “feature”, “security bug”, “refactoring”), which is outside the scope of this notebook, please review this short explanation:

Checking for issue IDs requires parsing the project’s git log; you can then use this function on the resulting table. See this notebook for example usage. That notebook takes its regex from the project configuration file. You will need to inspect the git log manually to determine whether any issue IDs are present, specify the matching regex in the Kaiaulu configuration file, and then have Kaiaulu calculate the metric. This step cannot be automated, because the conventions vary across projects and may not be used at all.
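As a rough illustration of what the coverage check measures, the sketch below applies a regex to a handful of made-up commit messages and reports the share that cite an issue ID. The regex assumes JIRA-style keys such as “PROJ-123” and the messages are invented; this is not the Kaiaulu function referred to above.

```r
# Assumption: JIRA-style keys such as "PROJ-123"; adjust the regex per project.
issue_id_regex <- "[A-Z][A-Z0-9]+-[0-9]+"

# Made-up commit messages standing in for a parsed git log.
commit_messages <- c(
  "PROJ-101 Fix connector timeout",
  "Update README",
  "Refactor session handling (PROJ-87)",
  "Bump version"
)

has_issue_id <- grepl(issue_id_regex, commit_messages)
issue_coverage <- mean(has_issue_id)  # share of commits that cite an issue ID
```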

# Load the Excel export tool (run install.packages("writexl") first if it is not installed)
library(writexl)

checkpoint_data <- openhub_combined_data[, c(
  "name", "total_code_lines", "min_month",
  "total_commit_count", "total_contributor_count",
  "twelve_month_commit_count", "twelve_month_contributor_count",
  "activity", "html_url"
)]

# Convert numeric columns from text to numbers
checkpoint_data$total_code_lines <- as.numeric(checkpoint_data$total_code_lines)
checkpoint_data$total_commit_count <- as.numeric(checkpoint_data$total_commit_count)
checkpoint_data$total_contributor_count <- as.numeric(checkpoint_data$total_contributor_count)
checkpoint_data$twelve_month_commit_count <- as.numeric(checkpoint_data$twelve_month_commit_count)
checkpoint_data$twelve_month_contributor_count <- as.numeric(checkpoint_data$twelve_month_contributor_count)

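# min_month may arrive as "YYYY-MM" or "YYYY-MM-DD"; keep the year-month part and anchor it to the first day
checkpoint_data$min_month <- as.Date(paste0(substr(checkpoint_data$min_month, 1, 7), "-01"))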
checkpoint_data$kloc <- round(checkpoint_data$total_code_lines / 1000, 1)
checkpoint_data$age_years <- round(as.numeric(Sys.Date() - checkpoint_data$min_month) / 365.25, 1)
checkpoint_data$jira_confirmed <- "Manual check needed"

five_years_ago <- Sys.Date() - (5 * 365.25)

filtered_data <- subset(checkpoint_data,
  total_code_lines >= 250000 &
  min_month <= five_years_ago &
  total_commit_count >= 5000
)

write_xlsx(filtered_data, "apache_strict_criteria_candidates.xlsx")
print(paste("Projects found:", nrow(filtered_data)))

## [1] "Projects found: 68"

OpenHub API Interfacing for Project Search