This series covers the processes I have followed analyzing the latest releases of the Libraries.io dataset. If you are looking for a deep dive into any of the theory or mathematics behind my analysis, this is not the series for you. This post is primarily to introduce the Libraries.io dataset to people new to data analysis.

Background

About a year ago, I began working as a program manager in an open source software development team. I was very new to the open source community (and still am!), but I knew the ecosystem was growing and diversifying at a rapid pace due to companies and governments adopting open source software at rates that would have been incomprehensible 20 years ago. For example, Kubernetes is open source container orchestration software that was created by Google, grew a massive software ecosystem, and is now embraced by enterprises around the globe like Microsoft, Amazon and Alibaba. In fact, Microsoft is so smitten by Kubernetes that the company recently deprecated its Azure Container Service in favor of the newly created Azure Kubernetes Service.



Companies are adopting and monetizing open source software for a variety of reasons, but how much do we truly know about open source development? When I spoke to my manager about my interest in using data analysis within the context of open source development, she asked me a compelling question that I had no clue how to answer: “What are the most prominent open source APIs being offered on GitHub or other large repository hosts by Infrastructure as a Service (IaaS) Cloud Service Providers (CSPs) and what is their purpose?” The question only raised more questions for me, like “What other repo hosts are used on a large scale besides GitHub?”, “Why focus on APIs?” and “How do I define prominence?” I knew it would take time to formulate a confident response to the question; this is where my journey - and this multi-part series - began.

Part 1 - What is a SourceRank score?

Introduction to Git

I was new to both the open source community and data analysis, so I started my research with the basics. I knew that projects were made up of code that was held in repositories (like folders), which were typically hosted on sites like GitHub. Having that basic information, I began further research into the greater software development process.

Development

Teams of open source software developers build applications using Git based version control repository management services, which have become key components of modern software development workflows. These services provide automated distribution of open source software between the people who develop it (developers) and the people who use it (users). Version control ensures that the entire distribution channel is provided with either the newest, safest version of necessary code elements (for users), or the less stable, “in progress” code elements (for developers), in parallel. This means multiple versions of code must exist in tandem without causing usability issues to either users or developers.




Git workflow from developer perspective.
Image source: Lessons Learned Teaching Git



The code elements are packaged into Projects and Repositories, managed using Package Managers, and housed on repository hosting sites, the most popular being GitHub, GitLab and Bitbucket. Collections of software projects that are developed and evolve together - due to dependencies and shared developer communities - make up software ecosystems. To answer initial questions about project prominence as well as more targeted questions about IaaS CSP APIs, I knew I’d need to find a data set that focused on total ecosystems, i.e. projects, repositories, and package managers.

Data

After some trial and error, I discovered the Libraries.io dataset, which is the output of an open source project (recently acquired by Tidelift) that indexes over 25 million other open source projects across the web.



Libraries.io gathers data from 36 package managers and 3 source code repositories. We track over 2.7m unique open source packages, 33m repositories and 235m interdependencies between them. This gives Libraries.io a unique understanding of open source software.
source: Zenodo



What does this mean? Every few months the Libraries.io team releases a dataset that can be analyzed to discover information about open source software ecosystems. The dataset released in March 2018 contains seven CSV files, listed below. Because I was specifically interested in projects and repositories, I decided to begin with the Projects with related Repository fields file.



CSV file listing for Libraries.io dataset

| File name | Description |
| --- | --- |
| Projects | A project is a piece of software available on any one of the 34 package managers supported by Libraries.io. |
| Versions | A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool. |
| Tags | A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers. |
| Dependencies | Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to a Version. Each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects. |
| Repositories | A Libraries.io repository represents a publicly accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects: they are not distributed via a package manager and are typically an application for end users rather than a component to build upon. |
| Repository dependencies | A repository dependency is a dependency upon a Version from a package manager that has been specified in a manifest file, either as a manually added dependency committed by a user or as a generated dependency listed in a lockfile that has been automatically generated by a package manager and committed. |
| Projects with related Repository fields | This is an alternative projects export that denormalizes a project’s related source code repository inline to reduce the need to join between two data sets. |


According to Benjamin Nickols (creator and contributor to Libraries.io):

Projects and repositories are one of the key distinctions made in this dataset. Projects are typically the components distributed through one or more package managers. Repositories may belong to a project but most frequently they are consumers, incorporating projects into an application or service.
source: Opensource.com



The Projects with related Repository fields file has a total of 2,556,311 project records with 64 variables including status, keywords, repository size, language, stars count and more. Using these values, users can generate insights about projects hosted on the most popular repository hosting sites.
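A first pass over a file this size can be as simple as counting records and listing column names. The sketch below streams a CSV once using only the Python standard library. The three sample rows and the snake_case column names are stand-ins I made up for illustration (the real export has 64 columns and its headers may be spelled differently), so treat this as a starting point rather than the exact steps I followed.

```python
import csv
import io

# Tiny stand-in for the real export. Column names are assumptions borrowed
# from the variable names used later in this post.
SAMPLE_CSV = """name,platform,status,language,repository_host_type,repository_stars_count,sourcerank
mocha,NPM,,JavaScript,GitHub,14849,32
rails,Rubygems,,Ruby,GitHub,38926,31
Data-Dumper,CPAN,,Perl,,,9
"""

def summarize(handle):
    """Stream a CSV once and return (record_count, column_names)."""
    reader = csv.DictReader(handle)
    record_count = sum(1 for _ in reader)  # never holds more than one row in memory
    return record_count, list(reader.fieldnames or [])

# For the real file, pass open(path, newline="", encoding="utf-8") instead.
record_count, columns = summarize(io.StringIO(SAMPLE_CSV))
print(f"{record_count} records, {len(columns)} variables")
print(columns)
```

Streaming row by row keeps memory use flat, which matters more for a 2.5 million record file than it does for this three-row sample.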

Exploratory Analysis

I began my analysis with some high level exploration of the data. I first wanted to get a sense of how the populations of projects were distributed across repository hosts, and what methods were being used to assess their popularity. To start, I found that the 2.5 million unique projects in the dataset span the three most popular repository hosting sites, with GitHub hosting the vast majority of both projects and repositories.





Interestingly, 25% of project IDs had unknown repository hosting sites. It is unclear where these projects are hosted. A quick look, however, gives us an idea of what platforms are typical when hosting is unknown.
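The kind of "quick look" described here can be sketched as a filter plus a tally. The platform and host pairs below are hypothetical, chosen only so that a quarter of the hosts are missing; they are not figures from the dataset.

```python
from collections import Counter

# Hypothetical (platform, repository_host_type) pairs; "" marks an unknown host.
projects = [
    ("NPM", "GitHub"), ("CPAN", ""), ("Rubygems", "GitHub"), ("NPM", "GitLab"),
    ("CPAN", ""), ("Pypi", "Bitbucket"), ("NPM", "GitHub"), ("Maven", ""),
    ("Rubygems", "GitHub"), ("NPM", "GitHub"), ("Pypi", "GitHub"), ("NPM", "GitHub"),
]

# Keep only the projects with no known host, then tally their package managers.
unknown = [platform for platform, host in projects if not host]
unknown_share = len(unknown) / len(projects)
platforms_when_unknown = Counter(unknown).most_common()

print(f"unknown host: {unknown_share:.0%}")
print(platforms_when_unknown)
```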


Even with 25% of the population hosted on unknown sites, it is clear that GitHub dominates the open source repository hosting market in terms of projects, repositories and (as evidenced below) dependencies. According to the charts below, when repositories and projects are dependent on the functionality of other repositories and projects, there is a greater chance that they are hosted on GitHub.





After looking at the entirety of the project population, it was clear that I needed a way to remove some of the “noise” within the data to differentiate popular projects from unpopular ones. Out of the 60+ variables available to me, what could be used to define success criteria in order to whittle the dataset down to a manageable set of projects that could be considered “prominent”? It seemed like there were two ways to score the success of a project: Repository Stars Count and SourceRank.



| Field | Field Purpose |
| --- | --- |
| Repository Stars Count | Number of stars on the repository, only available for GitHub and GitLab. |
| SourceRank | Libraries.io defined score based on quality, popularity and community metrics. |



To start my study, I chose to use an individual project’s SourceRank score as the definition of success because it spanned all repository hosts (GitLab, GitHub and Bitbucket), as opposed to stars count, which is not a feature on Bitbucket. Also, the SourceRank score is more granular, in that it is specific to each unique project, whereas stars count is specific to a full repository (which could host many projects). According to the SourceRank repository, SourceRank “…is the metric that Libraries.io calculates for each project to produce a number that can be used for sorting in lists and weighting in search results as well as encouraging good practises in open source projects to improve quality and discoverability.” source: GitHub.com


SourceRank is the name for the algorithm that we use to index search results. The maximum score for SourceRank is currently around 30 points. Our analysis is broken down into: Code, Community, Distribution, Documentation, Usage.

source: Libraries.io



First, I looked at some general metrics in order to understand how the entire population of projects was distributed across the SourceRank scores, which range from 0 to 32 (higher is better). As we can see on the graphs below, 92.9% of all projects have SourceRank scores that fall below 10 points. In comparison, 92.3% of projects with SourceRank scores >= 10 hold the highest average number of dependent projects.



Figure 1


Figure 2



What does this tell us? Essentially, the majority of projects across GitHub, GitLab and Bitbucket - ~92% - depend heavily on a very small percentage - ~7% - of projects, which tend to have SourceRank scores >= 10. If any of these highly rated projects were to cease maintenance, a large portion of our software infrastructure would be affected. With this new data point in mind, I chose to focus my analysis on projects that had SourceRank scores of 10 or above. Doing this allowed me to remove the noise of less useful or “dead” repositories, leaving me with a dataset consisting of roughly 200,000 records, most of which were hosted on GitHub.
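The split at a SourceRank of 10 and the comparison of average dependents on either side can be sketched in a few lines. The ten score and dependent-count pairs below are hypothetical and only mimic the shape of the real distribution (many low scores with few dependents, a few high scores with many).

```python
from statistics import mean

# Hypothetical (sourcerank, dependent_projects_count) pairs.
projects = [(0, 0), (2, 0), (3, 1), (4, 0), (5, 2),
            (6, 0), (7, 3), (9, 2), (12, 40), (25, 900)]
THRESHOLD = 10  # the cut-off used in this post

low = [p for p in projects if p[0] < THRESHOLD]
high = [p for p in projects if p[0] >= THRESHOLD]

share_low = len(low) / len(projects)
mean_dependents_low = mean(dependents for _, dependents in low)
mean_dependents_high = mean(dependents for _, dependents in high)

print(f"{share_low:.0%} of projects score below {THRESHOLD}")
print(f"mean dependents: {mean_dependents_low} (low) vs {mean_dependents_high} (high)")
```

Keeping only `high` is the filtering step; everything after this point in the post works with that subset.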



I then used RStudio to clean up the data to get a final sample size of 198,558 records and 46 variables. The table below lists the chosen final variables, along with the variable abbreviations and their initial cardinalities; some variables were re-binned to lower cardinalities in the analysis.

Final Variable List
Long Name Short Name Cardinality Definition
_id PjID 0 The unique primary key of the project in the Libraries.io database.
repository_id ReID 0 The unique primary key of the repository for this project in the Libraries.io database.
status PjStat 4 Either Active, Deprecated, Unmaintained, Help Wanted or Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by a Libraries.io user via the “project suggestion” feature.
repository_host_type ReHost 3 Which website the repository is hosted on, either GitHub, GitLab or Bitbucket.
repository_forks ReFork 3 Is the repository a fork of another.
repository_issues_enabled ReIsEn 2 Is the bug tracker enabled for this repository?.
repository_wiki_enabled ReWiEn 2 Is the wiki enabled for this repository?.
repository_pages_enabled RePgEn 2 Is GitHub pages enabled for this repository? only possible for GitHub.
repository_status ReStat 6 Either Active, Deprecated, Unmaintained, Help Wanted or Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by a Libraries.io user via the “repo suggestion” feature.
repository_pull_requests_enabled RePREn 4 Are pull requests enabled for this repository? Only available for GitLab repositories.
LatestRelPublishYear PjLaRelY 16 Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
LatestRelPublishMonth PjLaRelM 14 Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
LatestRelPublishDay PjLaRelD 31 Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
RepoCreatedYear ReCrYr 14 Timestamp of when the repository was created on the host.
RepoCreatedMonth ReCrMo 14 Timestamp of when the repository was created on the host.
RepoCreatedDay ReCrDa 31 Timestamp of when the repository was created on the host.
RepoUpdatedYear ReUpYr 8 Timestamp of when the repository was last saved by Libraries.io.
RepoUpdatedMonth ReUpMo 14 Timestamp of when the repository was last saved by Libraries.io.
RepoUPdatedDay ReUpDa 31 Timestamp of when the repository was last saved by Libraries.io.
RepoLastPushYear ReLaPuYr 7 Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastPushMonth ReLaPuMo 13 Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastPushDay ReLaPuDa 31 Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastSyncYear ReLaSyYr 6 Timestamp of when Libraries.io last synced the repository from the host API.
RepoLastSyncMonth ReLaSyMo 15 Timestamp of when Libraries.io last synced the repository from the host API.
RepoLastSyncDay ReLaSyDa 32 Timestamp of when Libraries.io last synced the repository from the host API.
platform_bin_PM PjPlBPM 6 name of the package manager the project is available on
language_bin100 PjLaBA 37 Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
language_bin1000 PjLaBB 20 Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
language_bin9000 PjLaBC 7 Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
repository_language_bin100 ReLaBA 34 Primary programming language the project is written in, only available for GitHub and Bitbucket.
repository_language_bin1000 ReLaBB 20 Primary programming language the project is written in, only available for GitHub and Bitbucket.
repository_language_bin9000 ReLaBC 9 Primary programming language the project is written in, only available for GitHub and Bitbucket.
licenses_bin1000 PjLiBA 14 Comma separated array of SPDX identifiers for licenses declared in package manager meta data or submitted manually by a Libraries.io user via the “project suggestion” feature.
repo_licenses_bin1000 ReLiBA 12 SPDX identifier of the license of the repository, only available for GitHub repositories.
project_key_iaas Pjkent 14 Key enterprise IaaS Cloud Service Providers
versions_count PjVerCo 16 Number of published versions of the project found by Libraries.io.
dependent_projects_count PjDepPjC 7 Number of other projects that declare the project as a dependency in one or more of their versions.
dependent_repositories_count PjDepReC 12 The total count of open source repositories that list the project as a dependency as detected by Libraries.io.
repository_size ReSize 10 Size of the repository in kilobytes, only available for GitHub and Bitbucket.
repository_stars_count ReStCo 11 Number of stars on the repository, only available for GitHub and GitLab.
repository_forks_count ReFkCo 12 Number of forks of this repository.
repository_open_issues_count ReOpIsC 12 Number of open issues on the repository bug tracker, only available for GitHub and GitLab.
repository_watchers_count ReWaC 12 Number of subscribers to all notifications for the repository, only available for GitHub and Bitbucket.
repository_contributors_count ReCoCo 12 Number of unique contributors that have committed to the default branch.
repository_sourcerank ReSouR 11 Libraries.io defined score based on quality, popularity and community metrics.
sourcerank PjSouRa 2 Libraries.io defined score based on quality, popularity and community metrics.


Methodology



I still wanted a clearer understanding of what made up a project’s SourceRank score and which variables within my dataset had an effect on the score, so I could really understand what “prominent” in this case truly meant. To do this, I ran exploratory data mining models using Occam (an acronym for “Organizational Complexity Computation And Modeling”), a software package developed at Portland State University that specializes in a discrete, probabilistic modeling method called Reconstructability Analysis (RA). A good RA model is one that captures information (successful predictions), reduces uncertainty (%ΔH) and has low complexity (fewer degrees of freedom (df)). If you’d like to learn more about RA modeling methodology, you can read more about it here.

The aim of my study was to gain insight into which variables had a relationship with “prominent” projects - projects with SourceRank scores of 10 or above. This would involve running a directed reconstructability analysis search for a model that would predict a dependent variable (SourceRank score) from a set of predictors (the other variables).



Reconstructability Analysis Model Types



Using RStudio’s binning function, I bucketed the project SourceRank scores into two scoring intervals that were equal in content: scores that were 10-11 went into bucket 1, and scores that were 11-32 went into bucket 2. I ran three searches - two “coarse” and one “fine”; a final best model was selected from the fine search and is summarized below.
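“Equal in content” binning means choosing cut points so that each bucket holds roughly the same number of records, rather than the same width of scores. I did this in R; the Python sketch below is only an illustration of the idea, with hypothetical scores and helper names of my own.

```python
def equal_frequency_cuts(values, n_bins=2):
    """Return cut points that split the sorted values into n_bins groups of similar size."""
    ordered = sorted(values)
    return [ordered[len(ordered) * i // n_bins] for i in range(1, n_bins)]

def assign_bucket(value, cuts):
    """Bucket 1 holds values below the first cut; each cut reached moves the value up one bucket."""
    return 1 + sum(value >= cut for cut in cuts)

# Hypothetical SourceRank scores, all already at 10 or above.
scores = [10, 10, 10, 10, 10, 11, 12, 14, 21, 32]

cuts = equal_frequency_cuts(scores)
buckets = [assign_bucket(score, cuts) for score in scores]

print("cut points:", cuts)
print("bucket sizes:", buckets.count(1), buckets.count(2))
```

Because SourceRank is an integer score with many ties, real buckets are rarely exactly equal in size; every record with the boundary score has to land on one side of the cut.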




Hypothesis

Ignoring repository SourceRank score and stars count (which I am assuming will be associated with a high project SourceRank score), I’m expecting the top 3-4 predictors for a higher SourceRank score to be the variables listed below.



| Variable | Variable Purpose | Hypothesis |
| --- | --- | --- |
| Repository_forks_count | Number of forks of this repository. | Forks are modified copies of repositories or projects. They allow developers to freely experiment with projects without affecting the original source code. I hypothesize that a higher number of forks signals higher developer engagement, which signals high need for (at least a portion of) that source code. |
| Dependent_projects_count | Number of other projects that declare the project as a dependency in one or more of their versions. | If a project has a high number of other projects and/or repositories that are dependent on it to function, it signals developers’ “trust” in the fact that the original project will function. This means that the original project most likely follows standards and protocols, is licensed properly, adheres to consistent versioning practices, and is generally a good community partner. |
| Dependent_repositories_count | The total count of open source repositories that list the project as a dependency as detected by Libraries.io. | See Dependent_projects_count; the same hypothesis covers both dependency counts. |
| Repository_contributors_count | Number of unique contributors that have committed to the default branch. | A high number of contributors means that there is a large community of developers ready to deal with any issues that may arise with the usability of the project. More contributors allow for faster fixes, less “down time” and higher levels of community engagement. |



For the initial coarse search, the top 10 single-predictor models are listed with their complexity (Δdf), the % reduction of uncertainty in the dependent variable (%ΔH), the % correct (%C) and their BIC (ΔBIC) - a metric that takes accuracy and complexity into account - all measured relative to the independence model. These metrics make it clear which of the variables within the data set have the strongest relation to a SourceRank score. Each model’s predictive power will be compared to our independence model (baseline) %C of 57%.



Top 10 Coarse Models - Single Predictor
Model dDF Alpha dBIC %dH(DV) %C(Data)
IV:PjdeprecPjsoura 4 0 40850.49 21.47 74.38
IV:ResourPjsoura 4 0 37776.27 19.85 72.05
IV:PjdeppjcPjsoura 3 0 23475.69 12.34 69.09
IV:RestcoPjsoura 4 0 13830.56 7.28 64.69
IV:RefkcoPjsoura 4 0 11838.29 6.24 63.86
IV:RecocoPjsoura 4 0 10707.52 5.64 63.05
IV:RewacPjsoura 4 0 7703.47 4.07 62.00
IV:ReopiscPjsoura 4 0 7186.09 3.80 61.26
IV:RelapuyrPjsoura 4 0 6360.29 3.36 60.67
IV:PjvercoPjsoura 4 0 6359.73 3.36 61.00
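To make the %dH(DV) and %C(Data) columns above more concrete, here is a toy calculation for a single predictor. I am assuming the usual definitions: %ΔH is the share of the dependent variable’s entropy removed by knowing the predictor, and %C is the share of records classified correctly when we always guess the most common bucket for each predictor bin. The counts are hypothetical and this is not Occam’s own code.

```python
from math import log2

# Hypothetical counts: rows are bins of one predictor,
# columns are the (low, high) SourceRank buckets.
table = [
    [40, 10],  # predictor bin 1
    [15, 35],  # predictor bin 2
]

def entropy(counts):
    """Shannon entropy in bits of a list of counts."""
    total = sum(counts)
    return -sum(c / total * log2(c / total) for c in counts if c)

n = sum(map(sum, table))
dv_totals = [sum(column) for column in zip(*table)]

h_dv = entropy(dv_totals)                                          # uncertainty with no predictor
h_dv_given_iv = sum(sum(row) / n * entropy(row) for row in table)  # uncertainty left after using it
pct_dh = 100 * (h_dv - h_dv_given_iv) / h_dv

baseline_pct_c = 100 * max(dv_totals) / n                 # always guess the majority bucket
model_pct_c = 100 * sum(max(row) for row in table) / n    # guess the majority bucket per bin

print(f"%dH = {pct_dh:.2f}")
print(f"%C baseline = {baseline_pct_c:.0f}, with predictor = {model_pct_c:.0f}")
```

In this toy table the baseline plays the role of the 57% independence model: it is what you get by ignoring every predictor.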



According to our results, a project’s dependent repository count - Pjdeprec - is the strongest single predictor for a project’s SourceRank score. Essentially, if we had to guess whether a SourceRank score for a project will be high or low, using only these variables, we’d be able to guess correctly most often by looking at a project’s dependent repository count. In fact, if we were to take this variable into account when guessing whether or not a project will have a high (>11) or low (<11) SourceRank score, we could reduce the baseline of uncertainty in our final guess by 21%, as evidenced above, boosting our %C prediction rating to 74%.

The 2nd strongest single predictor is the repository SourceRank score, which makes sense - a project with a high SourceRank score will probably live in a repository with a high SourceRank score. Surprisingly, repository stars count is the fourth best predictor, only reducing uncertainty by 7%. Stars count is an entirely community led metric, however, that is focused on the entire repository, so it may not be truly representative of a specific project’s actual popularity.

When we look at models with multiple predictors, we see largely expected results. This time, I’m looking at variables with greater granularity and I chose to remove the repository SourceRank variable - Resour - from our view. While project and repository scores differed enough for repository SourceRank score to be a possible predicting variable, I wanted to understand more about SourceRank in general, and therefore, felt that removing this very similar variable from the overall study was necessary.



Top 5 Coarse Models - Multiple Predictors
Model dDF Alpha dBIC %dH(DV) %C(Data)
IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura 399 0 70228.48 39.34 80.02
IV:ReprenPjdeppjcPjdeprecRefkcoPjsoura 199 0 69993.73 37.97 79.50
IV:PjdeppjcPjdeprecRefkcoPjsoura 99 0 69877.88 37.29 79.38
IV:ReisenPjdeppjcPjdeprecRefkcoPjsoura 199 0 69575.03 37.75 79.56
IV:RelapuyrPjdeppjcPjdeprecRefkcoPjsoura 499 0 69256.55 39.45 80.01


The resulting models are ones that hold the strongest predicting power as a whole. As we can see above, the strongest predicting model for the coarse multi predictor model here is

IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura

Which translates to

Repository last sync year + Dependent project count + Dependent repository count + Repository fork count = Project SourceRank Score Prediction



There is a four-way relationship between the listed variables that predicts a project’s SourceRank score. This model will reduce prediction uncertainty by 39%, bringing the percentage of data classified correctly up to 80% versus our baseline of 57%. Unfortunately, the degrees of freedom (Δdf) are very high in these models, showing high model complexity. This is undesirable, as a good model is one that captures information, reduces uncertainty, and has low complexity. Therefore, although a model with these four variables will allow us to predict a project’s SourceRank score more accurately, its complexity outweighs its accuracy.

The third model, which uses three variables instead of four, shows significantly less complexity in relation to our top model, with less than a percentage point lost in accuracy. If we were forced to pick one of these five models, the third model would be my choice, as we still see a reduction in uncertainty (37%), an increase in confidence (79%) and 99 degrees of freedom, which is still high, but lowest among these choices.

Moving on to our final model, we see a slight increase in the percentage of data that has been classified correctly vs our coarse model (80.53% vs 80.02%), along with a slight increase in the percentage of uncertainty reduction (40.26% vs 39.34%). However, the biggest improvement is in the degrees of freedom (Δdf). In our best performing coarse model, the degrees of freedom were very large (399), meaning the model was very complex. In our fine search, the degrees of freedom dropped to 15 for our best model, significantly reducing the complexity and therefore creating a more attractive and useful model.

Top 5 Fine Models - Multi Predictors
Model dDF Alpha dBIC %dH(DV) %C(Data)
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura 15 0 76536.82 40.26 80.53
IV:RelapuyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura 15 0 76241.70 40.11 80.45
IV:RelasyyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura 14 0 76204.92 40.08 80.61
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura 11 0 72468.52 38.10 79.71
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RefkcoPjsoura 11 0 69852.98 36.73 79.23
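The drop from 399 to 15 degrees of freedom follows from how the predictors are combined. The sketch below reproduces the dDF values in the tables above under two assumptions of mine: the bin counts per variable are inferred from the reported dDF values (they are not stated anywhere in the output), and Δdf is counted relative to the independence model with a two-bucket SourceRank.

```python
from math import prod

DV_STATES = 2  # SourceRank was binned into two buckets

# Bins per predictor after re-binning, inferred from the dDF columns (assumption).
BINS = {"Relasyyr": 4, "Pjdeppjc": 4, "Pjdeprec": 5,
        "Refkco": 5, "Restco": 5, "Recoco": 5}

def delta_df_joint(predictors):
    """Coarse-style model: all predictors interact with the DV in a single component."""
    return (prod(BINS[p] for p in predictors) - 1) * (DV_STATES - 1)

def delta_df_separate(predictors):
    """Fine-style model: each predictor relates to the DV in its own component."""
    return sum((BINS[p] - 1) * (DV_STATES - 1) for p in predictors)

coarse = delta_df_joint(["Relasyyr", "Pjdeppjc", "Pjdeprec", "Refkco"])
fine = delta_df_separate(["Pjdeppjc", "Pjdeprec", "Restco", "Recoco"])

print("coarse dDF:", coarse)  # 4 * 4 * 5 * 5 - 1 = 399
print("fine dDF:", fine)      # 3 + 4 + 4 + 4 = 15
```

A joint component multiplies the bin counts together, while separate components only add them, which is why the fine models stay small even with four predictors.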

After completing our three searches, we have found the best model to be:

IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura

Which translates to

(Dependent project count SourceRank prediction) + (Dependent repository count SourceRank prediction) + (Repository stars count SourceRank prediction) + (Repository contributor count SourceRank prediction) = Total Project SourceRank Score Prediction


Using Occam, we can now clearly see that the dependent project count, dependent repository count, repository stars count and repository contributor count all have a relationship with a project’s SourceRank score. If we consider each of these variables when debating a project’s prominence, we will be more likely to distinguish a prominent project (one that has a high SourceRank) from one that is not.




Conclusion

Now we have a better understanding of what makes up our definition of prominence - SourceRank scores - and how they relate to other variables in our dataset. So what are the most prominent projects? Below is a list of the top 30 projects across GitHub, GitLab, and Bitbucket, ranked by SourceRank score.



name SourceRank dep_prj_cnt dep_repo_cnt repo_contributors stars url
mocha 32 146367 352184 364 14849 https://github.com/mochajs/mocha
webpack 32 53682 271743 431 38539 https://github.com/webpack/webpack
babel-core 32 63321 367370 552 26369 https://github.com/babel/babel
lodash 32 62036 385284 268 30261 https://github.com/lodash/lodash
rails 31 10995 441244 2626 38926 https://github.com/rails/rails
babel-preset-es2015 31 68019 244325 552 26369 https://github.com/babel/babel
express 31 35598 622380 224 37098 https://github.com/expressjs/express
eslint 31 85726 250151 571 10803 https://github.com/eslint/eslint
chai 31 83435 225602 135 5135 https://github.com/chaijs/chai
bundler 30 56270 89704 586 4130 https://github.com/bundler/bundler
rake 30 64231 571371 149 1213 https://github.com/ruby/rake
activesupport 30 10640 481540 2626 38926 https://github.com/rails/rails
activerecord 30 4952 423932 2626 38926 https://github.com/rails/rails
react-addons-test-utils 30 8861 37031 800 90387 https://github.com/facebook/react
rimraf 30 35606 211517 19 2303 https://github.com/isaacs/rimraf
moment 30 17218 114023 473 35880 https://github.com/moment/moment
eslint-config-airbnb 30 15960 44501 370 67559 https://github.com/airbnb/javascript
react 30 39200 259763 800 90383 https://github.com/facebook/react
react-dom 30 29807 204617 800 90383 https://github.com/facebook/react
@angular/platform-browser-dynamic 30 7541 99447 574 33929 https://github.com/angular/angular
@angular/http 30 6177 97082 574 33929 https://github.com/angular/angular
@angular/platform-browser 30 8476 100578 574 33929 https://github.com/angular/angular
redux 30 7224 82500 540 38905 https://github.com/reactjs/redux
@angular/forms 30 6111 95243 574 33929 https://github.com/angular/angular
@angular/common 30 9309 102749 574 33929 https://github.com/angular/angular
@angular/compiler 30 8998 102428 574 33929 https://github.com/angular/angular
request 30 34726 238564 291 18811 https://github.com/request/request
@angular/router 30 5215 82697 574 33929 https://github.com/angular/angular
@angular/core 30 10172 102741 574 33929 https://github.com/angular/angular
babel-cli 30 58483 102471 552 26369 https://github.com/babel/babel



Taking a look at our findings of the most prominent projects, we see that all of them have very high dependent repository and project counts, which we know are great indicators of prominence, due to our earlier reconstructability analysis. Interestingly, none of these projects are sponsored by IaaS CSPs. While I knew I wasn’t ready to answer my manager’s original question regarding APIs, I knew I could at least gain some insights into IaaS CSP sponsored projects.

Using RStudio, I looked at specific repositories that I knew belonged to large IaaS CSPs like Google, Amazon, and Microsoft, along with IaaS focused foundations like the OpenStack Foundation and CNCF, which hosts the Kubernetes project. By creating a new column highlighting these specific CSPs and foundations, I was able to parse out the most prominent IaaS CSP sponsored projects to create the table below.
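One way to build such a column is to read the owner out of each repository URL and look it up in a hand-made mapping. The mapping and the regular expression below are illustrative assumptions of mine, not the exact rules I used in R, and a real list of owners would be much longer.

```python
import re

# Hypothetical owner-to-sponsor mapping.
SPONSORS = {
    "google": "Google",
    "microsoft": "Microsoft",
    "apache": "Apache Software Foundation",
    "kubernetes": "CNCF",
}

# Captures the owner segment of a GitHub URL, including "scm:git:" style prefixes.
OWNER_PATTERN = re.compile(r"github\.com[/:]([^/]+)/", re.IGNORECASE)

def sponsor_for(url):
    """Return the sponsor label for a repository URL, or None when no known owner matches."""
    match = OWNER_PATTERN.search(url or "")
    return SPONSORS.get(match.group(1).lower()) if match else None

urls = [
    "https://github.com/google/guava.git",
    "scm:git:https://github.com/apache/groovy.git",
    "https://github.com/microsoft/testfx",
    "https://github.com/mochajs/mocha",
]
labels = [sponsor_for(url) for url in urls]
print(labels)
```

Rows whose label is `None` fall outside the sponsor list and can be dropped before ranking by SourceRank.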



As we can see, the Google GitHub repository has 8 of the top 10 prominent projects on our list, with the top project being Guava. Guava is a set of core libraries for Java based projects that are used daily in production services by Google employees (find out more here). The Apache Software Foundation makes an appearance on the list with Apache Groovy, a language for the Java platform (find out more here). Last on the list is an offering from Microsoft, the MSTest V2 Test Framework. This is a “test framework with which to write tests targeting .NET Framework, .NET Core and ASP.NET Core on Windows, Linux, and Mac” (Source: GitHub.com. Find out more here).

Additionally, there are two API focused projects on our list - one for Node.js and one for PHP. Both projects are offered by Google, and both are client libraries that enable developers to work with Google APIs like YouTube on their servers. Both projects are in maintenance mode (and won’t be adding new features), and both do not offer the ability to work with Google Cloud APIs - developers will have to look elsewhere for that support.


name SourceRank dep_prj_cnt dep_repo_cnt repo_contributors stars url
com.google.guava:guava 25 3768 55701 139 22690 https://github.com/google/guava.git
googleapis 25 573 4513 76 6049 https://github.com/google/google-api-nodejs-client
com.google.inject:guice 24 732 8071 46 6370 scm:git:git://github.com/google/guice.git/guice
google/apiclient 22 331 2017 92 5017 https://github.com/google/google-api-php-client
org.codehaus.groovy:groovy 22 355 2262 248 2503 scm:git:https://github.com/apache/groovy.git
traceur 22 496 4427 59 7564 https://github.com/google/traceur-compiler
eslint-config-google 22 1578 4686 9 714 https://github.com/google/eslint-config-google
material-design-icons 22 269 2059 17 33810 https://github.com/google/material-design-icons
Google.Protobuf 21 134 700 375 24237 https://github.com/google/protobuf
MSTest.TestFramework 21 63 4001 20 188 https://github.com/microsoft/testfx




## Closing

As you can see, there is a wealth of information available in the Libraries.io dataset. Overall, GitHub hosts the vast majority of repositories, we depend on Google open source contributions very heavily, and project dependency counts truly seem to be an indicator of project prominence. In the next part of this series, we’ll take a look at projects that are at risk due to the Bus Factor, i.e. projects that are depended upon by many other packages, but only have a handful of contributors that commit to the project.

name SourceRank dep_prj_cnt dep_repo_cnt repo_contributors stars url
Config-Any 9 NA 0 NA NA http://git.shadowcat.co.uk/gitweb/gitweb.cgi?p=p5sagit/Config-Any.git
Data-Dumper 9 1597 0 NA NA NA
Test-NoTabs 9 944 0 0 0 https://github.com/karenetheridge/Test-NoTabs
File-Slurp 9 841 0 NA NA NA
Test-PAUSE-Permissions 9 716 0 NA NA NA
MIME-Base64 9 705 0 NA NA NA
Path-Class 9 685 0 NA NA NA
Digest-MD5 9 663 0 NA NA NA
CPAN-Meta 9 621 0 23 25 https://github.com/Perl-Toolchain-Gang/CPAN-Meta
Pod-Usage 9 603 0 NA NA NA
Test-NoWarnings 9 528 0 NA NA NA
@types/yargs 8 485 872 1039 14616 https://github.com/DefinitelyTyped/DefinitelyTyped
XML-LibXML 5 460 0 15 4 https://github.com/shlomif/perl-XML-LibXML
pkg-config 9 450 0 NA NA NA
psr/container 0 442 13346 25 2162 https://github.com/php-fig/container
Class-Accessor 9 414 0 NA NA NA
Test-Differences 9 374 0 NA NA NA
commons-configuration:commons-configuration 9 365 5871 NA NA scm:svn:http://svn.apache.org/repos/asf/commons/proper/configuration/tags/CONFIGURATION_1_10RC2
hsqldb:hsqldb 8 347 3411 NA NA scm:svn:http://anonsvn.jboss.org/repos/maven/poms/jboss-parent-pom/tags/jboss-parent-5/hsqldb
Module-Runtime 9 341 0 NA NA NA
@types/prop-types 0 340 495 1039 14616 https://github.com/DefinitelyTyped/DefinitelyTyped
Retyped.dom 9 326 3 NA NA NA
ExtUtils-CBuilder 8 324 0 11 16 http://github.com/Perl-Toolchain-Gang/ExtUtils-CBuilder
File-ShareDir-Install 9 320 0 1 0 https://github.com/Perl-Toolchain-Gang/File-ShareDir-Install
child_process 0 311 686 3 41 https://github.com/npm/security-holder
CONFIG 9 303 42 NA NA NA
File-Find-Rule 7 301 0 NA NA NA
DBIx-Class 9 300 0 126 124 NA
Data-Dump 7 297 0 NA NA NA
python 9 284 0 NA NA NA
LWP-Protocol-https 8 279 0 NA NA NA
Retyped.lodash 9 278 0 NA NA NA
Digest-SHA1 7 271 0 NA NA NA
Module-Load 7 267 0 NA NA NA
Catalyst-Runtime 9 261 0 NA NA NA
Test 8 257 10 NA NA NA
Retyped.node 9 250 2 NA NA NA
@types/inquirer 8 248 360 1039 14616 https://github.com/DefinitelyTyped/DefinitelyTyped
Test-MockObject 8 247 0 NA NA NA
Test-CheckDeps 9 245 0 2 2 https://github.com/Leont/test-checkdeps
esdoc-standard-plugin 7 236 243 5 54 https://github.com/esdoc/esdoc-plugins
Any-Moose 7 228 0 8 0 https://github.com/moose/Any-Moose
typo3/cms-core 0 227 271 215 1 https://github.com/TYPO3-CMS/core
YAML-Syck 9 219 0 NA NA NA
Params-Util 7 213 0 NA NA NA
javax.servlet:jstl 0 209 1073 NA NA NA
Module-Metadata 8 209 0 20 4 https://github.com/Perl-Toolchain-Gang/Module-Metadata
common-sense 7 207 0 NA NA NA
MRO-Compat 7 206 0 5 0 https://github.com/moose/MRO-Compat
autoconf 6 203 0 NA NA NA