Analyzing Open Source Project Prominence Series

This series covers the processes I have followed analyzing the latest releases of the Libraries.io dataset. If you are looking for a deep dive into any of the theory or mathematics behind my analysis, this is not the series for you. This post is primarily to introduce the Libraries.io dataset to people new to data analysis.

Part 1 - What is a SourceRank score?

Introduction to Git

I was new to both the open source community and data analysis, so I started my research with the basics. I knew that projects were made up of code that was held in repositories (like folders), which were typically hosted on sites like GitHub. Having that basic information, I began further research into the greater software development process.

Development

Teams of open source software developers build applications using Git based version control repository management services, which have become key components of modern software development workflows. These services provide automated distribution of open source software between the people who develop it (developers) and the people who use it (users). Version control ensures that the entire distribution channel is provided with either the newest, safest version of necessary code elements (for users), or the less stable, “in progress” code elements (for developers), in parallel. This means multiple versions of code must exist in tandem without causing usability issues to either users or developers.

Git workflow from a developer’s perspective. Image source: Lessons Learned Teaching Git

The code elements are packaged into Projects and Repositories, managed using Package Managers, and housed on repository hosting sites, the most popular being GitHub, GitLab and Bitbucket. Collections of software projects that are developed and evolve together - due to dependencies and shared developer communities - make up software ecosystems. To answer initial questions about project prominence as well as more targeted questions about the APIs of Infrastructure-as-a-Service (IaaS) cloud service providers (CSPs), I knew I’d need to find a dataset that focused on total ecosystems, i.e. projects, repositories, and package managers.

Data

After some trial and error, I discovered the Libraries.io dataset, which is the output of an open source project (recently acquired by Tidelift) that indexes over 25 million other open source projects across the web.

“Libraries.io gathers data from 36 package managers and 3 source code repositories. We track over 2.7m unique open source packages, 33m repositories and 235m interdependencies between them. This gives Libraries.io a unique understanding of open source software.” source: Zenodo

What does this mean? Every few months the Libraries.io team releases a dataset that can be analyzed to discover information about open source software ecosystems. The dataset released in March 2018 contains seven CSV files, listed below. Because I was specifically interested in projects and repositories, I decided to begin with the Projects with related Repository fields file.

CSV file listing for Libraries.io dataset
File name	Description
Projects	A project is a piece of software available on any one of the 34 package managers supported by Libraries.io.
Versions	A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool.
Tags	A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers.
Dependencies	Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to Version. Each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects.
Repositories	A Libraries.io repository represents a publically accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects, they are not distributed via a package manager and typically an application for end users rather than component to build upon.
Repository dependencies	A repository dependency is a dependency upon a Version from a package manager has been specified in a manifest file, either as a manually added dependency committed by a user or listed as a generated dependency listed in a lockfile that has been automatically generated by a package manager and committed.
Projects with related Repository fields	This is an alternative projects export that denormalizes a projects related source code repository inline to reduce the need to join between two data sets.

According to Benjamin Nickolls (creator of and contributor to Libraries.io):

“Projects and repositories are one of the key distinctions made in this dataset. Projects are typically the components distributed through one or more package managers. Repositories may belong to a project but most frequently they are consumers, incorporating projects into an application or service.” source: Opensource.com

The Projects with related Repository fields file has a total of 2,556,311 project records with 64 variables including status, keywords, repository size, language, stars count and more. Using these values, users can generate insights about projects hosted on the most popular repository hosting sites.
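A first look at a file like this can be sketched in a few lines. The example below is in Python with pandas rather than the RStudio workflow used for this analysis, and it reads a tiny made-up stand-in instead of the real export. The column names and rows are assumptions for illustration only; with the actual download you would pass the path of the CSV file to the same function.

```python
import io

import pandas as pd

# Made-up stand-in for the real export. Column names and rows are
# assumptions for illustration, not the actual Libraries.io header.
SAMPLE_CSV = """name,platform,repository_host_type,sourcerank,dependent_projects_count
proj_a,NPM,GitHub,28,5200
proj_b,Rubygems,GitHub,17,310
proj_c,CPAN,,9,45
proj_d,NPM,GitLab,4,0
"""


def load_projects(source):
    """Read a projects export (file path or file-like object) into a DataFrame."""
    return pd.read_csv(source, low_memory=False)


projects = load_projects(io.StringIO(SAMPLE_CSV))

# Number of records (rows) and variables (columns).
n_records, n_variables = projects.shape
print(f"{n_records} records, {n_variables} variables")

# Column types and missing-value counts give a quick feel for the data.
print(projects.dtypes)
print(projects.isna().sum())
```

On the full file, the same `shape` check is a quick way to confirm the record and variable counts quoted above.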

Exploratory Analysis

I began my analysis with some high level exploration of the data. I first wanted to get a sense of how the populations of projects were distributed across repository hosts, and what methods were being used to assess their popularity. To start, I found that the 2.5 million unique projects in the dataset span the three most popular repository hosting sites, with GitHub hosting the vast majority of both projects and repositories.

Interestingly, 25% of project IDs had unknown repository hosting sites. It is unclear where these projects are hosted. A quick look, however, gives us an idea of what platforms are typical when hosting is unknown.
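One way to run this kind of check is to treat a missing host as its own category and then count package managers within that group. This is a minimal Python/pandas sketch on made-up rows; the column names `platform` and `repository_host_type` are assumptions about how the fields are named once loaded.

```python
import pandas as pd

# Toy data: None marks a project whose repository host is unknown.
projects = pd.DataFrame({
    "platform": ["NPM", "NPM", "CPAN", "CPAN", "Maven", "Rubygems", "NPM", "CPAN"],
    "repository_host_type": [
        "GitHub", "GitHub", None, None, "GitHub", "GitHub", "GitLab", "Bitbucket",
    ],
})

# Share of projects per host, with missing hosts counted as "Unknown".
host_share = (
    projects["repository_host_type"]
    .fillna("Unknown")
    .value_counts(normalize=True)
    .mul(100)
    .round(1)
)
print(host_share)

# Which package managers do the projects without a known host come from?
no_host_platforms = projects.loc[
    projects["repository_host_type"].isna(), "platform"
].value_counts()
print(no_host_platforms)
```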


Even with 25% of the population hosted on unknown sites, it is clear that GitHub dominates the open source repository hosting market in terms of projects, repositories and (as evidenced below) dependencies. According to the charts below, when repositories and projects are dependent on the functionality of other repositories and projects, there is a greater chance that they are hosted on GitHub.

After looking at the entirety of the project population, it was clear that I needed a way to remove some of the “noise” within the data to differentiate popular projects from unpopular ones. Out of the 60+ variables available to me, what could be used to define success criteria in order to whittle the dataset down to a manageable set of projects that could be considered “prominent”? It seemed like there were two ways to score the success of a project: Repository Stars Count and SourceRank.

Field	Field Purpose
Repository Stars Count	Number of stars on the repository, only available for GitHub and GitLab.
SourceRank	Libraries.io defined score based on quality, popularity and community metrics.

To start my study, I chose to use an individual project’s SourceRank score as the definition of success because it spanned all repository hosts (GitLab, GitHub and Bitbucket), as opposed to stars count, which is not a feature on Bitbucket. Also, the SourceRank score is more granular, in that it is specific to each unique project, whereas stars count is specific to a full repository (which could host many projects). According to the SourceRank repository, SourceRank “…is the metric that Libraries.io calculates for each project to produce a number that can be used for sorting in lists and weighting in search results as well as encouraging good practises in open source projects to improve quality and discoverability.” source: GitHub.com

SourceRank is the name for the algorithm that we use to index search results. The maximum score for SourceRank is currently around 30 points. Our analysis is broken down into: Code, Community, Distribution, Documentation, Usage.

source: Libraries.io

First, I looked at some general metrics in order to understand how the entire population of projects were distributed across the SourceRank scores, which range from 0 to 32 (higher is better). As we can see on the graphs below, 92.9% of all projects have SourceRank scores that fall below 10 points. In comparison, 92.3% of projects with SourceRank scores >= 10 hold the highest average number of dependent projects.

Figure 1

Figure 2

What does this tell us? Essentially, the majority of projects across GitHub, GitLab and Bitbucket - ~92% - depend heavily on a very small percentage - ~7% - of projects, which tend to have SourceRank scores >= 10. If any of these highly rated projects were to cease maintenance, a large portion of our software infrastructure would be affected. With this new data point in mind, I chose to focus my analysis on projects that had SourceRank scores of 10 or above. Doing this allowed me to remove the noise of less useful or “dead” repositories, leaving me with a dataset consisting of roughly 200,000 records, most of which were hosted on GitHub.
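The split at a SourceRank of 10 can be sketched as follows. This is an illustrative Python/pandas version on made-up numbers, not the original RStudio code; the column names `sourcerank` and `dependent_projects_count` are assumed to match the fields described in this post.

```python
import numpy as np
import pandas as pd

# Made-up scores and dependent counts, for illustration only.
projects = pd.DataFrame({
    "name": ["a", "b", "c", "d", "e", "f", "g", "h", "i", "j"],
    "sourcerank": [2, 3, 5, 7, 8, 9, 4, 6, 12, 25],
    "dependent_projects_count": [0, 0, 1, 0, 2, 3, 0, 1, 40, 900],
})

THRESHOLD = 10  # the cut-off used in this post

# Label each project by which side of the threshold it falls on.
projects["group"] = np.where(
    projects["sourcerank"] >= THRESHOLD, "sourcerank >= 10", "sourcerank < 10"
)

# Per group: how many projects, and how many dependents they have on average.
summary = projects.groupby("group").agg(
    n_projects=("name", "size"),
    mean_dependent_projects=("dependent_projects_count", "mean"),
)
summary["share_pct"] = 100 * summary["n_projects"] / len(projects)
print(summary)

# Keep only the projects at or above the threshold for further analysis.
prominent = projects[projects["sourcerank"] >= THRESHOLD].reset_index(drop=True)
print(len(prominent), "projects kept")
```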

I then used RStudio to clean up the data to get a final sample size of 198,558 records and 46 variables. The table below lists the chosen final variables, along with the variable abbreviations, and their initial cardinalities; some variables were re-binned to lower cardinalities in the analysis.

Final Variable List
Long Name	Short Name	Cardinality	Definition
_id	PjID	0	The unique primary key of the project in the Libraries.io database.
repository_id	ReID	0	The unique primary key of the repository for this project in the Libraries.io database.
status	PjStat	4	Either Active, Deprecated, Unmaintained, Help Wanted, or Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by a Libraries.io user via the “project suggestion” feature.
repository_host_type	ReHost	3	Which website the repository is hosted on, either GitHub, GitLab or Bitbucket.
repository_forks	ReFork	3	Is the repository a fork of another?
repository_issues_enabled	ReIsEn	2	Is the bug tracker enabled for this repository?
repository_wiki_enabled	ReWiEn	2	Is the wiki enabled for this repository?
repository_pages_enabled	RePgEn	2	Is GitHub Pages enabled for this repository? Only possible for GitHub.
repository_status	ReStat	6	Either Active, Deprecated, Unmaintained, Help Wanted, or Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by a Libraries.io user via the “repo suggestion” feature.
repository_pull_requests_enabled	RePREn	4	Are pull requests enabled for this repository? Only available for GitLab repositories.
LatestRelPublishYear	PjLaRelY	16	Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
LatestRelPublishMonth	PjLaRelM	14	Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
LatestRelPublishDay	PjLaRelD	31	Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver).
RepoCreatedYear	ReCrYr	14	Timestamp of when the repository was created on the host.
RepoCreatedMonth	ReCrMo	14	Timestamp of when the repository was created on the host.
RepoCreatedDay	ReCrDa	31	Timestamp of when the repository was created on the host.
RepoUpdatedYear	ReUpYr	8	Timestamp of when the repository was last saved by Libraries.io.
RepoUpdatedMonth	ReUpMo	14	Timestamp of when the repository was last saved by Libraries.io.
RepoUPdatedDay	ReUpDa	31	Timestamp of when the repository was last saved by Libraries.io.
RepoLastPushYear	ReLaPuYr	7	Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastPushMonth	ReLaPuMo	13	Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastPushDay	ReLaPuDa	31	Timestamp of when the repository was last pushed to, only available for GitHub repositories.
RepoLastSyncYear	ReLaSyYr	6	Timestamp of when Libraries.io last synced the repository from the host API.
RepoLastSyncMonth	ReLaSyMo	15	Timestamp of when Libraries.io last synced the repository from the host API.
RepoLastSyncDay	ReLaSyDa	32	Timestamp of when Libraries.io last synced the repository from the host API.
platform_bin_PM	PjPlBPM	6	Name of the package manager the project is available on.
language_bin100	PjLaBA	37	Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
language_bin1000	PjLaBB	20	Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
language_bin9000	PjLaBC	7	Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub.
repository_language_bin100	ReLaBA	34	Primary programming language the project is written in, only available for GitHub and Bitbucket.
repository_language_bin1000	ReLaBB	20	Primary programming language the project is written in, only available for GitHub and Bitbucket.
repository_language_bin9000	ReLaBC	9	Primary programming language the project is written in, only available for GitHub and Bitbucket.
licenses_bin1000	PjLiBA	14	Comma separated array of SPDX identifiers for licenses declared in package manager meta data or submitted manually by a Libraries.io user via the “project suggestion” feature.
repo_licenses_bin1000	ReLiBA	12	SPDX identifier of the license of the repository, only available for GitHub repositories.
project_key_iaas	Pjkent	14	Key enterprise IaaS Cloud Service Providers
versions_count	PjVerCo	16	Number of published versions of the project found by Libraries.io.
dependent_projects_count	PjDepPjC	7	Number of other projects that declare the project as a dependency in one or more of their versions.
dependent_repositories_count	PjDepReC	12	The total count of open source repositories that list the project as a dependency as detected by Libraries.io.
repository_size	ReSize	10	Size of the repository in kilobytes, only available for GitHub and Bitbucket.
repository_stars_count	ReStCo	11	Number of stars on the repository, only available for GitHub and GitLab.
repository_forks_count	ReFkCo	12	Number of forks of this repository.
repository_open_issues_count	ReOpIsC	12	Number of open issues on the repository bug tracker, only available for GitHub and GitLab.
repository_watchers_count	ReWaC	12	Number of subscribers to all notifications for the repository, only available for GitHub and Bitbucket.
repository_contributors_count	ReCoCo	12	Number of unique contributors that have committed to the default branch.
repository_sourcerank	ReSouR	11	Libraries.io defined score based on quality, popularity and community metrics.
sourcerank	PjSouRa	2	Libraries.io defined score based on quality, popularity and community metrics.

Methodology

I still wanted a clearer understanding of what made up a project’s SourceRank score and which variables within my dataset had an effect on the score, so I could really understand what “prominent” in this case truly meant. To do this, I ran exploratory data mining models using Occam (an acronym for “Organizational Complexity Computation And Modeling”), a software package developed at Portland State University that specializes in a discrete, probabilistic modeling method called Reconstructability Analysis (RA). A good RA model is one that captures information (successful predictions), reduces uncertainty (%ΔH) and has low complexity (fewer degrees of freedom (df)). If you’d like to learn more about RA modeling methodology, you can read more about it here.

The aim of my study was to gain insight into which variables had a relationship with “prominent” projects - projects with SourceRank scores of 10 or above. This would involve running a directed reconstructability analysis search for a model that would predict a dependent variable (SourceRank score) from a set of predictors (the other variables).

Reconstructability Analysis Model Types

Using RStudio’s binning function, I bucketed the project SourceRank scores into two scoring intervals that were equal in content: scores that were 10-11 went into bucket 1, and scores that were 11-32 went into bucket 2. I ran three searches - two “coarse” and one “fine”; a final best model was selected from the fine search and is summarized below.
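Equal-content (equal-frequency) binning of this kind can be sketched in Python with `pandas.qcut`. This is an illustration on made-up scores; the R function actually used for the analysis is not named here, so treat the call below as a stand-in rather than the original step.

```python
import pandas as pd

# Made-up SourceRank scores, all at or above the cut-off of 10.
scores = pd.Series([10, 10, 10, 11, 11, 12, 14, 18, 25, 32], name="sourcerank")

# q=2 splits at the median so that each bucket holds about half the records.
# labels=False returns integer bucket codes (0 = lower half, 1 = upper half).
buckets = pd.qcut(scores, q=2, labels=False, duplicates="drop")

binned = pd.DataFrame({"sourcerank": scores, "bucket": buckets})
print(binned)
print(binned["bucket"].value_counts().sort_index())
```

With heavily tied scores, the two buckets will not always come out exactly equal in size, because every record with the same score has to land in the same bucket.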

Hypothesis

Ignoring repository SourceRank score and stars count (which I am assuming will be associated with a high project SourceRank score), I’m expecting the top 3-4 predictors for a higher SourceRank score to be the variables listed below.

Variable	Variable Purpose	Hypothesis
Repository_forks_count	Number of forks of this repository.	Forks are modified copies of repositories or projects. They allow developers to freely experiment with projects without affecting the original source code. I hypothesize that a higher number of forks signals higher developer engagement, which signals high need for (at least a portion of) that source code.
Dependent_projects_count	Number of other projects that declare the project as a dependency in one or more of their versions.	If a project has a high number of other projects and/or repositories that are dependent on it to function, it signals developers’ “trust” in the fact that the original project will function. This means that the original project most likely follows standards and protocols, is licensed properly, adheres to consistent versioning practices, and is generally a good community partner.
Dependent_repositories_count	The total count of open source repositories that list the project as a dependency as detected by Libraries.io.	Same hypothesis as for Dependent_projects_count above.
Repository_contributors_count	Number of unique contributors that have committed to the default branch.	A high number of contributors means that there is a large community of developers ready to deal with any issues that may arise with the usability of the project. More contributors allow for faster fixes, less “down time” and higher levels of community engagement.

For the initial coarse search, the top 10 single-predictor independent variables are listed with their complexities (Δdf), the % reduction of uncertainty in the dependent variable (%ΔH), the % correct (%C) and their BIC (ΔBIC) - a metric that takes accuracy and complexity into account - all measured from independence. These metrics will make it clear which of the variables within the dataset have the strongest relation to a SourceRank score. Each model’s predicting power will be compared to our independence model (baseline) %C of 57%.
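To make %ΔH and %C more concrete, here is a rough Python sketch of how these two quantities can be computed for a single binned predictor from a contingency table. It is a simplified illustration on made-up data, not Occam’s implementation, and it leaves out Δdf and ΔBIC. The function and column names are hypothetical.

```python
import numpy as np
import pandas as pd


def entropy_bits(counts):
    """Shannon entropy, in bits, of a vector of counts."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return float(-(p * np.log2(p)).sum())


def single_predictor_metrics(df, iv, dv):
    """Uncertainty reduction and percent correct for one predictor (iv) of dv."""
    counts = pd.crosstab(df[iv], df[dv]).to_numpy()
    n = counts.sum()

    # H(DV): uncertainty about the dependent variable before using the predictor.
    h_dv = entropy_bits(counts.sum(axis=0))
    # H(DV | IV): entropy within each predictor state, weighted by its frequency.
    h_dv_given_iv = sum(row.sum() / n * entropy_bits(row) for row in counts)

    return {
        "pct_dH": 100 * (h_dv - h_dv_given_iv) / h_dv,
        # Guess the most frequent DV state within each predictor state.
        "pct_C": 100 * counts.max(axis=1).sum() / n,
        # Baseline: always guess the overall most frequent DV state.
        "baseline_pct_C": 100 * counts.sum(axis=0).max() / n,
    }


# Made-up binned data: a dependent-count bin and a SourceRank bucket.
toy = pd.DataFrame({
    "dep_bin": ["low", "low", "low", "low", "high", "high", "high", "high"],
    "score_bucket": [0, 0, 0, 1, 1, 1, 1, 0],
})
metrics = single_predictor_metrics(toy, iv="dep_bin", dv="score_bucket")
print(metrics)
```

In this toy case the predictor lifts the share of correct guesses from 50% to 75%, which is the same kind of comparison made against the 57% baseline in the tables that follow.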

Top 10 Coarse Models - Single Predictor
Model	dDF	dBIC	%dH(DV)	%C(Data)
IV:PjdeprecPjsoura	4	40850.49	21.47	74.38
IV:ResourPjsoura	4	37776.27	19.85	72.05
IV:PjdeppjcPjsoura	3	23475.69	12.34	69.09
IV:RestcoPjsoura	4	13830.56	7.28	64.69
IV:RefkcoPjsoura	4	11838.29	6.24	63.86
IV:RecocoPjsoura	4	10707.52	5.64	63.05
IV:RewacPjsoura	4	7703.47	4.07	62.00
IV:ReopiscPjsoura	4	7186.09	3.80	61.26
IV:RelapuyrPjsoura	4	6360.29	3.36	60.67
IV:PjvercoPjsoura	4	6359.73	3.36	61.00

According to our results, a project’s dependent repository count - Pjdeprec - is the strongest single predictor for a project’s SourceRank score. Essentially, if we had to guess whether a SourceRank score for a project will be high or low, using only these variables, we’d be able to guess correctly most often by looking at a project’s dependent repository count. In fact, if we were to take this variable into account when guessing whether or not a project will have a high (>11) or low (<11) SourceRank score, we could reduce the baseline of uncertainty in our final guess by 21%, as evidenced above, boosting our %C prediction rating to 74%.

The 2nd strongest single predictor is the repository SourceRank score, which makes sense - a project with a high SourceRank score will probably live in a repository with a high SourceRank score. Surprisingly, repository stars count is the fourth best predictor, only reducing uncertainty by 7%. Stars count is an entirely community led metric, however, that is focused on the entire repository, so it may not be truly representative of a specific project’s actual popularity.

When we look at models with multiple predictors, we see largely expected results. This time, I’m looking at variables with greater granularity and I chose to remove the repository SourceRank variable - Resour - from our view. While project and repository scores differed enough for repository SourceRank score to be a possible predicting variable, I wanted to understand more about SourceRank in general, and therefore, felt that removing this very similar variable from the overall study was necessary.

Top 5 Coarse Models - Multiple Predictors
Model	dDF	dBIC	%dH(DV)	%C(Data)
IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura	399	70228.48	39.34	80.02
IV:ReprenPjdeppjcPjdeprecRefkcoPjsoura	199	69993.73	37.97	79.50
IV:PjdeppjcPjdeprecRefkcoPjsoura	99	69877.88	37.29	79.38
IV:ReisenPjdeppjcPjdeprecRefkcoPjsoura	199	69575.03	37.75	79.56
IV:RelapuyrPjdeppjcPjdeprecRefkcoPjsoura	499	69256.55	39.45	80.01

The resulting models are the ones that hold the strongest predicting power as a whole. As we can see above, the strongest model from the coarse multi-predictor search is

IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura
Which translates to
Repository last sync year + Dependent project count + Dependent repository count + Repository fork count = Project SourceRank Score Prediction

There is a four-way relationship between the listed variables that predicts a project’s SourceRank score. This model reduces prediction uncertainty by 39%, bringing the percentage of data classified correctly up to 80% versus our baseline of 57%. Unfortunately, the degrees of freedom (Δdf) are very high in these models, showing high model complexity. This is undesirable, as a good model is one that captures information, reduces uncertainty, and has low complexity. A model with these four variables would therefore predict a project’s SourceRank score more accurately, but its complexity outweighs its accuracy.

The third model, which uses three variables instead of four, shows significantly less complexity in relation to our top model, with less than a percentage point lost in accuracy. If we were forced to pick one of these five models, the third model would be my choice, as we still see a reduction in uncertainty (37%), a high percentage of data classified correctly (79%) and 99 degrees of freedom, which is still high, but the lowest among these choices.

Moving on to our final model, we see a slight increase in the percentage of data that has been classified correctly vs our coarse model (80.53% vs 80.02%), along with a slight increase in the percentage of uncertainty reduction (40.26% vs 39.34%). However, we see the biggest improvement in the degrees of freedom (Δdf). In our best performing coarse model, the degrees of freedom were very large (399), meaning the model was very complex. In our fine search, the degrees of freedom dropped to 15 for our best model, significantly reducing the complexity and therefore creating a more attractive and useful model.

Top 5 Fine Models - Multi Predictors
Model	dDF	dBIC	%dH(DV)	%C(Data)
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura	15	76536.82	40.26	80.53
IV:RelapuyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura	15	76241.70	40.11	80.45
IV:RelasyyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura	14	76204.92	40.08	80.61
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura	11	72468.52	38.10	79.71
IV:PjdeppjcPjsoura:PjdeprecPjsoura:RefkcoPjsoura	11	69852.98	36.73	79.23
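The trade-off between accuracy and complexity can be laid out side by side with a little arithmetic. The Python sketch below copies three rows from the result tables in this post and adds two derived columns. The `pct_dH_per_df` ratio is an ad hoc helper for this comparison only; it is not an Occam metric.

```python
import pandas as pd

# Rows copied from the coarse and fine multi-predictor tables in this post.
models = pd.DataFrame(
    [
        ("coarse, best", "IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura", 399, 39.34, 80.02),
        ("coarse, third", "IV:PjdeppjcPjdeprecRefkcoPjsoura", 99, 37.29, 79.38),
        (
            "fine, best",
            "IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura",
            15,
            40.26,
            80.53,
        ),
    ],
    columns=["search", "model", "dDF", "pct_dH", "pct_C"],
).set_index("search")

BASELINE_PCT_C = 57.0  # %C of the independence model

# Percentage points of correct classification gained over the baseline.
models["gain_over_baseline"] = (models["pct_C"] - BASELINE_PCT_C).round(2)
# Ad hoc ratio: uncertainty reduction bought per degree of freedom.
models["pct_dH_per_df"] = (models["pct_dH"] / models["dDF"]).round(3)

print(models[["dDF", "pct_dH", "pct_C", "gain_over_baseline", "pct_dH_per_df"]])
```

All three models gain a similar number of percentage points over the baseline, but the fine model gets there with far fewer degrees of freedom.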

After completing our three searches, we have found the best model to be:

IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura
Which translates to
(Dependent project count SourceRank prediction) + (Dependent repository count SourceRank prediction) + (Repository stars count SourceRank prediction) + (Repository contributor count SourceRank prediction) = Total Project SourceRank Score Prediction

Using Occam, we can now clearly see that the dependent project count, dependent repository count, repository stars count and repository contributor count all have a relationship with a project’s SourceRank score. If we consider each of these variables when debating a project’s prominence, we will be more likely to distinguish a prominent project (one that has a high SourceRank) from one that is not.

Conclusion

Now we have a better understanding of what makes up our definition of prominence - SourceRank scores - and how they relate to other variables in our dataset. So what are the most prominent projects? Below is a list of the top 30 projects across GitHub, GitLab, and Bitbucket, ranked by SourceRank score.

name	SourceRank	dep_prj_cnt	dep_repo_cnt	repo_contributors	stars	url
mocha	32	146367	352184	364	14849	https://github.com/mochajs/mocha
webpack	32	53682	271743	431	38539	https://github.com/webpack/webpack
babel-core	32	63321	367370	552	26369	https://github.com/babel/babel
lodash	32	62036	385284	268	30261	https://github.com/lodash/lodash
rails	31	10995	441244	2626	38926	https://github.com/rails/rails
babel-preset-es2015	31	68019	244325	552	26369	https://github.com/babel/babel
express	31	35598	622380	224	37098	https://github.com/expressjs/express
eslint	31	85726	250151	571	10803	https://github.com/eslint/eslint
chai	31	83435	225602	135	5135	https://github.com/chaijs/chai
bundler	30	56270	89704	586	4130	https://github.com/bundler/bundler
rake	30	64231	571371	149	1213	https://github.com/ruby/rake
activesupport	30	10640	481540	2626	38926	https://github.com/rails/rails
activerecord	30	4952	423932	2626	38926	https://github.com/rails/rails
react-addons-test-utils	30	8861	37031	800	90387	https://github.com/facebook/react
rimraf	30	35606	211517	19	2303	https://github.com/isaacs/rimraf
moment	30	17218	114023	473	35880	https://github.com/moment/moment
eslint-config-airbnb	30	15960	44501	370	67559	https://github.com/airbnb/javascript
react	30	39200	259763	800	90383	https://github.com/facebook/react
react-dom	30	29807	204617	800	90383	https://github.com/facebook/react
@angular/platform-browser-dynamic	30	7541	99447	574	33929	https://github.com/angular/angular
@angular/http	30	6177	97082	574	33929	https://github.com/angular/angular
@angular/platform-browser	30	8476	100578	574	33929	https://github.com/angular/angular
redux	30	7224	82500	540	38905	https://github.com/reactjs/redux
@angular/forms	30	6111	95243	574	33929	https://github.com/angular/angular
@angular/common	30	9309	102749	574	33929	https://github.com/angular/angular
@angular/compiler	30	8998	102428	574	33929	https://github.com/angular/angular
request	30	34726	238564	291	18811	https://github.com/request/request
@angular/router	30	5215	82697	574	33929	https://github.com/angular/angular
@angular/core	30	10172	102741	574	33929	https://github.com/angular/angular
babel-cli	30	58483	102471	552	26369	https://github.com/babel/babel

Taking a look at our findings of the most prominent projects, we see that all of them have very high dependent repository and project counts, which we know are great indicators of prominence, due to our earlier reconstructability analysis. Interestingly, none of these projects are sponsored by IaaS CSPs. While I knew I wasn’t ready to answer my manager’s original question regarding APIs, I knew I could at least gain some insights into IaaS CSP sponsored projects.

Using RStudio, I looked at specific repositories that I knew belonged to large IaaS CSPs like Google, Amazon, and Microsoft, along with IaaS focused foundations like the OpenStack Foundation and CNCF, which hosts the Kubernetes project. By creating a new column highlighting these specific CSPs and foundations, I was able to parse out the most prominent IaaS CSP sponsored projects to create the table below.
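A flag column like this can be built by pulling the repository owner out of each URL and looking it up in a small table of known sponsors. The Python sketch below is a guess at the general approach, not the original RStudio code. The owner-to-sponsor mapping is hypothetical and incomplete, and the rows are a few entries taken from the tables in this post.

```python
import re

import pandas as pd

# Hypothetical, incomplete mapping from GitHub owner to sponsor label.
SPONSOR_BY_OWNER = {
    "google": "Google",
    "aws": "Amazon",
    "microsoft": "Microsoft",
    "openstack": "OpenStack Foundation",
    "kubernetes": "CNCF",
}

# Captures the owner segment that follows "github.com/" or "github.com:".
OWNER_PATTERN = re.compile(r"github\.com[/:]([^/]+)/", re.IGNORECASE)


def sponsor_for(url):
    """Return the sponsor label for a repository URL, or None if unknown."""
    if not isinstance(url, str):
        return None  # missing URL
    match = OWNER_PATTERN.search(url)
    if match is None:
        return None
    return SPONSOR_BY_OWNER.get(match.group(1).lower())


projects = pd.DataFrame({
    "name": ["mocha", "com.google.guava:guava", "com.google.inject:guice",
             "MSTest.TestFramework", "Data-Dumper"],
    "sourcerank": [32, 25, 24, 21, 9],
    "url": [
        "https://github.com/mochajs/mocha",
        "https://github.com/google/guava.git",
        "scm:git:git://github.com/google/guice.git/guice",
        "https://github.com/microsoft/testfx",
        None,
    ],
})

# New column flagging projects that belong to a known CSP or foundation.
projects["project_key_iaas"] = projects["url"].map(sponsor_for)

sponsored = (
    projects.dropna(subset=["project_key_iaas"])
    .sort_values("sourcerank", ascending=False)
    .reset_index(drop=True)
)
print(sponsored[["name", "sourcerank", "project_key_iaas"]])
```

Matching on the owner segment of the URL only catches repositories hosted under the sponsor’s own organization, so projects that a company sponsors elsewhere would need extra entries in the mapping.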

As we can see, the Google GitHub repository has 8 of the top ten prominent projects on our list, with the top project being Guava. Guava is a set of core libraries for Java based projects that are used daily in production services by Google employees (find out more here). The Apache Software Foundation makes an appearance on the list with Apache Groovy, a language for the Java platform (find out more here). Last on the list is an offering from Microsoft, the MSTest V2 Test Framework. This is a “test framework with which to write tests targeting .NET Framework, .NET Core and ASP.NET Core on Windows, Linux, and Mac” (source: GitHub.com; find out more here).

Additionally, there are two API focused projects on our list - one for Node.js and one for PHP. Both projects are offered by Google, and both are client libraries that enable developers to work with Google APIs like YouTube on their servers. Both projects are in maintenance mode (and won’t be adding new features), and both do not offer the ability to work with Google Cloud APIs - developers will have to look elsewhere for that support.

name	SourceRank	dep_prj_cnt	dep_repo_cnt	repo_contributors	stars	url
com.google.guava:guava	25	3768	55701	139	22690	https://github.com/google/guava.git
googleapis	25	573	4513	76	6049	https://github.com/google/google-api-nodejs-client
com.google.inject:guice	24	732	8071	46	6370	scm:git:git://github.com/google/guice.git/guice
google/apiclient	22	331	2017	92	5017	https://github.com/google/google-api-php-client
org.codehaus.groovy:groovy	22	355	2262	248	2503	scm:git:https://github.com/apache/groovy.git
traceur	22	496	4427	59	7564	https://github.com/google/traceur-compiler
eslint-config-google	22	1578	4686	9	714	https://github.com/google/eslint-config-google
material-design-icons	22	269	2059	17	33810	https://github.com/google/material-design-icons
Google.Protobuf	21	134	700	375	24237	https://github.com/google/protobuf
MSTest.TestFramework	21	63	4001	20	188	https://github.com/microsoft/testfx

Closing

As you can see, there is a wealth of information available in the Libraries.io dataset. Overall, GitHub hosts the vast majority of repositories, we depend on Google open source contributions very heavily, and project dependency counts truly seem to be an indicator of project prominence. In the next part of this series, we’ll take a look at projects that are at risk due to the Bus Factor, i.e. projects that are depended upon by many other packages, but only have a handful of contributors that commit to the project.

name	SourceRank	dep_prj_cnt	dep_repo_cnt	repo_contributors	stars	url
Config-Any	9	NA	0	NA	NA	http://git.shadowcat.co.uk/gitweb/gitweb.cgi?p=p5sagit/Config-Any.git
Data-Dumper	9	1597	0	NA	NA	NA
Test-NoTabs	9	944	0	0	0	https://github.com/karenetheridge/Test-NoTabs
File-Slurp	9	841	0	NA	NA	NA
Test-PAUSE-Permissions	9	716	0	NA	NA	NA
MIME-Base64	9	705	0	NA	NA	NA
Path-Class	9	685	0	NA	NA	NA
Digest-MD5	9	663	0	NA	NA	NA
CPAN-Meta	9	621	0	23	25	https://github.com/Perl-Toolchain-Gang/CPAN-Meta
Pod-Usage	9	603	0	NA	NA	NA
Test-NoWarnings	9	528	0	NA	NA	NA
@types/yargs	8	485	872	1039	14616	https://github.com/DefinitelyTyped/DefinitelyTyped
XML-LibXML	5	460	0	15	4	https://github.com/shlomif/perl-XML-LibXML
pkg-config	9	450	0	NA	NA	NA
psr/container	0	442	13346	25	2162	https://github.com/php-fig/container
Class-Accessor	9	414	0	NA	NA	NA
Test-Differences	9	374	0	NA	NA	NA
commons-configuration:commons-configuration	9	365	5871	NA	NA	scm:svn:http://svn.apache.org/repos/asf/commons/proper/configuration/tags/CONFIGURATION_1_10RC2
hsqldb:hsqldb	8	347	3411	NA	NA	scm:svn:http://anonsvn.jboss.org/repos/maven/poms/jboss-parent-pom/tags/jboss-parent-5/hsqldb
Module-Runtime	9	341	0	NA	NA	NA
@types/prop-types	0	340	495	1039	14616	https://github.com/DefinitelyTyped/DefinitelyTyped
Retyped.dom	9	326	3	NA	NA	NA
ExtUtils-CBuilder	8	324	0	11	16	http://github.com/Perl-Toolchain-Gang/ExtUtils-CBuilder
File-ShareDir-Install	9	320	0	1	0	https://github.com/Perl-Toolchain-Gang/File-ShareDir-Install
child_process	0	311	686	3	41	https://github.com/npm/security-holder
CONFIG	9	303	42	NA	NA	NA
File-Find-Rule	7	301	0	NA	NA	NA
DBIx-Class	9	300	0	126	124	NA
Data-Dump	7	297	0	NA	NA	NA
python	9	284	0	NA	NA	NA
LWP-Protocol-https	8	279	0	NA	NA	NA
Retyped.lodash	9	278	0	NA	NA	NA
Digest-SHA1	7	271	0	NA	NA	NA
Module-Load	7	267	0	NA	NA	NA
Catalyst-Runtime	9	261	0	NA	NA	NA
Test	8	257	10	NA	NA	NA
Retyped.node	9	250	2	NA	NA	NA
@types/inquirer	8	248	360	1039	14616	https://github.com/DefinitelyTyped/DefinitelyTyped
Test-MockObject	8	247	0	NA	NA	NA
Test-CheckDeps	9	245	0	2	2	https://github.com/Leont/test-checkdeps
esdoc-standard-plugin	7	236	243	5	54	https://github.com/esdoc/esdoc-plugins
Any-Moose	7	228	0	8	0	https://github.com/moose/Any-Moose
typo3/cms-core	0	227	271	215	1	https://github.com/TYPO3-CMS/core
YAML-Syck	9	219	0	NA	NA	NA
Params-Util	7	213	0	NA	NA	NA
javax.servlet:jstl	0	209	1073	NA	NA	NA
Module-Metadata	8	209	0	20	4	https://github.com/Perl-Toolchain-Gang/Module-Metadata
common-sense	7	207	0	NA	NA	NA
MRO-Compat	7	206	0	5	0	https://github.com/moose/MRO-Compat
autoconf	6	203	0	NA	NA	NA