This series covers the processes I have followed analyzing the latest releases of the Libraries.io dataset. If you are looking for a deep dive into any of the theory or mathematics behind my analysis, this is not the series for you. This post is primarily to introduce the Libraries.io dataset to people new to data analysis.
About a year ago, I began working as a program manager in an open source software development team. I was very new to the open source community (and still am!), but I knew the ecosystem was growing and diversifying at a rapid pace due to companies and governments adopting open source software at rates that would have been incomprehensible 20 years ago. For example, Kubernetes is an open source containerization and orchestration software that was created by Google, grew a massive software ecosystem and is now embraced by enterprises around the globe like Microsoft, Amazon and Alibaba. In fact, Microsoft is so smitten by Kubernetes, the company recently deprecated its Azure Container Service in favor of the newly created Azure Kubernetes Service.
Companies are adopting and monetizing open source software for a variety of reasons, but how much do we truly know about open source development? When speaking to my manager about my interests in using data analysis within the context of open source development, she asked me a compelling question that I had no clue how to answer: “What are the most prominent open source APIs being offered on GitHub or other large repository hosts by Infrastructure as a Service (IaaS) Cloud Service Providers (CSPs) and what is their purpose?” The question only raised more questions for me, like “What other repo hosts are used on a large scale besides GitHub?”, “Why focus on APIs?” and “How do I define prominence?” I knew it would take time to formulate a confident response to the question; this is where my journey, and this multi-part series, began.
I was new to both the open source community and data analysis, so I started my research with the basics. I knew that projects were made up of code that was held in repositories (like folders), which were typically hosted on sites like GitHub. Having that basic information, I began further research into the greater software development process.
Teams of open source software developers build applications using Git-based version control repository management services, which have become key components of modern software development workflows. These services provide automated distribution of open source software between the people who develop it (developers) and the people who use it (users). Version control ensures that the entire distribution channel is provided with either the newest, safest version of necessary code elements (for users), or the less stable, “in progress” code elements (for developers), in parallel. This means multiple versions of code must exist in tandem without causing usability issues for either users or developers.
The code elements are packaged into Projects and Repositories, managed using Package Managers, and housed on repository hosting sites, the most popular being GitHub, GitLab and Bitbucket. Collections of software projects that are developed and evolve together, due to dependencies and shared developer communities, make up software ecosystems. To answer initial questions about project prominence as well as more targeted questions about IaaS CSP APIs, I knew I’d need to find a data set that focused on total ecosystems, i.e. projects, repositories, and package managers.
After some trial and error, I discovered the Libraries.io dataset, which is the output of an open source project (recently acquired by Tidelift) that indexes over 25 million other open source projects across the web.
What does this mean? Every few months the Libraries.io team releases a dataset that can be analyzed to discover information about open source software ecosystems. The dataset released in March 2018 contains seven CSV files, listed below. Because I was specifically interested in projects and repositories, I decided to begin with the Projects with related Repository fields file.
| File name | Description |
|---|---|
| Projects | A project is a piece of software available on any one of the 34 package managers supported by Libraries.io. |
| Versions | A Libraries.io version is an immutable published version of a Project from a package manager. Not all package managers have a concept of publishing versions, often relying directly on tags/branches from a revision control tool. |
| Tags | A tag is equivalent to a tag in a revision control system. Tags are sometimes used instead of Versions where a package manager does not use the concept of versions. Tags are often semantic version numbers. |
| Dependencies | Dependencies describe the relationship between a project and the software it builds upon. Dependencies belong to Version. Each Version can have different sets of dependencies. Dependencies point at a specific Version or range of versions of other projects. |
| Repositories | A Libraries.io repository represents a publicly accessible source code repository from either github.com, gitlab.com or bitbucket.org. Repositories are distinct from Projects: they are not distributed via a package manager and are typically applications for end users rather than components to build upon. |
| Repository dependencies | A repository dependency is a dependency upon a Version from a package manager that has been specified in a manifest file, either as a manually added dependency committed by a user or as a generated dependency listed in a lockfile that has been automatically generated by a package manager and committed. |
| Projects with related Repository fields | This is an alternative projects export that denormalizes a project's related source code repository inline to reduce the need to join between two data sets. |
The Projects with related Repository fields file has a total of 2,556,311 project records with 64 variables including status, keywords, repository size, language, stars count and more. Using these values, users can generate insights about projects hosted on the most popular repository hosting sites.
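As a rough illustration of that first look at the file, here is a minimal Python sketch that counts records and columns in a CSV export. I did my own work in RStudio, so this is not the code behind the numbers above; the sample rows and the snake_case column names are made up for the example.

```python
import csv
import io


def summarize_csv(handle):
    """Return (number of records, list of column names) for a CSV file handle."""
    reader = csv.DictReader(handle)
    row_count = sum(1 for _ in reader)  # streams rows, so large files are fine
    return row_count, list(reader.fieldnames or [])


# Tiny made-up sample standing in for the real export (hypothetical columns).
SAMPLE = io.StringIO(
    "name,platform,repository_host_type,sourcerank\n"
    "alpha,NPM,GitHub,12\n"
    "beta,Rubygems,GitLab,4\n"
    "gamma,CPAN,,9\n"
)

rows, columns = summarize_csv(SAMPLE)
print(rows, len(columns))
```

For the real file, the same function could be given a handle from `open(path, newline="", encoding="utf-8")`, where `path` points at wherever the export was saved.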
I began my analysis with some high level exploration of the data. I first wanted to get a sense of how the populations of projects were distributed across repository hosts, and what methods were being used to assess their popularity. To start, I found that the 2.5 million unique projects in the dataset span the three most popular repository hosting sites, with GitHub hosting the vast majority of both projects and repositories.
Interestingly, 25% of project IDs had unknown repository hosting sites. It is unclear where these projects are hosted. A quick look, however, gives us an idea of which platforms are typical when hosting is unknown.
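A breakdown like this can be sketched in a few lines of Python. The field names `repository_host_type` and `platform`, and the sample records, are assumptions for illustration; a blank host is treated as “Unknown”.

```python
from collections import Counter


def host_shares(records, host_field="repository_host_type"):
    """Return each host's share of projects; blank hosts count as 'Unknown'."""
    counts = Counter((rec.get(host_field) or "Unknown") for rec in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {host: count / total for host, count in counts.items()}


def platforms_without_host(records, host_field="repository_host_type",
                           platform_field="platform"):
    """Count package managers among projects that have no known host."""
    return Counter(rec[platform_field] for rec in records
                   if not rec.get(host_field))


# Made-up records, not taken from the dataset.
sample = [
    {"name": "alpha", "platform": "NPM", "repository_host_type": "GitHub"},
    {"name": "beta", "platform": "NPM", "repository_host_type": "GitHub"},
    {"name": "gamma", "platform": "Rubygems", "repository_host_type": "GitLab"},
    {"name": "delta", "platform": "CPAN", "repository_host_type": ""},
]

shares = host_shares(sample)
no_host = platforms_without_host(sample)
print(shares, no_host)
```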
Even with 25% of the population hosted on unknown sites, it is clear that GitHub dominates the open source repository hosting market in terms of projects, repositories and (as evidenced below) dependencies. According to the charts below, when repositories and projects are dependent on the functionality of other repositories and projects, there is a greater chance that they are hosted on GitHub.
After looking at the entirety of the project population, it was clear that I needed a way to remove some of the “noise” within the data to differentiate popular projects from unpopular ones. Out of the 60+ variables available to me, what could be used to define success criteria in order to whittle the dataset down to a manageable set of projects that could be considered “prominent”? It seemed like there were two ways to score the success of a project: Repository Stars Count and SourceRank.
| Field | Field Purpose |
|---|---|
| Repository Stars Count | Number of stars on the repository, only available for GitHub and GitLab. |
| SourceRank | Libraries.io defined score based on quality, popularity and community metrics. |
To start my study, I chose to use an individual project’s SourceRank score as the definition of success because it spanned all repository hosts (GitLab, GitHub and Bitbucket), as opposed to stars count, which is not a feature on Bitbucket. Also, the SourceRank score is more granular, in that it is specific to each unique project, whereas stars count is specific to a full repository (which could host many projects). According to the SourceRank repository, SourceRank “…is the metric that Libraries.io calculates for each project to produce a number that can be used for sorting in lists and weighting in search results as well as encouraging good practises in open source projects to improve quality and discoverability.” source: GitHub.com
First, I looked at some general metrics in order to understand how the entire population of projects was distributed across the SourceRank scores, which range from 0 to 32 (higher is better). As we can see on the graphs below, 92.9% of all projects have SourceRank scores that fall below 10 points. In comparison, 92.3% of projects with SourceRank scores >= 10 hold the highest average number of dependent projects.
What does this tell us? Essentially, the majority of projects across GitHub, GitLab and Bitbucket (~92%) depend heavily on a very small percentage (~7%) of projects, which tend to have SourceRank scores >= 10. If any of these highly rated projects were to cease maintenance, a large portion of our software infrastructure would be affected. Taking this new data point in mind, I chose to focus my analysis on projects that had SourceRank scores of 10 or above. Doing this allowed me to remove the noise of less useful or “dead” repositories, leaving me with a dataset consisting of roughly 200,000 records, most of which were hosted on GitHub.
I then used RStudio to clean up the data to get a final sample size of 198,558 records and 46 variables. The table below lists the chosen final variables, along with the variable abbreviations, and their initial cardinalities; some variables were re-binned to lower cardinalities in the analysis.
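The filtering step itself is simple. Here is a hedged Python sketch of a SourceRank cutoff; the `sourcerank` field name and the sample records are assumptions, and my actual cleanup was done in RStudio.

```python
def split_by_sourcerank(records, threshold=10, field="sourcerank"):
    """Keep records scoring at or above the threshold; also return their share."""
    kept = [rec for rec in records if int(rec[field]) >= threshold]
    share = len(kept) / len(records) if records else 0.0
    return kept, share


# Made-up records; scores arrive as strings when read from a CSV file.
sample = [
    {"name": "alpha", "sourcerank": "32"},
    {"name": "beta", "sourcerank": "12"},
    {"name": "gamma", "sourcerank": "9"},
    {"name": "delta", "sourcerank": "4"},
    {"name": "epsilon", "sourcerank": "0"},
]

prominent, share = split_by_sourcerank(sample)
print([rec["name"] for rec in prominent], share)
```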
| Long Name | Short Name | Cardinality | Definition |
|---|---|---|---|
| _id | PjID | 0 | The unique primary key of the project in the Libraries.io database. |
| repository_id | ReID | 0 | The unique primary key of the repository for this project in the Libraries.io database. |
| status | PjStat | 4 | Either Active, Deprecated, Unmaintained, Help Wanted, Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by Libraries.io user via “project suggestion” feature. |
| repository_host_type | ReHost | 3 | Which website the repository is hosted on, either GitHub, GitLab or Bitbucket. |
| repository_forks | ReFork | 3 | Is the repository a fork of another? |
| repository_issues_enabled | ReIsEn | 2 | Is the bug tracker enabled for this repository? |
| repository_wiki_enabled | ReWiEn | 2 | Is the wiki enabled for this repository? |
| repository_pages_enabled | RePgEn | 2 | Is GitHub Pages enabled for this repository? Only possible for GitHub. |
| repository_status | ReStat | 6 | Either Active, Deprecated, Unmaintained, Help Wanted, Removed; no value also means active. Updated when detected by Libraries.io or submitted manually by Libraries.io user via “repo suggestion” feature. |
| repository_pull_requests_enabled | RePREn | 4 | Are pull requests enabled for this repository? Only available for GitLab repositories. |
| LatestRelPublishYear | PjLaRelY | 16 | Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver). |
| LatestRelPublishMonth | PjLaRelM | 14 | Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver). |
| LatestRelPublishDay | PjLaRelD | 31 | Time of the latest release detected by Libraries.io (ordered by semver, falling back to publish date for invalid semver). |
| RepoCreatedYear | ReCrYr | 14 | Timestamp of when the repository was created on the host. |
| RepoCreatedMonth | ReCrMo | 14 | Timestamp of when the repository was created on the host. |
| RepoCreatedDay | ReCrDa | 31 | Timestamp of when the repository was created on the host. |
| RepoUpdatedYear | ReUpYr | 8 | Timestamp of when the repository was last saved by Libraries.io. |
| RepoUpdatedMonth | ReUpMo | 14 | Timestamp of when the repository was last saved by Libraries.io. |
| RepoUPdatedDay | ReUpDa | 31 | Timestamp of when the repository was last saved by Libraries.io. |
| RepoLastPushYear | ReLaPuYr | 7 | Timestamp of when the repository was last pushed to, only available for GitHub repositories. |
| RepoLastPushMonth | ReLaPuMo | 13 | Timestamp of when the repository was last pushed to, only available for GitHub repositories. |
| RepoLastPushDay | ReLaPuDa | 31 | Timestamp of when the repository was last pushed to, only available for GitHub repositories. |
| RepoLastSyncYear | ReLaSyYr | 6 | Timestamp of when Libraries.io last synced the repository from the host API. |
| RepoLastSyncMonth | ReLaSyMo | 15 | Timestamp of when Libraries.io last synced the repository from the host API. |
| RepoLastSyncDay | ReLaSyDa | 32 | Timestamp of when Libraries.io last synced the repository from the host API. |
| platform_bin_PM | PjPlBPM | 6 | Name of the package manager the project is available on. |
| language_bin100 | PjLaBA | 37 | Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub. |
| language_bin1000 | PjLaBB | 20 | Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub. |
| language_bin9000 | PjLaBC | 7 | Primary programming language the project is written in, pulled from the repository if source is hosted on GitHub. |
| repository_language_bin100 | ReLaBA | 34 | Primary programming language the project is written in, only available for GitHub and Bitbucket. |
| repository_language_bin1000 | ReLaBB | 20 | Primary programming language the project is written in, only available for GitHub and Bitbucket. |
| repository_language_bin9000 | ReLaBC | 9 | Primary programming language the project is written in, only available for GitHub and Bitbucket. |
| licenses_bin1000 | PjLiBA | 14 | Comma separated array of SPDX identifiers for licenses declared in package manager meta data or submitted manually by Libraries.io user via “project suggestion” feature. |
| repo_licenses_bin1000 | ReLiBA | 12 | SPDX identifier of the license of the repository, only available for GitHub repositories. |
| project_key_iaas | Pjkent | 14 | Key enterprise IaaS Cloud Service Providers |
| versions_count | PjVerCo | 16 | Number of published versions of the project found by Libraries.io. |
| dependent_projects_count | PjDepPjC | 7 | Number of other projects that declare the project as a dependency in one or more of their versions. |
| dependent_repositories_count | PjDepReC | 12 | The total count of open source repositories that list the project as a dependency as detected by Libraries.io. |
| repository_size | ReSize | 10 | Size of the repository in kilobytes, only available for GitHub and Bitbucket. |
| repository_stars_count | ReStCo | 11 | Number of stars on the repository, only available for GitHub and GitLab. |
| repository_forks_count | ReFkCo | 12 | Number of forks of this repository. |
| repository_open_issues_count | ReOpIsC | 12 | Number of open issues on the repository bug tracker, only available for GitHub and GitLab. |
| repository_watchers_count | ReWaC | 12 | Number of subscribers to all notifications for the repository, only available for GitHub and Bitbucket. |
| repository_contributors_count | ReCoCo | 12 | Number of unique contributors that have committed to the default branch. |
| repository_sourcerank | ReSouR | 11 | Libraries.io defined score based on quality, popularity and community metrics. |
| sourcerank | PjSouRa | 2 | Libraries.io defined score based on quality, popularity and community metrics. |
I still wanted a clearer understanding of what made up a project’s SourceRank score and which variables within my dataset had an effect on the score, so I could really understand what “prominent” in this case truly meant. To do this, I ran exploratory data mining models using Occam (an acronym for “Organizational Complexity Computation And Modeling”), a software package developed at Portland State University that specializes in a discrete, probabilistic modeling method called Reconstructability Analysis (RA). A good RA model is one that captures information (successful predictions), reduces uncertainty (%ΔH) and has low complexity (fewer degrees of freedom (df)). If you’d like to learn more about RA modeling methodology, you can read more about it here.
The aim of my study was to gain insight into which variables had a relationship with “prominent” projects - projects with SourceRank scores of 10 or above. This would involve running a directed reconstructability analysis search for a model that would predict a dependent variable (SourceRank score) from a set of predictors (the other variables).
Using RStudio’s binning function, I bucketed the project SourceRank scores into two scoring intervals that were equal in content: scores that were 10-11 went into bucket one, and scores that were 11-32 went into bucket two. I ran three searches - two “coarse”, and one “fine”; a final best model was selected from the fine search and summarized below.
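To show what “equal in content” means, here is a small Python sketch that splits scores at the median so each bucket holds about half the records. This is not the R binning function I used; the sample scores are made up, and sending values equal to the cut point to bucket two is an assumption of the sketch.

```python
import statistics


def two_equal_content_bins(scores):
    """Split scores at the median into bucket 1 (below) and bucket 2 (at or above)."""
    cut = statistics.median(scores)
    labels = [1 if score < cut else 2 for score in scores]
    return cut, labels


# Made-up SourceRank scores, all at or above the cutoff of 10.
sample_scores = [10, 10, 10, 11, 12, 15, 20, 32]

cut, labels = two_equal_content_bins(sample_scores)
print(cut, labels)
```

With many tied scores, the two buckets will only be approximately equal in size, which is worth checking before modeling.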
Ignoring repository SourceRank score and stars count (which I am assuming will be associated with a high project SourceRank score), I’m expecting the top 3-4 predictors for a higher SourceRank score to be the variables listed below.
| Variable | Variable Purpose | Hypothesis |
|---|---|---|
| Repository_forks_count | Number of forks of this repository. | Forks are modified copies of repositories or projects. They allow developers to freely experiment with projects without affecting the original source code. I hypothesize that a higher number of forks signals higher developer engagement, which signals high need for (at least a portion of) that source code. |
| Dependent_projects_count | Number of other projects that declare the project as a dependency in one or more of their versions. | If a project has a high number of other projects and/or repositories that are dependent on it to function, it signals developers’ “trust” in the fact that the original project will function. This means that the original project most likely follows standards and protocols, is licensed properly, adheres to consistent versioning practices, and is generally a good community partner. |
| Dependent_repositories_count | The total count of open source repositories that list the project as a dependency as detected by Libraries.io. | |
| Repository_contributors_count | Number of unique contributors that have committed to the default branch. | A high number of contributors means that there is a large community of developers ready to deal with any issues that may arise with the usability of the project. More contributors allow for faster fixes, less “down time” and higher levels of community engagement. |
For the initial coarse search, the top 10 single predicting independent variables have been listed with their complexities (Δdf), the % reduction of uncertainty in the dependent variable (%ΔH), the % correct (%C) and their BIC (ΔBIC) - a metric that takes accuracy and complexity into account - from independence. These metrics will make it clear which of the variables within the data set have the strongest relation to a SourceRank score. Each model’s predictive power will be compared to our independence model (baseline) %C of 57%.
| Model | dDF | Alpha | dBIC | %dH(DV) | %C(Data) |
|---|---|---|---|---|---|
| IV:PjdeprecPjsoura | 4 | 0 | 40850.49 | 21.47 | 74.38 |
| IV:ResourPjsoura | 4 | 0 | 37776.27 | 19.85 | 72.05 |
| IV:PjdeppjcPjsoura | 3 | 0 | 23475.69 | 12.34 | 69.09 |
| IV:RestcoPjsoura | 4 | 0 | 13830.56 | 7.28 | 64.69 |
| IV:RefkcoPjsoura | 4 | 0 | 11838.29 | 6.24 | 63.86 |
| IV:RecocoPjsoura | 4 | 0 | 10707.52 | 5.64 | 63.05 |
| IV:RewacPjsoura | 4 | 0 | 7703.47 | 4.07 | 62.00 |
| IV:ReopiscPjsoura | 4 | 0 | 7186.09 | 3.80 | 61.26 |
| IV:RelapuyrPjsoura | 4 | 0 | 6360.29 | 3.36 | 60.67 |
| IV:PjvercoPjsoura | 4 | 0 | 6359.73 | 3.36 | 61.00 |
According to our results, a project’s dependent repository count - Pjdeprec - is the strongest single predictor for a project’s SourceRank score. Essentially, if we had to guess whether a SourceRank score for a project will be high or low, using only these variables, we’d be able to guess correctly most often by looking at a project’s dependent repository count. In fact, if we were to take this variable into account when guessing whether or not a project will have a high (>11) or low (<11) SourceRank score, we could reduce the baseline of uncertainty in our final guess by 21%, as evidenced above, boosting our %C prediction rating to 74%.
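To make %ΔH and %C more concrete, here is a minimal Python sketch of how I understand the two measures for a single predictor: the percent reduction in Shannon entropy of the dependent variable once the predictor is known, and the percent correct when always guessing the most common outcome. This is an illustration on made-up data, not Occam’s implementation, and Occam’s exact calculations may differ in details.

```python
import math
from collections import Counter, defaultdict


def entropy(counts):
    """Shannon entropy in bits for a collection of category counts."""
    counts = list(counts)
    total = sum(counts)
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)


def single_predictor_metrics(pairs):
    """pairs: (predictor state, outcome state) tuples.

    Returns (% uncertainty reduction, baseline % correct, model % correct).
    """
    pairs = list(pairs)
    n = len(pairs)
    outcome_counts = Counter(outcome for _, outcome in pairs)
    by_predictor = defaultdict(Counter)
    for state, outcome in pairs:
        by_predictor[state][outcome] += 1

    h_outcome = entropy(outcome_counts.values())
    # Entropy of the outcome left over once the predictor state is known.
    h_conditional = sum(
        sum(c.values()) / n * entropy(c.values()) for c in by_predictor.values()
    )
    pct_dh = 100 * (h_outcome - h_conditional) / h_outcome if h_outcome else 0.0
    # Baseline: always guess the most common outcome overall.
    baseline_c = 100 * max(outcome_counts.values()) / n
    # Model: guess the most common outcome within each predictor state.
    model_c = 100 * sum(max(c.values()) for c in by_predictor.values()) / n
    return pct_dh, baseline_c, model_c


# Made-up data: dependent repository count bin vs. SourceRank bucket.
sample = (
    [("few", "low")] * 3 + [("few", "high")] * 1
    + [("many", "low")] * 1 + [("many", "high")] * 3
)

pct_dh, baseline_c, model_c = single_predictor_metrics(sample)
print(round(pct_dh, 2), baseline_c, model_c)
```

In this toy sample the baseline guess is right half the time, while knowing the predictor raises that to three out of four, the same kind of comparison as 57% versus 74% above.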
The 2nd strongest single predictor is the repository SourceRank score, which makes sense - a project with a high SourceRank score will probably live in a repository with a high SourceRank score. Surprisingly, repository stars count is the fourth best predictor, only reducing uncertainty by 7%. Stars count is an entirely community led metric, however, that is focused on the entire repository, so it may not be truly representative of a specific project’s actual popularity.
When we look at models with multiple predictors, we see largely expected results. This time, I’m looking at variables with greater granularity and I chose to remove the repository SourceRank variable - Resour - from our view. While project and repository scores differed enough for repository SourceRank score to be a possible predicting variable, I wanted to understand more about SourceRank in general, and therefore, felt that removing this very similar variable from the overall study was necessary.
| Model | dDF | Alpha | dBIC | %dH(DV) | %C(Data) |
|---|---|---|---|---|---|
| IV:RelasyyrPjdeppjcPjdeprecRefkcoPjsoura | 399 | 0 | 70228.48 | 39.34 | 80.02 |
| IV:ReprenPjdeppjcPjdeprecRefkcoPjsoura | 199 | 0 | 69993.73 | 37.97 | 79.50 |
| IV:PjdeppjcPjdeprecRefkcoPjsoura | 99 | 0 | 69877.88 | 37.29 | 79.38 |
| IV:ReisenPjdeppjcPjdeprecRefkcoPjsoura | 199 | 0 | 69575.03 | 37.75 | 79.56 |
| IV:RelapuyrPjdeppjcPjdeprecRefkcoPjsoura | 499 | 0 | 69256.55 | 39.45 | 80.01 |
There is a four-way relationship between the listed variables to predict a project’s SourceRank score. This model will reduce prediction uncertainty by 39%, bringing the percentage of data classified correctly up to 80% versus our baseline of 57%. Unfortunately, the degrees of freedom (Δdf) are very high in these models, showing high model complexity. This is undesirable, as a good model is one that captures information, reduces uncertainty, and has low complexity. Therefore, using a model with these four variables will allow us to predict a project’s SourceRank score more accurately; however, the complexity outweighs the model’s accuracy.
The third model, which uses three variables instead of four, shows significantly less complexity in relation to our top model, with less than a percentage point lost in accuracy. If we were forced to pick one of these five models, the third model would be my choice, as we still see a reduction in uncertainty (37%), an increase in the percentage classified correctly (79%) and 99 degrees of freedom, which is still high, but lowest among these choices.
Moving on to our final model, we see a slight increase in the percentage of data that has been classified correctly vs our coarse model (80.53% vs 80.02%), along with a slight increase in the percentage of uncertainty reduction (40.26% vs 39.34%). However, we see the biggest improvement in the degrees of freedom (Δdf). In our best performing coarse model, the degrees of freedom were very large (399), meaning the model was very complex. In our fine search, our degrees of freedom dropped to 15 for our best model, significantly reducing the complexity, and therefore, creating a more attractive and useful model.
| Model | dDF | Alpha | dBIC | %dH(DV) | %C(Data) |
|---|---|---|---|---|---|
| IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura:RecocoPjsoura | 15 | 0 | 76536.82 | 40.26 | 80.53 |
| IV:RelapuyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura | 15 | 0 | 76241.70 | 40.11 | 80.45 |
| IV:RelasyyrPjsoura:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura | 14 | 0 | 76204.92 | 40.08 | 80.61 |
| IV:PjdeppjcPjsoura:PjdeprecPjsoura:RestcoPjsoura | 11 | 0 | 72468.52 | 38.10 | 79.71 |
| IV:PjdeppjcPjsoura:PjdeprecPjsoura:RefkcoPjsoura | 11 | 0 | 69852.98 | 36.73 | 79.23 |
Using Occam, we can now clearly see that the dependent project count, dependent repository count, repository stars count and repository contributor count all have a relationship with a project’s SourceRank score. If we consider each of these variables when debating a project’s prominence, we will be more likely to distinguish a prominent project (one that has a high SourceRank) from one that is not.
Now we have a better understanding of what makes up our definition of prominence - SourceRank scores - and how they relate to other variables in our dataset. So what are the most prominent projects? We have a list of the top 30 projects across GitHub, GitLab, and Bitbucket, ranked by SourceRank scores below.
| name | SourceRank | dep_prj_cnt | dep_repo_cnt | repo_contributors | stars | url |
|---|---|---|---|---|---|---|
| mocha | 32 | 146367 | 352184 | 364 | 14849 | https://github.com/mochajs/mocha |
| webpack | 32 | 53682 | 271743 | 431 | 38539 | https://github.com/webpack/webpack |
| babel-core | 32 | 63321 | 367370 | 552 | 26369 | https://github.com/babel/babel |
| lodash | 32 | 62036 | 385284 | 268 | 30261 | https://github.com/lodash/lodash |
| rails | 31 | 10995 | 441244 | 2626 | 38926 | https://github.com/rails/rails |
| babel-preset-es2015 | 31 | 68019 | 244325 | 552 | 26369 | https://github.com/babel/babel |
| express | 31 | 35598 | 622380 | 224 | 37098 | https://github.com/expressjs/express |
| eslint | 31 | 85726 | 250151 | 571 | 10803 | https://github.com/eslint/eslint |
| chai | 31 | 83435 | 225602 | 135 | 5135 | https://github.com/chaijs/chai |
| bundler | 30 | 56270 | 89704 | 586 | 4130 | https://github.com/bundler/bundler |
| rake | 30 | 64231 | 571371 | 149 | 1213 | https://github.com/ruby/rake |
| activesupport | 30 | 10640 | 481540 | 2626 | 38926 | https://github.com/rails/rails |
| activerecord | 30 | 4952 | 423932 | 2626 | 38926 | https://github.com/rails/rails |
| react-addons-test-utils | 30 | 8861 | 37031 | 800 | 90387 | https://github.com/facebook/react |
| rimraf | 30 | 35606 | 211517 | 19 | 2303 | https://github.com/isaacs/rimraf |
| moment | 30 | 17218 | 114023 | 473 | 35880 | https://github.com/moment/moment |
| eslint-config-airbnb | 30 | 15960 | 44501 | 370 | 67559 | https://github.com/airbnb/javascript |
| react | 30 | 39200 | 259763 | 800 | 90383 | https://github.com/facebook/react |
| react-dom | 30 | 29807 | 204617 | 800 | 90383 | https://github.com/facebook/react |
| @angular/platform-browser-dynamic | 30 | 7541 | 99447 | 574 | 33929 | https://github.com/angular/angular |
| @angular/http | 30 | 6177 | 97082 | 574 | 33929 | https://github.com/angular/angular |
| @angular/platform-browser | 30 | 8476 | 100578 | 574 | 33929 | https://github.com/angular/angular |
| redux | 30 | 7224 | 82500 | 540 | 38905 | https://github.com/reactjs/redux |
| @angular/forms | 30 | 6111 | 95243 | 574 | 33929 | https://github.com/angular/angular |
| @angular/common | 30 | 9309 | 102749 | 574 | 33929 | https://github.com/angular/angular |
| @angular/compiler | 30 | 8998 | 102428 | 574 | 33929 | https://github.com/angular/angular |
| request | 30 | 34726 | 238564 | 291 | 18811 | https://github.com/request/request |
| @angular/router | 30 | 5215 | 82697 | 574 | 33929 | https://github.com/angular/angular |
| @angular/core | 30 | 10172 | 102741 | 574 | 33929 | https://github.com/angular/angular |
| babel-cli | 30 | 58483 | 102471 | 552 | 26369 | https://github.com/babel/babel |
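A ranking like the one above can be sketched in Python with `heapq.nlargest`. The four records below reuse values from the table; the dictionary keys and the tie-breaker on dependent repository count are assumptions of the sketch, not necessarily how the table itself was ordered.

```python
import heapq


def top_by_sourcerank(records, n=30):
    """Return the n highest-scoring records, breaking ties by dependent repositories."""
    return heapq.nlargest(
        n,
        records,
        key=lambda rec: (rec["sourcerank"], rec["dependent_repositories_count"]),
    )


# Values copied from the table above; key names are hypothetical.
sample = [
    {"name": "mocha", "sourcerank": 32, "dependent_repositories_count": 352184},
    {"name": "express", "sourcerank": 31, "dependent_repositories_count": 622380},
    {"name": "rake", "sourcerank": 30, "dependent_repositories_count": 571371},
    {"name": "lodash", "sourcerank": 32, "dependent_repositories_count": 385284},
]

top_three = [rec["name"] for rec in top_by_sourcerank(sample, n=3)]
print(top_three)
```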
Taking a look at our findings of the most prominent projects, we see that all of them have very high dependent repository and project counts, which we know are great indicators of prominence, due to our earlier reconstructability analysis. Interestingly, none of these projects are sponsored by IaaS CSPs. While I knew I wasn’t ready to answer my manager’s original question regarding APIs, I knew I could at least gain some insights into IaaS CSP sponsored projects.
Using RStudio, I looked at specific repositories that I knew belonged to large IaaS CSPs like Google, Amazon, and Microsoft, along with IaaS focused foundations like the OpenStack Foundation and CNCF, which hosts the Kubernetes project. By creating a new column highlighting these specific CSPs and foundations, I was able to parse out the most prominent IaaS CSP sponsored projects to create the table below.
| name | SourceRank | dep_prj_cnt | dep_repo_cnt | repo_contributors | stars | url |
|---|---|---|---|---|---|---|
| com.google.guava:guava | 25 | 3768 | 55701 | 139 | 22690 | https://github.com/google/guava.git |
| googleapis | 25 | 573 | 4513 | 76 | 6049 | https://github.com/google/google-api-nodejs-client |
| com.google.inject:guice | 24 | 732 | 8071 | 46 | 6370 | scm:git:git://github.com/google/guice.git/guice |
| google/apiclient | 22 | 331 | 2017 | 92 | 5017 | https://github.com/google/google-api-php-client |
| org.codehaus.groovy:groovy | 22 | 355 | 2262 | 248 | 2503 | scm:git:https://github.com/apache/groovy.git |
| traceur | 22 | 496 | 4427 | 59 | 7564 | https://github.com/google/traceur-compiler |
| eslint-config-google | 22 | 1578 | 4686 | 9 | 714 | https://github.com/google/eslint-config-google |
| material-design-icons | 22 | 269 | 2059 | 17 | 33810 | https://github.com/google/material-design-icons |
| Google.Protobuf | 21 | 134 | 700 | 375 | 24237 | https://github.com/google/protobuf |
| MSTest.TestFramework | 21 | 63 | 4001 | 20 | 188 | https://github.com/microsoft/testfx |
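One way to build a provider column like the one described above is to read the repository owner out of each URL and look it up in a hand-made mapping. The Python sketch below is only an illustration: the mapping is a hypothetical, incomplete example rather than the list I used, and it only understands GitHub-style URLs.

```python
import re

# Hypothetical owner-to-provider mapping for illustration only.
PROVIDER_BY_OWNER = {
    "google": "Google",
    "microsoft": "Microsoft",
    "aws": "Amazon",
    "kubernetes": "CNCF",
    "openstack": "OpenStack Foundation",
}

# Captures the owner segment that follows "github.com/" (or "github.com:").
OWNER_PATTERN = re.compile(r"github\.com[/:]([^/\s]+)/", re.IGNORECASE)


def iaas_provider(url):
    """Return the mapped provider for a repository URL, or None if unmatched."""
    if not url:
        return None
    match = OWNER_PATTERN.search(url)
    if match is None:
        return None
    return PROVIDER_BY_OWNER.get(match.group(1).lower())


# URLs taken from the tables in this post.
urls = [
    "https://github.com/google/guava.git",
    "scm:git:git://github.com/google/guice.git/guice",
    "https://github.com/microsoft/testfx",
    "https://github.com/mochajs/mocha",
    "NA",
]

flags = [iaas_provider(url) for url in urls]
print(flags)
```

A regular expression is used instead of a URL parser because some entries carry Maven-style prefixes such as `scm:git:`, which a standard URL parser would not split as expected.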
## Closing
As you can see, there is a wealth of information available in the Libraries.io dataset. Overall, GitHub hosts the vast majority of repositories, we depend on Google open source contributions very heavily, and project dependency counts truly seem to be an indicator of project prominence. In the next part of this series, we’ll take a look at projects that are at risk due to the Bus Factor, i.e. projects that are depended upon by many other packages but have only a handful of contributors who commit to the project.
| name | SourceRank | dep_prj_cnt | dep_repo_cnt | repo_contributors | stars | url |
|---|---|---|---|---|---|---|
| Config-Any | 9 | NA | 0 | NA | NA | http://git.shadowcat.co.uk/gitweb/gitweb.cgi?p=p5sagit/Config-Any.git |
| Data-Dumper | 9 | 1597 | 0 | NA | NA | NA |
| Test-NoTabs | 9 | 944 | 0 | 0 | 0 | https://github.com/karenetheridge/Test-NoTabs |
| File-Slurp | 9 | 841 | 0 | NA | NA | NA |
| Test-PAUSE-Permissions | 9 | 716 | 0 | NA | NA | NA |
| MIME-Base64 | 9 | 705 | 0 | NA | NA | NA |
| Path-Class | 9 | 685 | 0 | NA | NA | NA |
| Digest-MD5 | 9 | 663 | 0 | NA | NA | NA |
| CPAN-Meta | 9 | 621 | 0 | 23 | 25 | https://github.com/Perl-Toolchain-Gang/CPAN-Meta |
| Pod-Usage | 9 | 603 | 0 | NA | NA | NA |
| Test-NoWarnings | 9 | 528 | 0 | NA | NA | NA |
| @types/yargs | 8 | 485 | 872 | 1039 | 14616 | https://github.com/DefinitelyTyped/DefinitelyTyped |
| XML-LibXML | 5 | 460 | 0 | 15 | 4 | https://github.com/shlomif/perl-XML-LibXML |
| pkg-config | 9 | 450 | 0 | NA | NA | NA |
| psr/container | 0 | 442 | 13346 | 25 | 2162 | https://github.com/php-fig/container |
| Class-Accessor | 9 | 414 | 0 | NA | NA | NA |
| Test-Differences | 9 | 374 | 0 | NA | NA | NA |
| commons-configuration:commons-configuration | 9 | 365 | 5871 | NA | NA | scm:svn:http://svn.apache.org/repos/asf/commons/proper/configuration/tags/CONFIGURATION_1_10RC2 |
| hsqldb:hsqldb | 8 | 347 | 3411 | NA | NA | scm:svn:http://anonsvn.jboss.org/repos/maven/poms/jboss-parent-pom/tags/jboss-parent-5/hsqldb |
| Module-Runtime | 9 | 341 | 0 | NA | NA | NA |
| @types/prop-types | 0 | 340 | 495 | 1039 | 14616 | https://github.com/DefinitelyTyped/DefinitelyTyped |
| Retyped.dom | 9 | 326 | 3 | NA | NA | NA |
| ExtUtils-CBuilder | 8 | 324 | 0 | 11 | 16 | http://github.com/Perl-Toolchain-Gang/ExtUtils-CBuilder |
| File-ShareDir-Install | 9 | 320 | 0 | 1 | 0 | https://github.com/Perl-Toolchain-Gang/File-ShareDir-Install |
| child_process | 0 | 311 | 686 | 3 | 41 | https://github.com/npm/security-holder |
| CONFIG | 9 | 303 | 42 | NA | NA | NA |
| File-Find-Rule | 7 | 301 | 0 | NA | NA | NA |
| DBIx-Class | 9 | 300 | 0 | 126 | 124 | NA |
| Data-Dump | 7 | 297 | 0 | NA | NA | NA |
| python | 9 | 284 | 0 | NA | NA | NA |
| LWP-Protocol-https | 8 | 279 | 0 | NA | NA | NA |
| Retyped.lodash | 9 | 278 | 0 | NA | NA | NA |
| Digest-SHA1 | 7 | 271 | 0 | NA | NA | NA |
| Module-Load | 7 | 267 | 0 | NA | NA | NA |
| Catalyst-Runtime | 9 | 261 | 0 | NA | NA | NA |
| Test | 8 | 257 | 10 | NA | NA | NA |
| Retyped.node | 9 | 250 | 2 | NA | NA | NA |
| @types/inquirer | 8 | 248 | 360 | 1039 | 14616 | https://github.com/DefinitelyTyped/DefinitelyTyped |
| Test-MockObject | 8 | 247 | 0 | NA | NA | NA |
| Test-CheckDeps | 9 | 245 | 0 | 2 | 2 | https://github.com/Leont/test-checkdeps |
| esdoc-standard-plugin | 7 | 236 | 243 | 5 | 54 | https://github.com/esdoc/esdoc-plugins |
| Any-Moose | 7 | 228 | 0 | 8 | 0 | https://github.com/moose/Any-Moose |
| typo3/cms-core | 0 | 227 | 271 | 215 | 1 | https://github.com/TYPO3-CMS/core |
| YAML-Syck | 9 | 219 | 0 | NA | NA | NA |
| Params-Util | 7 | 213 | 0 | NA | NA | NA |
| javax.servlet:jstl | 0 | 209 | 1073 | NA | NA | NA |
| Module-Metadata | 8 | 209 | 0 | 20 | 4 | https://github.com/Perl-Toolchain-Gang/Module-Metadata |
| common-sense | 7 | 207 | 0 | NA | NA | NA |
| MRO-Compat | 7 | 206 | 0 | 5 | 0 | https://github.com/moose/MRO-Compat |
| autoconf | 6 | 203 | 0 | NA | NA | NA |