R Packages Popularity

TidySunday - R for Fun

Introduction

R has been developing remarkably for over 15 years, and that progress shows no sign of slowing, thanks to its strong community of academics and applied statisticians. R packages, just like libraries in other programming languages, are essential building blocks for any work involving data and models. There is a massive set of packages for statistical modelling, machine learning, visualisation, and importing and manipulating data. Whatever model or graphic you are trying to produce, chances are that someone has already tried to do it, and you can learn from their efforts. Furthermore, the dynamic and enthusiastic research community offers immediate access to cutting-edge statistical techniques and implementations from studies in statistics and machine learning.

In this post, our modeling goal is to understand the relationship between the number of releases and the number of vignettes of R packages, using data from #TidyTuesday.

Data Exploration

First, let's load the dataset using the GitHub link from the TidyTuesday challenge repository.

library(tidyverse)
dat <- read_csv("https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2022/2022-03-15/cran.csv")

The columns of our dataset are described below:

  • package
    • Package name (character)
  • version
    • Package version number (character)
  • date
    • Date and time the package was uploaded/updated (datetime)
  • rnw
    • RNW (Sweave) based vignette count (integer)
  • rmd
    • RMD based vignette count (integer)
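Before going further, we can confirm the size and column types for ourselves (a quick check; output omitted here):

# Inspect structure: column names, types, and number of rows
glimpse(dat)
nrow(dat)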

A natural first question is what the data actually look like, so let's check a couple of well-known packages:

dat %>% filter(package == "ggplot2")
## # A tibble: 41 x 5
##    package version date                       rnw   rmd
##    <chr>   <chr>   <chr>                    <dbl> <dbl>
##  1 ggplot2 0.5.1   Sat Jun  9 10:58:29 2007     0     0
##  2 ggplot2 0.5.2   Sun Jun 17 21:57:47 2007     0     0
##  3 ggplot2 0.5.4   Sun Jul  8 09:51:07 2007     0     0
##  4 ggplot2 0.5.5   Fri Aug 31 13:16:07 2007     0     0
##  5 ggplot2 0.5.6   Wed Oct 17 20:03:58 2007     0     0
##  6 ggplot2 0.5.7   Fri Jan 11 18:35:11 2008     0     0
##  7 ggplot2 0.5     Fri Jun  1 10:22:10 2007     0     0
##  8 ggplot2 0.6     Thu Apr  3 18:28:00 2008     0     0
##  9 ggplot2 0.7     Fri Oct  3 17:26:35 2008     0     0
## 10 ggplot2 0.8.1   Thu Dec 11 21:41:00 2008     0     0
## # ... with 31 more rows
dat %>% filter(package == "tidymodels")
## # A tibble: 7 x 5
##   package    version date                      rnw   rmd
##   <chr>      <chr>   <chr>                   <dbl> <dbl>
## 1 tidymodels 0.0.1   2018-07-22 16:04:08 UTC     0     0
## 2 tidymodels 0.0.2   2018-11-27 02:49:14 UTC     0     0
## 3 tidymodels 0.0.3   2019-10-04 21:11:00 UTC     0     0
## 4 tidymodels 0.1.0   2020-02-16 20:03:48 UTC     0     2
## 5 tidymodels 0.1.1   2020-07-13 21:47:48 UTC     0     2
## 6 tidymodels 0.1.2   2020-11-22 21:07:26 UTC     0     2
## 7 tidymodels 0.1.3   2021-04-19 02:18:42 UTC     0     2
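Notice that the date column is stored as character and mixes two formats: the older ggplot2 releases use a ctime-style format (e.g. “Sat Jun  9 10:58:29 2007”) while the tidymodels releases use an ISO-style format with a UTC suffix. As a rough check (a sketch; iso_format is a name made up here), we can count how many rows use each format:

# Flag rows whose date starts with a four-digit year (ISO-style)
dat %>%
  mutate(iso_format = str_detect(date, "^\\d{4}-")) %>%
  count(iso_format)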

Now, we will create a summarized dataset with the following features: (1) the first release date, (2) the number of releases, and (3) the number of vignettes at the most recent release.

summarized_pkg <- dat %>%
  group_by(package) %>%
  summarize(
    release_date = first(date),          # (1) first release date
    n_releases = n(),                    # (2) number of releases
    vignettes = last(rnw) + last(rmd)    # (3) vignette count at the latest release
  )
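Note that first() and last() depend on row order, and with the mixed date formats above the rows are not guaranteed to be chronological. If we wanted to be strict about it, we could parse both formats and sort before summarizing; here is a sketch using lubridate (summarized_pkg_sorted is a made-up name, and we keep the simpler version above for the rest of the post):

library(lubridate)

summarized_pkg_sorted <- dat %>%
  # Strip the "UTC" suffix, then try both date formats
  mutate(parsed = parse_date_time(
    str_remove(date, " UTC$"),
    orders = c("a b d H:M:S Y", "Y-m-d H:M:S")
  )) %>%
  group_by(package) %>%
  arrange(parsed, .by_group = TRUE) %>%
  summarize(
    release_date = first(parsed),
    n_releases = n(),
    vignettes = last(rnw) + last(rmd)
  )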

Hold on! Let's check the proportion of packages with no vignettes versus those having at least one:

mean(summarized_pkg$vignettes < 1)
## [1] 0.6611048

So about two-thirds of packages have no vignettes at all. Let's inspect the distribution of the number of vignettes:

summarized_pkg %>%
  ggplot(aes(vignettes)) + 
  geom_histogram(bins = 12, colour = "#404080") + 
  scale_x_continuous(trans = scales::pseudo_log_trans(base = 10))

The distribution is right-skewed, with most of the mass on zero-vignette packages. Just a few packages have a ton of vignettes:

summarized_pkg %>% filter(vignettes > 20)
## # A tibble: 4 x 4
##   package release_date            n_releases vignettes
##   <chr>   <chr>                        <int>     <dbl>
## 1 catdata 2012-02-03 10:29:32 UTC          4        45
## 2 fastai  2020-11-05 18:26:03 UTC          9        24
## 3 HSAUR3  2014-04-17 10:31:57 UTC         11        23
## 4 pla     2015-08-18 21:58:53 UTC          2        61

We will make one more exploratory plot, comparing the number of releases of packages with and without a vignette:

summarized_pkg %>%
  mutate(has_vignette = vignettes > 0) %>%
  ggplot(aes(has_vignette, n_releases, color = has_vignette, fill = has_vignette)) + 
  geom_boxplot(size = 1.3, alpha = 0.15, show.legend = FALSE) +
  scale_y_log10() +
  coord_flip() +
  labs(x = "Has a vignette?")

Packages with more releases appear more likely to have a vignette.
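We can back this impression up with a quick numeric summary (a sketch; output omitted):

# Median release count for packages with and without a vignette
summarized_pkg %>%
  mutate(has_vignette = vignettes > 0) %>%
  group_by(has_vignette) %>%
  summarize(n = n(), median_releases = median(n_releases))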

Poisson Regression Model

There are various specialized packages that provide a framework for Poisson regression. In this post, we will use poissonreg with the glm engine, which matches the discrete, count nature of our outcome.

library(tidymodels)
library(poissonreg)

poisson_spec <- poisson_reg() %>%
  set_mode('regression') %>%
  set_engine('glm')

pois_rec_spec <- recipe(
  vignettes ~ n_releases,
  data = summarized_pkg)

poisson_wf <- workflow() %>%
  add_recipe(pois_rec_spec) %>%
  add_model(poisson_spec)


fit_model <- poisson_wf %>% fit(data = summarized_pkg)

augment(fit_model, new_data = summarized_pkg, type.predict = "response") %>%
  ggplot(aes(vignettes, .pred)) +
  geom_point(alpha = 0.1) +
  geom_abline(slope = 1, size = 1, color = "grey40") +
  labs(title = "Predicting the number of vignettes using Poisson regression", x = "Actual", y = "Predicted")
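Beyond the plot, we might also want a quick numeric check of the fit; yardstick's default regression metrics give a rough (if crude, for count data) summary:

# RMSE, R-squared, and MAE of the Poisson fit on the training data
augment(fit_model, new_data = summarized_pkg) %>%
  metrics(truth = vignettes, estimate = .pred)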

Poisson regressions do not supply an independent variance parameter \(\sigma\), and as a result can be overdispersed. Under the Poisson distribution the variance equals the mean; that is, the standard deviation equals the square root of the mean. In the model above, \(\mathbb{E}(y_{i}) = \theta_{i}\) and \(\text{Sd}(y_{i}) = \sqrt{\theta_{i}}\). We define the standardized residuals as follows: \[ \begin{aligned} z_i & = \frac{y_i - \hat{y}_i}{\text{Sd}(\hat{y}_i)} \\ & = \frac{y_i - \hat{\theta}_i}{\sqrt{\hat{\theta}_i}} \end{aligned} \] where \(\hat{\theta}_i = e^{X_i \hat{\beta}}\). If the Poisson model is true, then the \(z_i\)'s should be approximately independent (not exactly independent, since the same estimate \(\hat{\beta}\) is used in computing all of them), each with mean 0 and standard deviation 1. If there is overdispersion, however, we would expect the \(z_i\)'s to be larger in absolute value, reflecting the extra variation beyond what is predicted under the Poisson model.

We can test for overdispersion in classical Poisson regression by computing the sum of squares of the \(n\) standardized residuals, \(\sum_{i=1}^{n} z^2_{i}\), and comparing this to the \(\chi^2_{n-k}\) distribution, which is what we would expect under the model (using \(n-k\) rather than \(n\) degrees of freedom to account for the estimation of \(k\) regression coefficients). The \(\chi^2_{n-k}\) distribution has average value \(n-k\), and so the ratio, \[ \text{estimated overdispersion} = \frac{1}{n-k} \sum_{i=1}^{n} z^2_{i} \] is a summary of the overdispersion in the data compared to the fitted model. Thus, we will estimate the overdispersion factor using the formula above and provide an approximate test:

result <- augment(fit_model, new_data = summarized_pkg, type.predict = "response") %>% 
  select(vignettes, .pred) %>%
  mutate(z = (vignettes - .pred) / sqrt(.pred))

# n observations and k estimated coefficients (intercept + n_releases)
n <- nrow(result)
k <- 2

cat("overdispersion ratio is ", round(sum(result$z^2) / (n - k), 3), "\n")
## overdispersion ratio is  3.466
cat("p-value of overdispersion test is ", pchisq(sum(result$z^2), n - k, lower.tail = FALSE), "\n")
## p-value of overdispersion test is  0

The sum of squared standardized residuals is \(\sum_{i=1}^{n} z^2_{i} = 60863.66\), compared to an expected value of \(n-k = 17558\). The estimated overdispersion factor is \(60863.66/17558 \approx 3.47\), and the p-value is essentially zero: the probability that a random variable from a \(\chi^2_{17558}\) distribution would be as large as \(60863.66\) is vanishingly small. In summary, the number of vignettes is overdispersed by a factor of about 3.5, which is large (even an overdispersion factor of 2 would be considered large) and statistically significant.
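One common remedy worth mentioning is a quasi-Poisson model, which estimates the dispersion from the data rather than fixing it at 1. As a quick cross-check (a sketch using base R, outside the tidymodels workflow), the dispersion reported there should be of the same order as our estimate:

# Quasi-Poisson fit; summary() reports the estimated dispersion parameter
quasi_fit <- glm(vignettes ~ n_releases, family = quasipoisson, data = summarized_pkg)
summary(quasi_fit)$dispersion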

Zero-inflated Poisson Model

A better model for this dataset on R package vignettes might be zero-inflated Poisson (ZIP), as there are so many zeroes. A ZIP model mixes two components: one that generates zeroes and one that models the counts (the size of non-zero observations) with the Poisson distribution.
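For reference, the standard ZIP density (stated here, not derived from this dataset) with zero-inflation probability \(\pi_i\) and Poisson mean \(\theta_i\) is \[ P(y_i = 0) = \pi_i + (1 - \pi_i)\, e^{-\theta_i}, \qquad P(y_i = k) = (1 - \pi_i)\, \frac{\theta_i^{k} e^{-\theta_i}}{k!} \quad \text{for } k \geq 1. \]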

The two components can use different sets of covariates:

  • one for the count data
  • one for the probability of zeroes

Creating this kind of model in tidymodels is straightforward:

zip_spec <- poisson_reg() %>% set_engine("zeroinfl")
zip_wf <- workflow() %>%
  add_variables(outcomes = vignettes, predictors = n_releases) %>%
  add_model(zip_spec, formula = vignettes ~ n_releases | n_releases)

fit(zip_wf, data = summarized_pkg)
## == Workflow [trained] ==========================================================
## Preprocessor: Variables
## Model: poisson_reg()
## 
## -- Preprocessor ----------------------------------------------------------------
## Outcomes: vignettes
## Predictors: n_releases
## 
## -- Model -----------------------------------------------------------------------
## 
## Call:
## pscl::zeroinfl(formula = vignettes ~ n_releases | n_releases, data = data)
## 
## Count model coefficients (poisson with log link):
## (Intercept)   n_releases  
##     0.23669      0.01382  
## 
## Zero-inflation model coefficients (binomial with logit link):
## (Intercept)   n_releases  
##     0.27897     -0.01396

The coefficients here are different from those of the plain Poisson model, but we still see that packages with more releases have more vignettes (and packages with fewer releases are more likely to have zero vignettes).
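Since the zero-inflation component uses a logit link, we can translate its coefficients into probabilities with plogis(); for example, a rough sketch of the estimated probability of an excess zero for a package with 1 versus 50 releases, plugging in the coefficients printed above:

# logit^-1(intercept + slope * n_releases) for the zero-inflation part
plogis(0.27897 - 0.01396 * 1)   # a package with a single release
plogis(0.27897 - 0.01396 * 50)  # a package with 50 releases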

Coefficient Intervals Using the Bootstrap

At this stage, we will use the bootstrap to make inferences about the model coefficients and their confidence intervals. The first step is to create a set of bootstrap resamples:

folds <- bootstraps(summarized_pkg, times = 1000, apparent = TRUE)
folds
## # Bootstrap sampling with apparent sample 
## # A tibble: 1,001 x 2
##    splits               id           
##    <list>               <chr>        
##  1 <split [17560/6468]> Bootstrap0001
##  2 <split [17560/6484]> Bootstrap0002
##  3 <split [17560/6457]> Bootstrap0003
##  4 <split [17560/6431]> Bootstrap0004
##  5 <split [17560/6474]> Bootstrap0005
##  6 <split [17560/6429]> Bootstrap0006
##  7 <split [17560/6484]> Bootstrap0007
##  8 <split [17560/6525]> Bootstrap0008
##  9 <split [17560/6493]> Bootstrap0009
## 10 <split [17560/6485]> Bootstrap0010
## # ... with 991 more rows

Now let’s create a little function to get out the coefficients for the probability-of-zero-counts part of our ZIP model. (We could instead or in addition get out the Poisson/count part of the ZIP model.)

get_coefs <- function(x) {
  x %>%
    extract_fit_engine() %>%
    tidy(type = "zero")
}

fit(zip_wf, data = summarized_pkg) %>% get_coefs()
## # A tibble: 2 x 6
##   term        type  estimate std.error statistic  p.value
##   <chr>       <chr>    <dbl>     <dbl>     <dbl>    <dbl>
## 1 (Intercept) zero    0.279    0.0265      10.5  7.22e-26
## 2 n_releases  zero   -0.0140   0.00201     -6.94 3.99e-12

We can now take this function and use it for all of our bootstrap resamples with our ZIP model.

ctrl <- control_resamples(extract = get_coefs)

doParallel::registerDoParallel()
set.seed(28)
zip_res <- fit_resamples(zip_wf, folds, control = ctrl)
zip_res
## # Resampling results
## # Bootstrap sampling with apparent sample 
## # A tibble: 1,001 x 5
##    splits                id            .metrics         .notes      .extracts   
##    <list>                <chr>         <list>           <list>      <list>      
##  1 <split [17560/17560]> Apparent      <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  2 <split [17560/6468]>  Bootstrap0001 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  3 <split [17560/6484]>  Bootstrap0002 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  4 <split [17560/6457]>  Bootstrap0003 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  5 <split [17560/6431]>  Bootstrap0004 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  6 <split [17560/6474]>  Bootstrap0005 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  7 <split [17560/6429]>  Bootstrap0006 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  8 <split [17560/6484]>  Bootstrap0007 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
##  9 <split [17560/6525]>  Bootstrap0008 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
## 10 <split [17560/6493]>  Bootstrap0009 <tibble [2 x 4]> <tibble [0~ <tibble [1 ~
## # ... with 991 more rows

We can use tidyr to get those out, and then we can visualize the bootstrap intervals.

zip_res %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  ggplot(aes(x = estimate, fill = term)) +
  geom_histogram(color = "white", alpha = 0.8, show.legend = FALSE) +
  facet_wrap(~term, scales = "free_x") +
  geom_vline(xintercept = 0, lty = 2, size = 1.2, color = "gray70")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
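Finally, if we want numeric intervals to go with the plot, we can compute percentile intervals from the same unnested estimates (a sketch using quantile() for a 95% interval):

# 2.5% and 97.5% quantiles of the bootstrap estimates, per term
zip_res %>%
  select(id, .extracts) %>%
  unnest(.extracts) %>%
  unnest(.extracts) %>%
  group_by(term) %>%
  summarize(
    .lower = quantile(estimate, 0.025),
    .estimate = mean(estimate),
    .upper = quantile(estimate, 0.975)
  )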