Distribution of the Number of Patrons Per Patreon Project

Introduction

My goal in this analysis is to explore the distribution of the number of patrons per Patreon project. This is a companion to my analysis of the distribution of earnings among Patreon projects.

Put another way, I want to estimate the probability of a Patreon project having more than a certain number of patrons. For example, how likely is it that a random Patreon project has more than 10 patrons? More than 100? More than 1,000?

Clearly this probability decreases the higher the desired number of patrons is: the probability of having more than 100 patrons is less than the probability of having more than 10. But how can we quantify this? Is there a simple rule by which we can estimate this probability?

A common conception is that the numbers of patrons on Patreon (or numbers of subscribers on other “creator economy” services like Substack) are distributed according to a so-called “power-law” distribution. (For the mathematics behind a power-law distribution, see the analysis of the distribution of Patreon earnings.) One goal of mine in this analysis is to assess whether or not this is true.

For those readers not familiar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, check out the various ways to learn more about the Tidyverse.

Setup

I load the following R libraries, for the purposes listed:

tidyverse. Do general data manipulation and plotting.
tools. Compute MD5 checksums.
DescTools. Compute Gini coefficients.
poweRlaw. Work with power-law and other distributions.

library("tidyverse")
library("tools")
library("DescTools")
library("poweRlaw")

Preparing the data

Obtaining the Patreon data

I use a local copy of the Graphtreon-collected Patreon data for December 2022. This dataset contains an entry for every Patreon project for which the number of patrons is publicly reported.

Because the Graphtreon data is proprietary, I store it in a separate directory and do not make it available as part of this analysis. See the “References” section below for more information.

I check the MD5 hash value for the file, and stop if the contents are not what is expected.

stopifnot(md5sum("../../graphtreon/graphtreonBasicExport_Dec2022.csv") == "98ff63f7d6aa3f2d1b2acaf40425ac9b")
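
The checksum guard above can be tried out without the proprietary file. The following is a minimal sketch that uses a small temporary file as a hypothetical stand-in for the Graphtreon export; the file contents and the `expected_md5` and `file_changed` names are made up for illustration.

```r
library(tools)

# Hypothetical stand-in for the proprietary CSV: a small temporary file.
path <- tempfile(fileext = ".csv")
writeLines(c("Name,Patrons", "Example Project,42"), path)

# Record the checksum once. In a real script this string would be
# pasted in as a literal, as in the stopifnot() call above.
# md5sum() returns a named character vector, so the names are dropped.
expected_md5 <- unname(md5sum(path))

# Later runs recompute the checksum and stop if the file has changed.
stopifnot(unname(md5sum(path)) == expected_md5)

# Changing the file changes the checksum, so the same check would now fail.
writeLines(c("Name,Patrons", "Example Project,43"), path)
file_changed <- unname(md5sum(path)) != expected_md5
```

Hard-coding the expected hash means that a silently updated or truncated download is caught before any results are computed from it.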

Loading the Patreon data

I load the raw Patreon data from Graphtreon:

patreon_tb <- read_csv("../../graphtreon/graphtreonBasicExport_Dec2022.csv")

## Rows: 217861 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Name, Creation Name, Category, Pay Per, Patreon, Graphtreon
## dbl  (4): Patrons, Earnings, Is Nsfw, Twitter Followers
## dttm (1): Launched
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Analysis

Preliminary analysis

I do some basic exploratory data analysis, starting with the total amount of data in the dataset.

total_projects <- length(patreon_tb$Patrons)

There are a total of 217,861 projects listed in the Graphtreon data for the month in question. Note the word “projects” here, not “creators”: Patreon is organized by projects, and it’s possible that a given person may have more than one project active. It’s also possible that a given project may be associated with multiple people.

I suspect that the vast majority of Patreon projects are associated with one creator, and that the vast majority of people have only one project in which they participate. Unfortunately there’s no way of telling from the data at hand how true this is. I’ll therefore be careful in the terms I use, and will generally refer to “projects,” not “creators.”

Moving on to the actual data fields, there are three numeric variables of interest in the Graphtreon data:

the number of patrons for each Patreon project
the earnings for each project, for those projects that publicly report earnings
the number of Twitter followers of the Twitter account (if any) associated with the project

As noted above, my primary focus in this analysis is on the number of patrons per project.

I start by checking to see if (as advertised) all projects in the dataset have their number of patrons reported, and if any of the projects reported having zero patrons.

no_reported_patrons <- patreon_tb %>%
  filter(is.na(Patrons)) %>%
  summarize(n()) %>%
  as.integer()

zero_reported_patrons <- patreon_tb %>%
  filter(!is.na(Patrons) & Patrons <= 0) %>%
  summarize(n()) %>%
  as.integer()

For the month in question there were 0 projects in the dataset that did not report their number of patrons, and 0 projects that reported having zero patrons.
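
The same two checks can be sketched in base R on a toy vector, which also shows how the missing, zero, and positive counts partition the data. The values below are invented for illustration and are not taken from the Graphtreon data.

```r
# Toy stand-in for the Patrons column (made-up values).
patrons <- c(120, 0, NA, 7, 1, NA, 33)

n_missing  <- sum(is.na(patrons))                   # no value reported
n_zero     <- sum(!is.na(patrons) & patrons <= 0)   # reported, but zero
n_positive <- sum(!is.na(patrons) & patrons > 0)    # kept for the analysis

# The three groups are mutually exclusive and cover every entry.
stopifnot(n_missing + n_zero + n_positive == length(patrons))
```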

Projects ranked by their number of patrons

I now construct a sample dataset consisting of all projects reporting nonzero numbers of patrons for the month in question, ranked by the number of patrons, from greatest to least.

by_patrons_tb <- patreon_tb %>%
  filter(!is.na(Patrons) & Patrons > 0) %>%
  arrange(desc(Patrons))

by_patrons_tb <- by_patrons_tb %>%
  mutate(Patrons_Rank = 1:nrow(by_patrons_tb))

nonzero_patrons <- total_projects - (no_reported_patrons + zero_reported_patrons)

This sample dataset contains a total of 217,861 projects, representing 100% of all projects in the Graphtreon dataset.
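
One design choice worth noting is how tied values are ranked. Numbering the sorted rows consecutively, as above, gives every project a distinct rank even when several projects have the same number of patrons. The sketch below, on made-up values, contrasts that with a "minimum rank" scheme in which tied projects share a rank; the variable names are illustrative only.

```r
# Made-up patron counts, including one tie.
patrons <- c(5, 120, 7, 7, 1)

sorted <- sort(patrons, decreasing = TRUE)       # 120, 7, 7, 5, 1

# Consecutive numbering: ties get distinct, arbitrary ranks.
rank_sequential <- seq_along(sorted)             # 1, 2, 3, 4, 5

# Minimum-rank numbering: tied values share the lowest applicable rank.
rank_min <- rank(-sorted, ties.method = "min")   # 1, 2, 2, 4, 5
```

For a rank plot with over 200,000 points, consecutive numbering has the advantage that every project occupies its own position on the \(x\)-axis.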

Plotting number of patrons vs. number of patrons rank

Now that I have my dataset of interest, I can continue my exploratory data analysis, this time by plotting the number of patrons per project as a function of rank (i.e., from those projects having the most patrons to those having the least).

by_patrons_tb %>%
  ggplot(mapping=aes(x = Patrons_Rank, y = Patrons)) +
  geom_point() +
  scale_x_continuous(labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_comma()) +
  xlab("Patrons Rank") +
  ylab("Number of Patrons") +
  labs(
    title = "Patreon Number of Patrons vs. Patrons Rank",
    subtitle = "All Projects Reporting Their Number of Patrons",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This is an extremely skewed distribution: for the month in question only a relatively small number of top-ranked projects had significant numbers of patrons.

An alternative way of plotting such a highly skewed distribution is to plot both the \(x\)- and \(y\)-axes as logarithms of the underlying values (a so-called “log-log” plot). (This requires all values to be greater than zero, since the logarithm of zero is undefined.) Here is such a plot for number of patrons vs. number of patrons rank:

by_patrons_tb %>%
  ggplot(mapping=aes(x = Patrons_Rank, y = Patrons)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(10, 100, 1000, 10000, 50000, 100000, 200000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(5, 10, 100, 1000, 10000, 50000), labels = scales::label_comma()) +
  xlab("Patrons Rank") +
  ylab("Number of Patrons") +
  labs(
    title = "Patreon Number of Patrons vs. Patrons Rank (Log-Log)",
    subtitle = "All Projects Reporting Their Number of Patrons",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

If the number of patrons per project followed a power law, then graphing it on a log-log scale would produce a straight line. Here the curve is clearly not straight: it bends away from a line in the right tail, among the projects with the fewest patrons, indicating a departure from power-law behavior in that region. A similar deviation appears to be present among the projects with the most patrons.
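
To illustrate why a power law shows up as a straight line on log-log axes, here is a minimal sketch using synthetic data in which the value falls off exactly as a power of the rank, \(y = C r^{-a}\). Taking logarithms gives \(\log_{10} y = \log_{10} C - a \log_{10} r\), a line with slope \(-a\). The constants `C` and `a` are arbitrary illustrative choices, not estimates from the Patreon data.

```r
# Synthetic rank-size data following an exact power law: y = C * r^(-a).
C <- 50000
a <- 0.8
r <- 1:1000
y <- C * r^(-a)

# On log-log axes this is a straight line:
#   log10(y) = log10(C) - a * log10(r)
fit <- lm(log10(y) ~ log10(r))

intercept <- unname(coef(fit)[1])   # recovers log10(C)
slope     <- unname(coef(fit)[2])   # recovers -a
```

Real data that curves away from such a line, as in the plot above, cannot be described by a single power-law exponent over its whole range.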

Measuring inequality of the number of patrons per project

As shown in the previous section, the distribution of number of patrons among Patreon projects was highly unequal for the month in question. I now compute some example statistics to characterize this inequality:

minimum and maximum number of patrons per project
average number of patrons per project vs. median number of patrons
standard deviation of the number of patrons per project
the percentage of the total number of patron memberships (see below) associated with the top 0.1%, 1%, 10%, 25%, and 50% of projects in the dataset
the percentage of projects that had more than 10 patrons, more than 100 patrons, more than 1,000 patrons, or more than 10,000 patrons
the Gini coefficient (also known as the Gini index), a widely used measure of the level of inequality of resources (e.g., income or wealth)

(I refer to “patron memberships” or just “memberships” because a particular person may be a patron of multiple projects. Just adding up the number of patrons for all projects would result in double-counting. Unfortunately the Graphtreon data doesn’t have sufficient information to calculate the true total number of patrons.)

max_patrons <- max(by_patrons_tb$Patrons)
min_patrons <- min(by_patrons_tb$Patrons)
mean_patrons <- mean(by_patrons_tb$Patrons)
sd_patrons <- sd(by_patrons_tb$Patrons)
median_patrons <- median(by_patrons_tb$Patrons)

# Number of projects in each top-ranked group (data is sorted by rank).
top_point_1_pct <- round(0.001 * nonzero_patrons)
top_1_pct <- round(0.01 * nonzero_patrons)
top_10_pct <- round(0.1 * nonzero_patrons)
top_25_pct <- round(0.25 * nonzero_patrons)
top_50_pct <- round(0.5 * nonzero_patrons)

# Share of all memberships held by each top-ranked group.
total_memberships <- sum(by_patrons_tb$Patrons)
top_point_1_pct_share <- sum(by_patrons_tb$Patrons[1:top_point_1_pct]) / total_memberships
top_1_pct_share <- sum(by_patrons_tb$Patrons[1:top_1_pct]) / total_memberships
top_10_pct_share <- sum(by_patrons_tb$Patrons[1:top_10_pct]) / total_memberships
top_25_pct_share <- sum(by_patrons_tb$Patrons[1:top_25_pct]) / total_memberships
top_50_pct_share <- sum(by_patrons_tb$Patrons[1:top_50_pct]) / total_memberships

# Fraction of projects with strictly more than each threshold.
frac_over_10 <- sum(by_patrons_tb$Patrons > 10) / nonzero_patrons
frac_over_100 <- sum(by_patrons_tb$Patrons > 100) / nonzero_patrons
frac_over_1000 <- sum(by_patrons_tb$Patrons > 1000) / nonzero_patrons
frac_over_10000 <- sum(by_patrons_tb$Patrons > 10000) / nonzero_patrons

gini_patrons <- Gini(by_patrons_tb$Patrons)

For the month in question the number of patrons per project ranged from a minimum of 1 to a maximum of 44,454. The mean number of patrons per project was 59 (with a standard deviation of 427), while the median number of patrons per project was 6. The median being an order of magnitude less than the mean is a reflection of the top-ranked projects having disproportionately more patrons.

More specifically, for the month in question:

The top 0.1% of projects had 16% of the total memberships.
The top 1% of projects had 43% of the total memberships.
The top 10% of projects had 81% of the total memberships.
The top 25% of projects had 93% of the total memberships.
The top 50% of projects had 98% of the total memberships.
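
The top-share calculation behind the figures above can be sketched as a small helper function. This is an illustrative version on made-up numbers; `top_share()` is a hypothetical name and not part of the original analysis.

```r
# Share of the total held by the top `frac` of entries (hypothetical helper).
top_share <- function(x, frac) {
  x <- sort(x, decreasing = TRUE)
  k <- round(frac * length(x))
  # seq_len(k) is empty when k is 0, whereas 1:k would yield c(1, 0).
  sum(x[seq_len(k)]) / sum(x)
}

# Made-up patron counts for ten projects (total of 1,300 memberships).
patrons <- c(1000, 200, 50, 20, 10, 5, 5, 4, 3, 3)

share_top_10_pct <- top_share(patrons, 0.1)   # 1000 / 1300
share_top_50_pct <- top_share(patrons, 0.5)   # 1280 / 1300
```

Using `seq_len()` rather than `1:k` only matters for very small datasets, where rounding can produce a group of size zero.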

Turning now to the proportion of projects having more than a certain number of patrons for the month in question:

40% of projects had more than 10 patrons.
9% of projects had more than 100 patrons.
0.8% of projects had more than 1,000 patrons.
0.03% of projects had more than 10,000 patrons.

The Gini coefficient associated with the number of project memberships is 0.87. A Gini coefficient value of 0 corresponds to completely equal numbers of patrons per project, and a value of 1 to the most unequal distribution. The measured value corresponds to a very unequal distribution of memberships among projects, consistent with the other statistics. (See above for the discussions of patrons vs. memberships when looking at the entire dataset.)
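
For readers curious about what the Gini coefficient measures, here is a minimal base R sketch of the standard sorted-values formula, \(G = \sum_i (2i - n - 1)\, x_{(i)} / (n \sum_i x_i)\). The function name `gini_plain()` is hypothetical. My understanding is that `Gini()` in DescTools has an `unbiased` argument that rescales the result by \(n/(n-1)\), so its output may differ slightly from this sketch; for a dataset of this size the difference would be negligible.

```r
# Plain (uncorrected) Gini coefficient from sorted values.
gini_plain <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

# Perfect equality: every project has the same number of patrons.
g_equal <- gini_plain(c(5, 5, 5, 5))      # 0

# Maximal inequality for n = 4: one project has everything.
g_unequal <- gini_plain(c(0, 0, 0, 10))   # (n - 1) / n = 0.75
```

With this formula the maximum possible value for \(n\) projects is \((n-1)/n\), which approaches 1 as the number of projects grows.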

Finally, for the month in question the total number of memberships for all projects combined was 12,904,012. The Patreon “About” page claims a total of over 8 million monthly active patrons, so dividing the first figure by the second gives an average of roughly 1.6 projects supported per patron.
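
The arithmetic behind that ratio is simple enough to check directly. Both inputs are taken from the paragraph above; since the Patreon figure is stated only as "over 8 million," the result should be read as a rough figure rather than a precise one.

```r
total_memberships <- 12904012   # summed over all projects in the dataset
active_patrons    <- 8e6        # "over 8 million" per Patreon, so approximate

memberships_per_patron <- total_memberships / active_patrons
round(memberships_per_patron, 1)
```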

Do the numbers of patrons per project follow a power-law distribution?

It is very common for people to talk about services like Patreon or Substack as being characterized by a “power-law” distribution. This is typically shorthand for the fact that on such services only a few creators realize significant numbers of patrons or subscribers, with the number of patrons/subscribers rapidly dropping off once you get beyond those in the top rankings.

However, just because the distribution of number of patrons exhibits rapid drop-off (as in the graphs shown above), it doesn’t necessarily follow that the distribution is truly a power-law distribution. In this section I do some tests to assess whether the number of patrons per project for the month in question follows a power law or not.

For more on the mathematics of power-law distributions see the analysis of Patreon earnings.

Fitting distributions to the data

I attempt to fit a power-law distribution to the entire sample dataset of all 217,861 Patreon projects that reported a nonzero number of patrons. Since the number of patrons is always an integer value, I attempt to fit a discrete power-law distribution. I also attempt to fit a discrete exponential distribution and a discrete log-normal distribution, to see if either of those provides a better fit than a power-law distribution.

The first step is to create models for all three distributions, using as input the entire sample dataset of projects with a nonzero number of patrons for the month in question.

m_pl <- displ(by_patrons_tb$Patrons)
m_exp <- disexp(by_patrons_tb$Patrons)
m_lnorm <- dislnorm(by_patrons_tb$Patrons)

I now need to estimate parameters for each of the models. In particular, I need an estimate for \(x_\textrm{min}\), the cut-off point below which the models do not apply. There are two ways to do this.

The first and better way is to use the estimate_xmin() function with each model (power-law, exponential, and log-normal) to try to find the best value of \(x_\textrm{min}\), one that will provide the best model fit. Unfortunately, this is very time-consuming to do for a dataset with over 200,000 entries with values up to almost 45,000.

The alternate approach is simply to specify an arbitrary value of \(x_\textrm{min}\). For example, in the case of Patreon using the value \(x_\textrm{min} = 1\) makes intuitive sense, since all projects have at least one patron. This will likely not give us the best fit, but the model will cover the entire sample dataset.

Setting \(x_\textrm{min}\) to an arbitrary value also simplifies comparing the different models, since they must have the same \(x_\textrm{min}\) value in order to do the comparison in a more rigorous way than simply inspecting curves on plots.

Therefore I next estimate the parameters for the three models using the arbitrary value \(x_\textrm{min} = 1\). I also need to specify a larger value for \(x_\textrm{max}\) since the estimate_xmin() function normally doesn’t look at data values higher than 10,000.

The parameters returned by estimate_xmin() are then plugged back into the models.

m_pl_est <- estimate_xmin(m_pl, xmins = 1, xmax = 50000)
m_pl$setXmin(m_pl_est)

m_exp_est <- estimate_xmin(m_exp, xmins = 1, xmax = 50000)
m_exp$setXmin(m_exp_est)

m_lnorm_est <- estimate_xmin(m_lnorm, xmins = 1, xmax = 50000)
m_lnorm$setXmin(m_lnorm_est)

I can now plot the so-called complementary cumulative distribution function (“ccdf”) of the data, along with the curves of best fit from the three models. The ccdf for a value \(x\) gives the probability that an observed value \(X\) will be greater than \(x\): \(Pr(X \gt x)\). On the other hand, the cumulative distribution function gives the probability that \(X\) is less than or equal to \(x\), or \(Pr(X \le x)\). We thus have \(Pr(X \gt x) = 1 - Pr(X \le x)\).
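
As a small worked example of the relationship between the two functions, the following base R sketch computes an empirical ccdf on made-up values and checks it against one minus the empirical cdf from `ecdf()`. The data and the `ccdf()` helper are illustrative only.

```r
# Made-up patron counts for ten projects.
patrons <- c(1, 1, 2, 3, 5, 8, 13, 21, 34, 55)

cdf  <- ecdf(patrons)                    # Pr(X <= x), as a step function
ccdf <- function(x) mean(patrons > x)    # Pr(X > x), computed directly

p_over_10     <- ccdf(10)   # 4 of the 10 values exceed 10
p_at_most_10  <- cdf(10)    # 6 of the 10 values are 10 or less

# The two probabilities are complementary.
stopifnot(isTRUE(all.equal(p_over_10, 1 - p_at_most_10)))
```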

The poweRlaw R package provides plotting methods for its models, so in the interest of simplicity I use the plot() and lines() functions to create the plot rather than ggplot(). The plot() function plots the ccdf of the underlying patrons data. The lines() function then adds the fitted curves for the power-law distribution (green), the exponential distribution (blue), and the log-normal distribution (orange).

plot(m_pl, xlab = "Number of Patrons", ylab = "CCDF of Number of Patrons", main = "Number of Patrons and Fitted Distributions", sub = "All Patreon Projects Reporting Nonzero Number of Patrons", col = "#000000")
lines(m_pl, col = "#009E73", lwd = 2)
lines(m_exp, col = "#56B4E9", lwd = 2)
lines(m_lnorm, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","exponential", "log-normal"), fill = c("#009E73","#56B4E9", "#E69F00"))

Based on the above plot, it appears that the log-normal distribution is a much better fit to the data than either the power-law or exponential distributions. However, the fit begins to break down for the projects with the very largest numbers of patrons.

I can confirm that the log-normal distribution is a better fit than the power-law and exponential distributions by using the compare_distributions() function.

lnorm_vs_pl_one_sided <- compare_distributions(m_lnorm, m_pl)$p_one_sided
lnorm_vs_exp_one_sided <- compare_distributions(m_lnorm, m_exp)$p_one_sided

The one-sided p-value tests whether the first distribution is a better fit than the second, with small values favoring the first distribution. In this case the one-sided p-value rounds to 0 when comparing the log-normal distribution to the power-law distribution and to 0 when comparing the log-normal distribution to the exponential distribution. These p-values indicate that the log-normal distribution is clearly a better fit than either of the other distributions.
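
Because the Graphtreon data cannot be shared, here is a sketch of the same fit-and-compare workflow on synthetic data, for readers who want to experiment. It assumes the poweRlaw package is installed. Instead of calling estimate_xmin() with a single candidate value, it fixes \(x_\textrm{min}\) directly and then estimates the remaining parameters with estimate_pars(), which I believe is an equivalent route when \(x_\textrm{min}\) is not being searched over. The synthetic data, the `fit_at_xmin()` helper, and the `_demo` names are all hypothetical.

```r
library(poweRlaw)

set.seed(42)
# Synthetic positive integer counts from a rounded-up log-normal;
# the parameters are illustrative, not fitted to the Patreon data.
x <- ceiling(rlnorm(2000, meanlog = 1, sdlog = 1.5))

# Hypothetical helper: fix xmin, then fit the remaining parameters.
fit_at_xmin <- function(model, xmin = 1) {
  model$setXmin(xmin)
  model$setPars(estimate_pars(model)$pars)
  model
}

m_pl_demo    <- fit_at_xmin(displ$new(x))
m_lnorm_demo <- fit_at_xmin(dislnorm$new(x))

# Both models share the same xmin, so they can be compared directly.
cmp <- compare_distributions(m_lnorm_demo, m_pl_demo)
cmp$test_statistic   # positive values favor the first model
cmp$p_one_sided      # small values favor the first model
```

On data generated this way I would expect the log-normal model to be favored, but the exact test statistic and p-value depend on the random seed and sample size.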

Predicted probabilities vs. observed probabilities

The log-normal distribution has parameters \(\mu\) and \(\sigma\) determining the exact form of the distribution. For more on the exact form of the probability density function for a log-normal distribution, see the analysis of Patreon earnings.

In our case \(\mu\) and \(\sigma\) are given by the fitted model:

lnorm_mu <- m_lnorm$pars[1]
lnorm_sigma <- m_lnorm$pars[2]

The value of \(\mu\) is approximately 1.2 and the value of \(\sigma\) is approximately 2.32.

I can use these values as input to the plnorm() function to estimate the probability of a Patreon project having more than a certain number of patrons:

prob_over_10 <- plnorm(10, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_100 <- plnorm(100, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_1000 <- plnorm(1000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_10000 <- plnorm(10000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)

The estimated probabilities are as follows, with the observed probabilities in parentheses:

0.32 estimated probability of having more than 10 patrons (observed 0.4)
0.07 estimated probability of having more than 100 patrons (observed 0.09)
0.007 estimated probability of having more than 1,000 patrons (observed 0.008)
0.0003 estimated probability of having more than 10,000 patrons (observed 0.0003)

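As is apparent from the values above, a log-normal distribution does a reasonably good job of fitting the observed Patreon data for the month in question: it underestimates the proportion of projects with more than 10 patrons (0.32 vs. 0.4), but tracks the observed proportions closely at the higher thresholds.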
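
The estimated probabilities can be reproduced approximately from the rounded parameter values quoted above (\(\mu \approx 1.2\), \(\sigma \approx 2.32\)), without access to the fitted model. The sketch below also shows the equivalent calculation in terms of the standard normal distribution, \(Pr(X \gt x) = 1 - \Phi\left((\ln x - \mu)/\sigma\right)\). Note that plnorm() is the continuous log-normal, so this is an approximation to the discrete model fitted earlier.

```r
# Rounded parameter values as quoted in the text.
mu    <- 1.2
sigma <- 2.32

thresholds <- c(10, 100, 1000, 10000)

# Upper-tail probabilities from the continuous log-normal distribution.
p_tail <- plnorm(thresholds, meanlog = mu, sdlog = sigma, lower.tail = FALSE)

# The same probabilities via the standard normal distribution:
# X is log-normal exactly when log(X) is normal with mean mu and sd sigma.
p_tail_via_normal <- pnorm((log(thresholds) - mu) / sigma, lower.tail = FALSE)

signif(p_tail, 1)
```

This gives a simple rule of thumb for the month in question: convert the desired number of patrons to a z-score on the log scale, then look up the upper-tail probability of the standard normal distribution.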

Conclusions

Based on the above analysis, I conclude the following:

First, as with earnings from monthly charges, the distribution of the number of patrons across Patreon projects is highly unequal, and the likelihood of any individual project having a significant number of patrons is very low.

Second, for the month in question the distribution of the number of patrons does not follow a power law, but rather can be best modeled using a log-normal distribution.

Appendix

Caveats

This analysis is subject to the following caveat, among others:

The Graphtreon dataset does not contain Patreon projects that do not publicly report their number of patrons. If the likelihood of a project doing this is not uniform across all projects, this may skew the results, since the dataset would not necessarily be a representative sample of all Patreon projects.

References

Patreon project data was obtained from Graphtreon LLC as a basic CSV export for the month of December 2022, https://graphtreon.com/data-services.

The standard reference for assessing whether empirical data fits a power-law distribution is Aaron Clauset, Cosma Rohilla Shalizi, and M.E.J. Newman, “Power-law distributions in empirical data,” arXiv:0706.1062 [physics.data-an].

The poweRlaw R package for assessing whether a distribution fits a power law is described in C.S. Gillespie, “Fitting Heavy Tailed Distributions: The poweRlaw Package,” Journal of Statistical Software, 64(2), 1–16, http://www.jstatsoft.org/v64/i02/.

Environment

I used the following R environment in doing the analysis above:

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] poweRlaw_0.70.6   DescTools_0.99.47 forcats_0.5.2     stringr_1.5.0    
##  [5] dplyr_1.0.10      purrr_1.0.1       readr_2.1.3       tidyr_1.2.1      
##  [9] tibble_3.1.8      ggplot2_3.4.0     tidyverse_1.3.2  
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2            lubridate_1.9.0     bit64_4.0.5        
##  [4] httr_1.4.4          backports_1.4.1     bslib_0.4.2        
##  [7] utf8_1.2.2          R6_2.5.1            DBI_1.1.3          
## [10] colorspace_2.0-3    withr_2.5.0         tidyselect_1.2.0   
## [13] Exact_3.2           bit_4.0.5           compiler_4.2.1     
## [16] cli_3.6.0           rvest_1.0.3         expm_0.999-7       
## [19] xml2_1.3.3          labeling_0.4.2      sass_0.4.4         
## [22] scales_1.2.1        mvtnorm_1.1-3       proxy_0.4-27       
## [25] digest_0.6.31       rmarkdown_2.19      pkgconfig_2.0.3    
## [28] htmltools_0.5.4     dbplyr_2.2.1        fastmap_1.1.0      
## [31] highr_0.10          rlang_1.0.6         readxl_1.4.1       
## [34] rstudioapi_0.14     jquerylib_0.1.4     generics_0.1.3     
## [37] farver_2.1.1        jsonlite_1.8.4      vroom_1.6.0        
## [40] googlesheets4_1.0.1 magrittr_2.0.3      Matrix_1.5-3       
## [43] Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3        
## [46] lifecycle_1.0.3     stringi_1.7.12      yaml_2.3.6         
## [49] MASS_7.3-58.1       rootSolve_1.8.2.3   grid_4.2.1         
## [52] parallel_4.2.1      crayon_1.5.2        lmom_2.9           
## [55] lattice_0.20-45     haven_2.5.1         hms_1.1.2          
## [58] knitr_1.41          pillar_1.8.1        boot_1.3-28.1      
## [61] gld_2.6.6           reprex_2.0.2        glue_1.6.2         
## [64] evaluate_0.19       data.table_1.14.6   modelr_0.1.10      
## [67] vctrs_0.5.1         tzdb_0.3.0          cellranger_1.1.0   
## [70] gtable_0.3.1        assertthat_0.2.1    cachem_1.0.6       
## [73] xfun_0.36           broom_1.0.2         pracma_2.4.2       
## [76] e1071_1.7-12        class_7.3-20        googledrive_2.0.0  
## [79] gargle_1.2.1        timechange_0.2.0    ellipsis_0.3.2

Source code

The source code for this analysis can be found in the public code repository https://gitlab.com/frankhecker/misc-analysis in the patreon subdirectory.

This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.