Distribution of Earnings Among Patreon Projects Charging by the Month

Introduction

My goal in this analysis is to explore the distribution of earnings across Patreon projects that charge patrons by the month.

Put another way, I want to estimate the probability of a Patreon project earning more than a certain amount per month. For example, how likely is it that a random Patreon project earns more than $10 a month? More than $100? More than $1,000?

Clearly this probability decreases as the target amount increases: the probability of earning more than $100 per month is less than the probability of earning more than $10. But how can we quantify this? Is there a simple rule by which we can estimate this probability?

A common conception is that monthly earnings on Patreon and other “creator economy” services (e.g., Substack) are distributed according to a so-called “power-law” distribution. (For the mathematics behind a power-law distribution, see below.) One goal of mine in this analysis is to assess whether or not this is true.

For those readers not familiar with the R statistical software and the additional Tidyverse software I use to manipulate and plot data, check out the various ways to learn more about the Tidyverse.

Setup

I load the following R libraries, for the purposes listed:

tidyverse. Do general data manipulation and plotting.
tools. Compute MD5 checksums.
DescTools. Compute Gini coefficients.
poweRlaw. Work with power-law and other distributions.

library("tidyverse")
library("tools")
library("DescTools")
library("poweRlaw")

Preparing the data

Obtaining the Patreon data

I use a local copy of the Graphtreon-collected Patreon data for December 2022. This dataset contains an entry for every Patreon project for which the number of patrons is publicly reported.

Because the Graphtreon data is proprietary, I store it in a separate directory and do not make it available as part of this analysis. See the “References” section below for more information.

I check the MD5 hash value of the file, and stop if the contents are not as expected.

stopifnot(md5sum("../../graphtreon/graphtreonBasicExport_Dec2022.csv") == "98ff63f7d6aa3f2d1b2acaf40425ac9b")
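The same integrity-check pattern can be tried out without the proprietary file. The following is a minimal sketch that uses a throwaway temporary file in its place; the file contents and the variable names are assumptions made purely for illustration.

```r
library(tools)

# Write a small stand-in file; the real Graphtreon export is not redistributable.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Name,Earnings", "example,10"), tmp)

# Record the checksum once, then compare against it on later runs.
# unname() drops the file-path name that md5sum() attaches to its result.
expected_md5 <- unname(md5sum(tmp))
checksum_ok <- identical(unname(md5sum(tmp)), expected_md5)
stopifnot(checksum_ok)

# Changing the file changes the checksum, so the same check would now fail.
writeLines(c("Name,Earnings", "example,11"), tmp)
checksum_changed <- !identical(unname(md5sum(tmp)), expected_md5)
```

Hard-coding the expected hash, as I do above, means that a silently updated or truncated download stops the analysis instead of producing different numbers.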

Loading the Patreon data

I load the raw Patreon data from Graphtreon:

patreon_tb <- read_csv("../../graphtreon/graphtreonBasicExport_Dec2022.csv")

## Rows: 217861 Columns: 11
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr  (6): Name, Creation Name, Category, Pay Per, Patreon, Graphtreon
## dbl  (4): Patrons, Earnings, Is Nsfw, Twitter Followers
## dttm (1): Launched
## 
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

Analysis

Preliminary analysis

I do some basic exploratory data analysis, starting with the total amount of data in the dataset.

total_projects <- length(patreon_tb$Patrons)

There are a total of 217,861 projects listed in the Graphtreon data for the month in question. Note the word “projects” here, not “creators”: Patreon is organized by projects, and it’s possible that a given person may have more than one project active. It’s also possible that a given project may be associated with multiple people.

I suspect that the vast majority of Patreon projects are associated with one creator, and that the vast majority of people have only one project in which they participate. Unfortunately there’s no way of telling from the data at hand how true this is. I’ll therefore be careful in the terms I use, and will generally refer to “projects,” not “creators.”

Moving on to the actual data fields, there are three numeric variables of interest in the Graphtreon data:

the number of patrons for each Patreon project
the earnings for each project, for those projects that publicly report earnings
the number of Twitter followers of the Twitter account (if any) associated with the project

As noted above, my primary focus is on earnings, since making money is presumably why people start Patreon projects, and the promise of making money is the main selling point of the so-called “creator economy.”

I therefore focus in particular on projects that have nonzero reported earnings. Moreover, I focus on projects that charge their patrons monthly (as opposed to, say, per podcast or video) in order to compare like for like.

I start by looking at variables related to project earnings, looking for answers to the following questions:

How many projects did not publicly report their earnings?
Of those that did report, how many had zero earnings?
Of those that did have nonzero earnings, how many did not charge patrons on a monthly basis?

no_reported_earnings <- patreon_tb %>%
  filter(is.na(Earnings)) %>%
  summarize(n()) %>%
  as.integer()

reported_earnings <- total_projects - no_reported_earnings

zero_earnings <- patreon_tb %>%
  filter(!is.na(Earnings) & Earnings <= 0) %>%
  summarize(n()) %>%
  as.integer()

nonzero_earnings <- reported_earnings - zero_earnings

nonzero_nonmonthly_earnings <- patreon_tb %>%
  filter(!is.na(Earnings) & Earnings > 0) %>%
  filter(is.na(`Pay Per`) | `Pay Per` != "month") %>%
  summarize(n()) %>%
  as.integer()

nonzero_monthly_earnings <- nonzero_earnings - nonzero_nonmonthly_earnings

For the month in question there were a total of 217,861 Patreon projects in the Graphtreon dataset, of which 83,294 did not make their earnings public. This reduces the potential sample size down to 134,567 projects at best.

There were only 265 projects that reported zero earnings (as opposed to not publicly reporting earnings at all). Given the relatively small size of this group, I ignore it in the analysis. (This also simplifies doing log-log plots, as discussed below.)

There were only 5,369 projects that reported nonzero earnings and did not charge by the month. Again, given the relatively small size of this group, I ignore it as well.
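The four groups counted above are mutually exclusive, so the same breakdown can also be produced in a single pass. Here is a rough sketch on a made-up toy table; `toy_tb` and the group labels are illustrative assumptions, and only the column names `Earnings` and `Pay Per` come from the Graphtreon data.

```r
library(dplyr)

# Toy stand-in for the Graphtreon export; values are made up for illustration.
toy_tb <- tibble(
  Earnings = c(NA, 0, 12.5, 40, 300, NA, 7),
  `Pay Per` = c("month", "month", "month", "video", "month", NA, NA)
)

# Assign each project to exactly one group, mirroring the filters above.
# case_when() uses the first condition that is TRUE for each row.
group_counts <- toy_tb %>%
  mutate(Group = case_when(
    is.na(Earnings) ~ "not reported",
    Earnings <= 0 ~ "zero",
    is.na(`Pay Per`) | `Pay Per` != "month" ~ "nonzero, not monthly",
    TRUE ~ "nonzero, monthly"
  )) %>%
  count(Group)
```

Because every row lands in exactly one group, the counts necessarily sum to the number of rows, which is a useful sanity check on the subtraction-based totals.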

Projects with earnings from monthly charges

I now construct a sample dataset consisting of all projects reporting nonzero earnings from monthly charges for the month in question, ranked by the amount of earnings, from greatest to least.

by_earnings_tb <- patreon_tb %>%
  filter(!is.na(Earnings) & Earnings > 0) %>%
  filter(!is.na(`Pay Per`) & `Pay Per` == "month") %>%
  arrange(desc(Earnings)) %>%
  mutate(Earnings_Rank = row_number())

This sample dataset contains a total of 128,933 projects, representing 59% of all projects in the Graphtreon dataset.

Plotting earnings from monthly charges vs. earnings rank

Now that I have my dataset of interest, I can continue my exploratory data analysis, this time by plotting earnings from monthly charges as a function of rank (i.e., from those projects earning the most to those earning the least).

by_earnings_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  scale_x_continuous(breaks = c(25000, 50000, 75000, 100000, 125000), labels = scales::label_comma()) +
  scale_y_continuous(labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "Patreon Earnings vs. Earnings Rank",
    subtitle = "All Projects Reporting Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This is an extremely skewed distribution: for the month in question only a few top-ranked projects had significant earnings from monthly charges.

An alternative way of plotting such a highly skewed distribution is to plot both the $x$- and $y$-axes as logarithms of the underlying values (a so-called “log-log” plot). (This requires all values to be greater than zero, since the logarithm of zero is undefined.) Here is such a plot for earnings vs. earnings rank:

by_earnings_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(10, 100, 1000, 10000, 50000, 100000, 200000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(10, 100, 1000, 10000, 50000, 100000, 300000), labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "Patreon Earnings vs. Earnings Rank (Log-Log)",
    subtitle = "All Projects Reporting Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

If monthly earnings followed a power law, then the plot of earnings vs. rank on a log-log scale would be a straight line. Here the curve is clearly not straight: it bends downward in the right tail, among the projects with the lowest monthly earnings, indicating a departure from power-law behavior in that region. A similar but smaller deviation appears among the projects with the highest monthly earnings.
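To make the straight-line criterion concrete, here is a small sketch using idealized data rather than the Patreon data. The constants (100000, 0.8, and the cutoff scale of 300) are arbitrary assumptions chosen only to produce one pure power-law curve and one curve whose tail bends downward.

```r
# Idealized rank-earnings data: earnings fall off as a pure power of rank.
rank <- 1:1000
power_law_earnings <- 100000 * rank^(-0.8)

# On log-log axes a pure power law is exactly linear, with slope -0.8.
fit_pl <- lm(log10(power_law_earnings) ~ log10(rank))
slope_pl <- unname(coef(fit_pl)[2])

# Adding an exponential cutoff makes the low-earning tail bend downward,
# so a straight line describes the log-log curve less well.
cutoff_earnings <- power_law_earnings * exp(-rank / 300)

# For a one-predictor linear fit, R^2 equals the squared correlation.
r2_pl <- cor(log10(rank), log10(power_law_earnings))^2
r2_cut <- cor(log10(rank), log10(cutoff_earnings))^2
```

A visual check like the plot above is only suggestive; a curve can look roughly straight over part of its range without the underlying distribution being a power law.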

Measuring inequality of earnings from monthly charges

As shown in the previous section, the distribution of earnings from monthly charges among Patreon projects was highly unequal for the month in question. I now compute some example statistics to characterize this inequality:

minimum and maximum earnings from monthly charges
average earnings from monthly charges vs. median earnings
standard deviation of earnings from monthly charges
the percentage of total earnings from monthly charges associated with the top 0.1%, 1%, 10%, 25%, and 50% of projects in the dataset
the percentage of projects that earned more than $10 per month from monthly charges, more than $100 per month, more than $1,000 per month, or more than $10,000 per month
the Gini coefficient (also known as the Gini index), a widely-used measure of the level of inequality of income

mean_earnings <- mean(by_earnings_tb$Earnings)
sd_earnings <- sd(by_earnings_tb$Earnings)
median_earnings <- median(by_earnings_tb$Earnings)

top_point_1_pct = round(0.001 * nonzero_monthly_earnings)
top_1_pct = round(0.01 * nonzero_monthly_earnings)
top_10_pct = round(0.1 * nonzero_monthly_earnings)
top_25_pct = round(0.25 * nonzero_monthly_earnings)
top_50_pct = round(0.5 * nonzero_monthly_earnings)

total_earnings <- sum(by_earnings_tb$Earnings)
top_point_1_pct_share = sum(by_earnings_tb$Earnings[1:top_point_1_pct]) / total_earnings
top_1_pct_share = sum(by_earnings_tb$Earnings[1:top_1_pct]) / total_earnings
top_10_pct_share = sum(by_earnings_tb$Earnings[1:top_10_pct]) / total_earnings
top_25_pct_share = sum(by_earnings_tb$Earnings[1:top_25_pct]) / total_earnings
top_50_pct_share = sum(by_earnings_tb$Earnings[1:top_50_pct]) / total_earnings

frac_over_10 = sum(by_earnings_tb$Earnings > 10) / nonzero_monthly_earnings
frac_over_100 = sum(by_earnings_tb$Earnings > 100) / nonzero_monthly_earnings
frac_over_1000 = sum(by_earnings_tb$Earnings > 1000) / nonzero_monthly_earnings
frac_over_10000 = sum(by_earnings_tb$Earnings > 10000) / nonzero_monthly_earnings

gini_earnings <- Gini(by_earnings_tb$Earnings)

For the month in question the average earnings from monthly charges per project was $180 (with a standard deviation of $1,410), while the median earnings per project was $25. The median being roughly one-seventh of the mean reflects the top-ranked projects having a disproportionate share of earnings from monthly charges.

More specifically, for the month in question:

The top 0.1% of projects had 17% of the total earnings from monthly charges.
The top 1% of projects had 40% of the total earnings.
The top 10% of projects had 77% of the total earnings.
The top 25% of projects had 91% of the total earnings.
The top 50% of projects had 98% of the total earnings.
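As a worked illustration of how such shares are computed, here is a sketch on ten made-up earnings values. The helper `top_share()` is hypothetical and not part of the analysis code; it uses `head()` rather than `1:n` indexing so that a top group of size zero yields a share of zero instead of an indexing surprise.

```r
# Share of the total held by the top fraction of a vector of earnings.
top_share <- function(earnings, top_fraction) {
  sorted <- sort(earnings, decreasing = TRUE)
  n_top <- round(top_fraction * length(sorted))
  sum(head(sorted, n_top)) / sum(sorted)
}

# Made-up earnings for ten projects (they sum to 1,000), for illustration only.
toy_earnings <- c(500, 200, 100, 50, 50, 40, 30, 15, 10, 5)
share_top_10_pct <- top_share(toy_earnings, 0.10)  # top 1 of 10 projects: 0.5
share_top_50_pct <- top_share(toy_earnings, 0.50)  # top 5 of 10 projects: 0.9
```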

Turning now to the proportion of projects earning more than a certain amount in monthly charges for the month in question:

68% of projects earned more than $10 per month.
24% of projects earned more than $100 per month.
3% of projects earned more than $1,000 per month.
0.1% of projects earned more than $10,000 per month.
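These percentages are empirical exceedance fractions: the share of projects whose earnings are strictly greater than a threshold. A minimal sketch on made-up data follows; `frac_over()` is a hypothetical helper, not a function from the analysis.

```r
# Fraction of projects earning strictly more than each threshold.
frac_over <- function(earnings, thresholds) {
  vapply(thresholds, function(t) mean(earnings > t), numeric(1))
}

# Ten made-up earnings values, for illustration only.
toy_earnings <- c(0.5, 3, 8, 10, 25, 60, 150, 400, 1200, 20000)
thresholds <- c(10, 100, 1000, 10000)

# A project earning exactly $10 does not count as earning more than $10.
toy_fracs <- frac_over(toy_earnings, thresholds)  # 0.6, 0.4, 0.2, 0.1
```

The same quantity can be obtained as `1 - ecdf(toy_earnings)(thresholds)`, since the empirical CDF gives the fraction of values less than or equal to each threshold.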

The Gini coefficient associated with earnings from monthly charges per project is 0.84. A Gini coefficient value of 0 corresponds to completely equal shares of income, and a value of 1 to the most unequal distribution. The measured value corresponds to a very unequal distribution of earnings from monthly charges, consistent with the other statistics.
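To show what the coefficient measures, here is a sketch that computes it directly from sorted values. The helper `gini_by_hand()` is hypothetical and applies no small-sample correction, so it may differ slightly from a library routine that applies one.

```r
# Sample Gini coefficient computed directly from sorted values:
#   G = sum((2 * i - n - 1) * x_i) / (n * sum(x)), with x sorted ascending.
gini_by_hand <- function(x) {
  x <- sort(x)
  n <- length(x)
  sum((2 * seq_len(n) - n - 1) * x) / (n * sum(x))
}

gini_equal <- gini_by_hand(c(10, 10, 10, 10))  # everyone earns the same: 0
gini_unequal <- gini_by_hand(c(0, 0, 0, 100))  # one project earns everything: 0.75
```

Without a correction, the maximum value for n projects is (n - 1)/n, which approaches 1 as n grows; with only four projects in the toy example it is 0.75.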

By comparison, based on data from Wikipedia the country with the greatest income inequality in the world is South Africa, where an advanced urban economy coexists with vast swaths of poverty. South Africa’s Gini coefficient is 0.63. As a further comparison, the Gini coefficient for the United States is 0.41, and the Gini coefficients for the various Scandinavian countries range from 0.26 to 0.29.

Finally, for the month in question the total earnings from monthly charges for all projects combined was $23,240,210. This, of course, is the number that helped drive Patreon’s overall revenue and profits for 2022, but it has no relevance for any individual project.

Life in Patreonia

To get a better feel for how earnings were distributed among Patreon projects for the month in question, I now look at four subsets of the total sample of 128,933 projects reporting nonzero earnings from monthly charges, each 10 times larger than the last. For convenience in referring to them I give them evocative names:

Patreon Heights: the top 100 projects ranked by earnings from monthly charges
Patreon Grove: the next 1,000 projects ranked by earnings (i.e., in positions 101 through 1,100)
Patreonville: the next 10,000 projects ranked by earnings (i.e., in positions 1,101 through 11,100)
the rest of Patreonia: the next 100,000 projects ranked by earnings (i.e., in positions 11,101 through 111,100)

These subsets combined accounted for 86% of the projects reporting nonzero earnings from monthly charges.

For each subset I compute the following quantities and then do a log-log plot of earnings vs. rank:

maximum, minimum, mean (with standard deviation), and median earnings from monthly charges for the subset
Gini coefficient of earnings for the subset
total earnings for the subset as a whole, both in absolute terms and as a percentage of total earnings across all projects with earnings from monthly charges
maximum, minimum, mean (with standard deviation), and median number of patrons for the subset
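Since the same quantities are computed for every subset, the calculation can be wrapped in one function. The following is a sketch only: `summarize_rank_range()` is a hypothetical helper, the toy table stands in for `by_earnings_tb`, and the Gini coefficient is left out to keep the example free of extra packages.

```r
library(dplyr)

# Summarize earnings and patrons for projects ranked in (lower, upper].
summarize_rank_range <- function(tb, lower, upper, total_earnings) {
  tb %>%
    filter(Earnings_Rank > lower & Earnings_Rank <= upper) %>%
    summarize(
      projects = n(),
      max_earnings = max(Earnings),
      min_earnings = min(Earnings),
      mean_earnings = mean(Earnings),
      sd_earnings = sd(Earnings),
      median_earnings = median(Earnings),
      total_subset_earnings = sum(Earnings),
      pct_of_total = 100 * sum(Earnings) / total_earnings,
      max_patrons = max(Patrons),
      min_patrons = min(Patrons),
      mean_patrons = mean(Patrons),
      sd_patrons = sd(Patrons),
      median_patrons = median(Patrons)
    )
}

# Made-up data, already sorted by earnings, standing in for by_earnings_tb.
toy_tb <- tibble(
  Earnings = c(900, 500, 300, 120, 80, 50, 30, 20),
  Patrons = c(200, 90, 60, 30, 10, 12, 5, 4)
) %>%
  mutate(Earnings_Rank = row_number())

toy_total <- sum(toy_tb$Earnings)
top_two <- summarize_rank_range(toy_tb, 0, 2, toy_total)
next_four <- summarize_rank_range(toy_tb, 2, 6, toy_total)
```

Each call returns a one-row table, so the results for several rank ranges could be stacked with `bind_rows()` for side-by-side comparison.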

Patreon Heights (Top 100)

I calculate the quantities above for the top 100 projects by reported earnings from monthly charges:

ph_tb <- by_earnings_tb %>%
  filter(Earnings_Rank > 0 & Earnings_Rank <= 100)

mean_ph_earnings <- mean(ph_tb$Earnings)
sd_ph_earnings <- sd(ph_tb$Earnings)
median_ph_earnings <- median(ph_tb$Earnings)
min_ph_earnings <- min(ph_tb$Earnings)
max_ph_earnings <- max(ph_tb$Earnings)
gini_ph_earnings <- Gini(ph_tb$Earnings)
total_ph_earnings <- sum(ph_tb$Earnings)
pct_ph_earnings <- (100. * total_ph_earnings) / total_earnings

mean_ph_patrons <- mean(ph_tb$Patrons)
sd_ph_patrons <- sd(ph_tb$Patrons)
median_ph_patrons <- median(ph_tb$Patrons)
min_ph_patrons <- min(ph_tb$Patrons)
max_ph_patrons <- max(ph_tb$Patrons)

For the month in question the resulting values for earnings from monthly charges for the top 100 projects were as follows:

Maximum: $183,912
Minimum: $15,112
Mean: $35,401 (with a standard deviation of $30,285)
Median: $24,490
Gini coefficient: 0.37
Total earnings: $3,540,114 (15% of total for all projects)

The resulting values for the number of patrons for the top 100 projects were as follows:

Maximum: 37,391
Minimum: 263
Mean: 6,638 (with a standard deviation of 7,339)
Median: 3,876

I next do a log-log plot of earnings vs. rank for the top 100 projects ranked by earnings from monthly charges:

ph_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(5, 10, 25, 50, 100), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(25000, 50000, 100000, 150000, 200000), labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "“Patreon Heights” Earnings vs. Earnings Rank (Log-Log)",
    subtitle = "Top 100 Projects Reporting Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

Even in this highest-earning subset of projects we see the phenomenon that earnings dropped rapidly for lower-ranked projects.

Patreon Grove (Next 1,000)

I calculate the quantities above for the next 1,000 projects by reported earnings from monthly charges:

pg_tb <- by_earnings_tb %>%
  filter(Earnings_Rank > 100 & Earnings_Rank <= 1100)

mean_pg_earnings <- mean(pg_tb$Earnings)
sd_pg_earnings <- sd(pg_tb$Earnings)
median_pg_earnings <- median(pg_tb$Earnings)
min_pg_earnings <- min(pg_tb$Earnings)
max_pg_earnings <- max(pg_tb$Earnings)
gini_pg_earnings <- Gini(pg_tb$Earnings)
total_pg_earnings <- sum(pg_tb$Earnings)
pct_pg_earnings <- (100. * total_pg_earnings) / total_earnings

mean_pg_patrons <- mean(pg_tb$Patrons)
sd_pg_patrons <- sd(pg_tb$Patrons)
median_pg_patrons <- median(pg_tb$Patrons)
min_pg_patrons <- min(pg_tb$Patrons)
max_pg_patrons <- max(pg_tb$Patrons)

For the month in question the resulting values for earnings from monthly charges for the next 1,000 projects were as follows:

Maximum: $15,057
Minimum: $2,737
Mean: $5,181 (with a standard deviation of $2,597)
Median: $4,199
Gini coefficient: 0.25
Total earnings: $5,181,396 (22% of total for all projects)

The resulting values for the number of patrons for the next 1,000 projects were as follows:

Maximum: 11,197
Minimum: 6
Mean: 1,001 (with a standard deviation of 924)
Median: 772

I next do a log-log plot of earnings vs. rank for the next 1,000 projects ranked by earnings from monthly charges:

pg_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(150, 250, 500, 750, 1000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(5000, 7500, 10000, 15000), labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "“Patreon Grove” Earnings vs. Earnings Rank (Log-Log)",
    subtitle = "Projects Ranked 101-1,100 in Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This shows a similar drop-off in earnings as in the first subset.

Patreonville (Next 10,000)

I calculate the quantities above for the next 10,000 projects by earnings:

pv_tb <- by_earnings_tb %>%
  filter(Earnings_Rank > 1100 & Earnings_Rank <= 11100)

mean_pv_earnings <- mean(pv_tb$Earnings)
sd_pv_earnings <- sd(pv_tb$Earnings)
median_pv_earnings <- median(pv_tb$Earnings)
min_pv_earnings <- min(pv_tb$Earnings)
max_pv_earnings <- max(pv_tb$Earnings)
gini_pv_earnings <- Gini(pv_tb$Earnings)
total_pv_earnings <- sum(pv_tb$Earnings)
pct_pv_earnings <- (100. * total_pv_earnings) / total_earnings

mean_pv_patrons <- mean(pv_tb$Patrons)
sd_pv_patrons <- sd(pv_tb$Patrons)
median_pv_patrons <- median(pv_tb$Patrons)
min_pv_patrons <- min(pv_tb$Patrons)
max_pv_patrons <- max(pv_tb$Patrons)

For the month in question the resulting values for earnings from monthly charges for the next 10,000 projects were as follows:

Maximum: $2,735
Minimum: $355
Mean: $852 (with a standard deviation of $527)
Median: $656
Gini coefficient: 0.32
Total earnings: $8,522,521 (37% of total for all projects)

The resulting values for the number of patrons for the next 10,000 projects were as follows:

Maximum: 1,879
Minimum: 1
Mean: 157 (with a standard deviation of 158)
Median: 109

I next do a log-log plot of earnings vs. rank for the next 10,000 projects ranked by earnings from monthly charges:

pv_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(2500, 5000, 7500, 10000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(500, 1000, 2000, 3000), labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "“Patreonville” Earnings vs. Earnings Rank (Log-Log)",
    subtitle = "Projects Ranked 1,101-11,100 in Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This shows a similar drop-off in earnings as in the first two subsets.

The Rest of Patreonia (Next 100,000)

I calculate the quantities above for the next 100,000 projects by earnings:

rop_tb <- by_earnings_tb %>%
  filter(Earnings_Rank > 11100 & Earnings_Rank <= 111100)

mean_rop_earnings <- mean(rop_tb$Earnings)
sd_rop_earnings <- sd(rop_tb$Earnings)
median_rop_earnings <- median(rop_tb$Earnings)
min_rop_earnings <- min(rop_tb$Earnings)
max_rop_earnings <- max(rop_tb$Earnings)
gini_rop_earnings <- Gini(rop_tb$Earnings)
total_rop_earnings <- sum(rop_tb$Earnings)
pct_rop_earnings <- (100. * total_rop_earnings) / total_earnings

mean_rop_patrons <- mean(rop_tb$Patrons)
sd_rop_patrons <- sd(rop_tb$Patrons)
median_rop_patrons <- median(rop_tb$Patrons)
min_rop_patrons <- min(rop_tb$Patrons)
max_rop_patrons <- max(rop_tb$Patrons)

For the month in question the resulting values for earnings from monthly charges for the next 100,000 projects were as follows:

Maximum: $355
Minimum: $4
Mean: $60 (with a standard deviation of $73)
Median: $28
Gini coefficient: 0.58
Total earnings: $5,961,894 (26% of total for all projects)

The resulting values for the number of patrons for the next 100,000 projects were as follows:

Maximum: 10,913
Minimum: 1
Mean: 11 (with a standard deviation of 39)
Median: 5

I next do a log-log plot of earnings vs. rank for the next 100,000 projects ranked by earnings from monthly charges:

rop_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(25000, 50000, 75000, 100000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(10, 25, 50, 100, 200, 300, 400), labels = scales::label_dollar()) +
  xlab("Earnings Rank") +
  ylab("Earnings") +
  labs(
    title = "“Rest of Patreonia” Earnings vs. Earnings Rank (Log-Log)",
    subtitle = "Projects Ranked 11,101-111,100 in Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This shows a drop-off for lower-ranked projects, as did the plots for the first three subsets, but there is deviation from the straight-line behavior for the lowest-ranked projects in this subset.

How are earnings and number of patrons related?

One obvious way for a Patreon project to have higher earnings is to have more patrons. But that’s not the only way; in particular, a project could increase the amount of money it gets from each patron, for example, because relatively more patrons are in higher membership tiers. Which factor is more important for the Patreon projects in my sample dataset?

To investigate this, I start by calculating the correlation coefficient between the number of patrons and earnings across all the projects with nonzero earnings from monthly charges for the month in question. (More specifically, this is Pearson’s $r$, where I consider the number of patrons to be the independent variable and earnings to be the dependent variable.)

cor_p_e <- cor(by_earnings_tb$Patrons, by_earnings_tb$Earnings)

The resulting correlation coefficient $r$ is 0.86. This is a positive number, since in general projects with more patrons have higher earnings, and it is a reasonably strong correlation but not perfect. (If earnings were perfectly correlated with the number of patrons then the correlation coefficient would be 1.) Looking at $r^2$ instead, I note that about 74% of the variance in earnings is explained by the variance in the number of patrons.
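The link between $r$ and “variance explained” can be checked on synthetic data: for a linear fit with a single predictor, the squared Pearson correlation equals the $R^2$ of the fit. In this sketch the $5-per-patron slope, the noise level, and the sample size are arbitrary assumptions, not estimates from the Patreon data.

```r
# Synthetic data: earnings roughly proportional to patrons, plus noise.
set.seed(42)
patrons <- rpois(500, lambda = 50) + 1
earnings <- 5 * patrons + rnorm(500, sd = 40)

r <- cor(patrons, earnings)  # cor() computes Pearson's r by default

# Fit earnings as a linear function of patrons and extract R^2.
fit <- lm(earnings ~ patrons)
r_squared_from_fit <- summary(fit)$r.squared
```

Note that Pearson’s $r$ measures linear association only and can be strongly influenced by a few very large projects in a skewed dataset like this one.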

To get a better feel for how the number of patrons varies according to the amount of earnings, I plot each project’s number of patrons against the project’s earnings rank:

by_earnings_tb %>%
  ggplot(mapping=aes(x = Earnings_Rank, y = Patrons)) +
  geom_point() +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(breaks = c(10, 100, 500, 1000, 5000, 10000, 50000, 100000, 200000), labels = scales::label_comma()) +
  scale_y_continuous(breaks = c(1, 5, 10, 25, 100, 250, 1000, 2500, 10000, 25000, 50000), labels = scales::label_comma()) +
  xlab("Earnings Rank") +
  ylab("Number of Patrons") +
  labs(
    title = "Number of Patrons vs. Earnings Rank (Log-Log)",
    subtitle = "All Projects with Nonzero Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

This plot shows two things: First, there are projects with the same number of patrons that have wildly different earnings. For example, consider the dark horizontal line at the lower right, which represents projects that have only a single patron. There are such projects that are in the top 5,000 projects by earnings, and others that rank below 100,000 in earnings.

Second, projects with very similar earnings can realize those earnings from patron bases of wildly different sizes. For example, looking at projects around the 1,000 mark in earnings rank, there are projects that rely on as few as 10 or so patrons, and others that require over a thousand patrons to realize the same level of earnings.

To explore this question further, I take the set of all projects with nonzero earnings from monthly charges for the month in question and add a new column showing earnings per patron. I then calculate some basic statistics for that measure.

by_earnings_tb <- by_earnings_tb %>%
  mutate(EPP = Earnings / Patrons)

max_epp = max(by_earnings_tb$EPP)
min_epp = min(by_earnings_tb$EPP)
mean_epp = mean(by_earnings_tb$EPP)
sd_epp = sd(by_earnings_tb$EPP)
median_epp = median(by_earnings_tb$EPP)

For the month in question, the maximum earnings per patron from monthly charges was $1,197.04, the minimum earnings per patron was $0.002, the mean (average) earnings per patron was $6.83 (with a standard deviation of $12.01), and the median earnings per patron was $4.50.

I also plot a histogram of earnings per patron to see its distribution. Only 694 projects had over $50 in earnings per patron, so I don’t bother extending the histogram beyond that point. The solid orange line shows the mean earnings per patron, and the dashed green line the median.

by_earnings_tb %>%
  filter(EPP <= 50) %>%
  ggplot(mapping=aes(x = EPP)) +
  geom_histogram(binwidth = 1) +
  geom_vline(xintercept = mean_epp, color = "#E69F00") +
  geom_vline(xintercept = median_epp, color = "#009E73", linetype = "dashed") +
  scale_x_continuous(labels = scales::label_dollar()) +
  scale_y_continuous(labels = scales::label_comma()) +
  xlab("Earnings Per Patron") +
  ylab("Number of Projects") +
  labs(
    title = "Distribution of Earnings Per Patron",
    subtitle = "All Projects with Nonzero Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

From the graph it’s apparent that most projects earn $10 or less per patron; only 19,784 projects (15% of all projects with nonzero earnings from monthly charges) earned more than that per patron.

Do Patreon earnings follow a power-law distribution?

It is very common for people to talk about services like Patreon, Substack, Spotify, etc., as being characterized by a “power-law” distribution. This is typically shorthand for the fact that on such services only a few creators realize significant earnings, with earnings rapidly dropping off once you get beyond those in the top rankings.

However, just because the distribution of earnings exhibits rapid drop-off (as in the graphs shown above), it doesn’t necessarily follow that the distribution is truly a power-law distribution. In this section I do some tests to assess whether Patreon earnings for the month in question follow a power law or not.

Mathematics of a power-law distribution

A power-law distribution is a particular type of probability distribution. Assume that earnings from a service like Patreon are in whole dollars only; i.e., the earnings can take only discrete values (e.g., $1, $19, $117, $1,729, etc.). If such earnings followed a discrete power-law distribution then the probability $p(x)$ of earning exactly $x$ dollars would drop off based on the value of $x$ raised to the negative power of a scaling factor $\alpha$:

\[p(x) = Pr(X = x) = Cx^{-\alpha}\]

Here $X$ is the observed earnings and $C$ is a normalization constant to make the probabilities across all possible values of earnings sum to 1. As $x$ increases $p(x)$ decreases, and for very large values of $x$ approaches zero.
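As a worked example of the normalization, assume purely for illustration that the possible earnings are the whole-dollar amounts $x = 1, 2, 3, \ldots$. Requiring the probabilities to sum to 1 then fixes $C$:

\[\sum_{x=1}^{\infty} Cx^{-\alpha} = 1 \quad\Longrightarrow\quad C = \frac{1}{\sum_{x=1}^{\infty} x^{-\alpha}} = \frac{1}{\zeta(\alpha)}\]

Here $\zeta$ is the Riemann zeta function. The sum converges only for $\alpha > 1$, so a power-law distribution of this kind requires a scaling factor greater than 1. For instance, with the arbitrary choice $\alpha = 2$ we have $\zeta(2) = \pi^2/6$, so $C = 6/\pi^2 \approx 0.608$, giving $p(1) \approx 0.608$, $p(2) \approx 0.152$, and $p(10) \approx 0.006$.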

The above is a simplification, in two respects. First, there must be some minimum value $x_\textrm{min} > 0$ below which the power-law behavior does not hold, since $x^{-\alpha}$ diverges as $x$ approaches 0.

Second, in practice Patreon earnings can have a fractional part; for example, a project might earn $43.57 per month. Thus they are arguably better analyzed as potentially having a continuous power-law distribution. Such a distribution has a different definition of $p(x)$—but still one that depends on the scaling factor $\alpha$ used as a negative power of $x$.

Since $x$ could in theory be any real number, it doesn’t make sense to speak of the probability of the observed earnings $X$ being exactly equal to $x$. Instead we look at the probability that $X$ falls in a small interval of width $dx$ starting at $x$, expressed in terms of a probability density function $p(x)$:

\[p(x) dx = Pr(x \le X < x + dx) = Cx^{-\alpha}dx\]

The probability density function $p(x)$ can then be expressed as

\[p(x) = \frac{\alpha -1}{x_\textrm{min}} \left( \frac{x}{x_\textrm{min}} \right)^{-\alpha}\]

If $x_\textrm{min} = 1$ (see below) then this reduces to $p(x) = (\alpha -1)x^{-\alpha}$.
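The constant $(\alpha - 1)/x_\textrm{min}$ comes from requiring the density to integrate to 1 over all values from $x_\textrm{min}$ upward. The following is a minimal sketch in base R that checks this numerically, and also checks the closed form $(x/x_\textrm{min})^{-(\alpha - 1)}$ for the probability of exceeding a given value. The function name dpowerlaw and the value of alpha are illustrative assumptions, not estimates from the Patreon data.

```r
# Continuous power-law density with lower cut-off x_min (valid for alpha > 1).
# dpowerlaw is a hypothetical helper name used only for this illustration.
dpowerlaw <- function(x, alpha, x_min = 1) {
  ifelse(x >= x_min, (alpha - 1) / x_min * (x / x_min)^(-alpha), 0)
}

alpha <- 2.5   # illustrative value, not an estimate from the Patreon data
x_min <- 1

# The density should integrate to 1 over [x_min, Inf).
total_prob <- integrate(dpowerlaw, lower = x_min, upper = Inf,
                        alpha = alpha, x_min = x_min)$value

# Pr(X > x) computed numerically and from the closed form.
x <- 10
ccdf_numeric <- integrate(dpowerlaw, lower = x, upper = Inf,
                          alpha = alpha, x_min = x_min)$value
ccdf_closed <- (x / x_min)^(-(alpha - 1))

c(total_prob = total_prob, ccdf_numeric = ccdf_numeric, ccdf_closed = ccdf_closed)
```

The closed form shows why a power law appears as a straight line when the probability of exceeding $x$ is plotted against $x$ on log-log axes: the slope is $-(\alpha - 1)$.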

Fitting distributions to the data

I attempt to fit a power-law distribution to the entire sample dataset of monthly earnings for all 128,933 Patreon projects that reported nonzero earnings from monthly charges. Since earnings can be fractional, I attempt to fit a continuous power-law distribution. I also attempt to fit a continuous exponential distribution and a continuous log-normal distribution, to see if either of those provides a better fit than a power-law distribution.

The first step is to create models for all three distributions, using as input the entire sample dataset of projects with nonzero earnings from monthly charges for the month in question.

m_pl <- conpl(by_earnings_tb$Earnings)
m_exp <- conexp(by_earnings_tb$Earnings)
m_lnorm <- conlnorm(by_earnings_tb$Earnings)

I now need to estimate parameters for each of the models. In particular, I need an estimate for $x_\textrm{min}$, the cut-off point below which the models do not apply. There are two ways to do this. The first and better way is to use the estimate_xmin() function with each model (power-law, exponential, and log-normal) to try to find the best value of $x_\textrm{min}$, one that will provide the best model fit.

Unfortunately, this is very time-consuming to do for a dataset with over 100,000 entries with values up to almost $200,000. Doing this for just one distribution took several hours on a fairly new laptop.

The alternate approach is simply to specify an arbitrary value of $x_\textrm{min}$. For example, in the case of Patreon using the value $x_\textrm{min} = 1$ makes intuitive sense, since $1 per month is the lowest membership tier for a lot of projects, and only 6,536 projects (5% of those reporting earnings from monthly charges) earn less than $1 a month. This will likely not give us the best fit, but the model will cover more of the overall dataset.

Setting $x_\textrm{min}$ to an arbitrary value also simplifies comparing the different models, since they must have the same $x_\textrm{min}$ value in order to do the comparison in a more rigorous way than simply inspecting curves on plots.

Therefore I next estimate the parameters for the three models using the arbitrary value $x_\textrm{min} = 1$. I also need to specify a larger value for $x_\textrm{max}$, since by default the estimate_xmin() function doesn’t look at data values higher than 100,000 (the default value of its xmax argument).

The parameters returned by estimate_xmin() are then plugged back into the models.

m_pl_est <- estimate_xmin(m_pl, xmins = 1, xmax = 200000)
m_pl$setXmin(m_pl_est)

m_exp_est <- estimate_xmin(m_exp, xmins = 1, xmax = 200000)
m_exp$setXmin(m_exp_est)

m_lnorm_est <- estimate_xmin(m_lnorm, xmins = 1, xmax = 200000)
m_lnorm$setXmin(m_lnorm_est)
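To give a sense of what fitting with a fixed $x_\textrm{min}$ involves, here is a minimal sketch in base R of the closed-form maximum-likelihood estimate of $\alpha$ for a continuous power law, as given in the Clauset, Shalizi, and Newman paper listed in the references. It uses synthetic data generated with a hypothetical exponent, and is an illustration of the technique rather than the poweRlaw package's own code.

```r
set.seed(42)

alpha_true <- 2.5   # hypothetical exponent used to generate synthetic data
x_min <- 1
n <- 10000

# Inverse-transform sampling: if U is uniform on (0, 1), then
# x_min * (1 - U)^(-1 / (alpha - 1)) follows a continuous power law.
u <- runif(n)
x <- x_min * (1 - u)^(-1 / (alpha_true - 1))

# Closed-form maximum-likelihood estimate of alpha for a fixed x_min,
# using only the values at or above the cut-off.
tail_x <- x[x >= x_min]
alpha_hat <- 1 + length(tail_x) / sum(log(tail_x / x_min))

# Approximate standard error of the estimate.
alpha_se <- (alpha_hat - 1) / sqrt(length(tail_x))

c(alpha_hat = alpha_hat, alpha_se = alpha_se)
```

With $x_\textrm{min}$ fixed, no search over candidate cut-off values is needed, which is why this approach is so much faster than letting estimate_xmin() look for the best value.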

I can now plot the so-called complementary cumulative distribution function (“ccdf”) of the data, along with the curves of best fit from the three models. The ccdf for a value $x$ gives the probability that an observed value $X$ will be greater than $x$: $Pr(X \gt x)$. On the other hand, the cumulative distribution function gives the probability that $X$ is less than or equal to $x$, or $Pr(X \le x)$. We thus have $Pr(X \gt x) = 1 - Pr(X \le x)$.
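As a small illustration of the relationship between the two functions, the following sketch computes an empirical ccdf in base R. The vector x is synthetic data standing in for earnings, and ccdf_fn is a hypothetical helper name.

```r
set.seed(1)
x <- rlnorm(1000, meanlog = 3, sdlog = 2)   # synthetic stand-in for earnings

# Empirical cumulative distribution function: Pr(X <= q).
cdf_fn <- ecdf(x)

# Empirical ccdf: Pr(X > q) = 1 - Pr(X <= q).
ccdf_fn <- function(q) 1 - cdf_fn(q)

# The same probability computed directly as the share of values above 100.
ccdf_at_100 <- ccdf_fn(100)
direct_at_100 <- mean(x > 100)

c(ccdf_at_100 = ccdf_at_100, direct_at_100 = direct_at_100)
```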

The poweRlaw R package provides plotting methods for its models, so in the interest of simplicity I use the plot() and lines() functions to create the plot rather than ggplot(). The plot() function plots the ccdf of the underlying earnings data. The lines() function then adds the fitted curves for the power-law distribution (green), the exponential distribution (blue), and the log-normal distribution (orange).

plot(m_pl, xlab = "Earnings ($)", ylab = "CCDF of Earnings", main = "Patreon Earnings and Fitted Distributions", sub = "All Projects Reporting Nonzero Earnings from Monthly Charges", col = "#000000")
lines(m_pl, col = "#009E73", lwd = 2)
lines(m_exp, col = "#56B4E9", lwd = 2)
lines(m_lnorm, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","exponential", "log-normal"), fill = c("#009E73","#56B4E9", "#E69F00"))

Based on the above plot, it appears that the log-normal distribution is a much better fit to the data than either the power-law or exponential distributions. However, the fit begins to break down for earnings above $1,000 per month; above that point the probability of earning more than a given amount appears to be somewhat greater than that given by the log-normal distribution. (Above that point we are also dealing with very small probabilities.)

I can confirm that the log-normal distribution is a better fit than the power-law and exponential distributions by using the compare_distributions() function.

lnorm_vs_pl_one_sided <- compare_distributions(m_lnorm, m_pl)$p_one_sided
lnorm_vs_exp_one_sided <- compare_distributions(m_lnorm, m_exp)$p_one_sided

The one-sided p-value tests whether the first distribution is a better fit than the second. In this case the one-sided p-value is 0 when comparing the log-normal distribution to the power-law distribution and 0 when comparing the log-normal distribution to the exponential distribution. These p-values indicate that the log-normal distribution is clearly a better fit than either of the other distributions.
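My understanding is that compare_distributions() is based on Vuong's likelihood-ratio test. The following is a simplified sketch of that idea in base R, comparing a log-normal fit to an exponential fit on synthetic data. It ignores the $x_\textrm{min}$ cut-off and fits the distributions directly, so it illustrates the technique under those assumptions rather than reproducing the package's calculation.

```r
set.seed(123)
# Synthetic data, log-normal by construction (not the Patreon data).
x <- rlnorm(5000, meanlog = 3, sdlog = 2)

# Maximum-likelihood fits, with no x_min truncation in this simplified sketch.
mu_hat <- mean(log(x))
sigma_hat <- sqrt(mean((log(x) - mu_hat)^2))
rate_hat <- 1 / mean(x)

# Pointwise log-likelihoods under each fitted model.
ll_lnorm <- dlnorm(x, meanlog = mu_hat, sdlog = sigma_hat, log = TRUE)
ll_exp <- dexp(x, rate = rate_hat, log = TRUE)

# Vuong statistic: the mean log-likelihood difference, scaled by its
# standard deviation and the square root of the sample size.
d <- ll_lnorm - ll_exp
vuong_stat <- sqrt(length(d)) * mean(d) / sd(d)

# One-sided p-value: small values favor the first model (log-normal).
p_one_sided <- pnorm(vuong_stat, lower.tail = FALSE)

c(vuong_stat = vuong_stat, p_one_sided = p_one_sided)
```

A large positive statistic means the first model assigns consistently higher likelihood to the observed values, which drives the one-sided p-value toward zero.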

Predicted probabilities vs. observed probabilities

The log-normal distribution (as its name might imply) is related to the normal distribution (sometimes referred to as the Gaussian distribution). More specifically, per Wikipedia

a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable $X$ is log-normally distributed, then $Y = \ln(X)$ has a normal distribution.

The probability density function for a log-normal distribution is

\[p(x) = \frac{1}{x\sigma\sqrt{2\pi}}\textrm{exp}\left( -\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)\]

where $\textrm{exp}(x) = e^x$ and $\mu$ and $\sigma$ are the parameters determining the exact form of the distribution. More specifically, if $X$ is a random variable described by a log-normal distribution, then $\ln(X)$ is a normally-distributed random variable, and $\mu$ and $\sigma$ are the mean and standard deviation of that normal distribution.

In our case $\mu$ and $\sigma$ are given by the fitted model:

lnorm_mu <- m_lnorm$pars[1]
lnorm_sigma <- m_lnorm$pars[2]

The value of $\mu$ is approximately 3.33 and the value of $\sigma$ is approximately 1.84.

I can use the values as input to the plnorm() function to estimate the probability of a Patreon project earning more than a certain amount per month in monthly charges:

prob_over_10 <- plnorm(10, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_100 <- plnorm(100, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_1000 <- plnorm(1000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_10000 <- plnorm(10000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)

The estimated probabilities are as follows, with the observed probabilities in parentheses:

0.71 estimated probability of earning more than $10 (observed 0.68)
0.24 estimated probability of earning more than $100 (observed 0.24)
0.026 estimated probability of earning more than $1,000 (observed 0.03)
0.001 estimated probability of earning more than $10,000 (observed 0.001)

As is apparent from the values above, a log-normal distribution does a reasonably good job of fitting the observed Patreon data for the month in question.
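For readers without access to the data, the estimated probabilities can be approximately reproduced from the rounded parameter values quoted above. The sketch below does that, and also shows one way an observed probability can be computed, namely as the share of values exceeding a threshold. The earnings vector here is simulated and is only a stand-in for the real data.

```r
# Rounded parameter values reported in the text.
mu <- 3.33
sigma <- 1.84

thresholds <- c(10, 100, 1000, 10000)

# Estimated Pr(X > threshold) under the fitted log-normal distribution.
estimated <- plnorm(thresholds, meanlog = mu, sdlog = sigma, lower.tail = FALSE)

# Observed Pr(X > threshold): the share of values above each threshold.
# `earnings` is a synthetic stand-in, not the Graphtreon data.
set.seed(2022)
earnings <- rlnorm(100000, meanlog = mu, sdlog = sigma)
observed <- sapply(thresholds, function(thr) mean(earnings > thr))

round(data.frame(thresholds, estimated, observed), 3)
```

Because the simulated values are log-normal by construction, the two columns agree closely here. With real data, the gap between them is what indicates how well the fitted distribution describes the observations.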

Log(earnings) is normally distributed

Recall from our discussion above that if a random variable $X$ is log-normally distributed, then $Y = \ln(X)$ has a normal distribution. Based on the above I have reason to believe that Patreon earnings from monthly charges for the month in question are log-normally distributed. I should therefore expect that the natural logarithm of those earnings is normally distributed.

I explore that expectation by taking the logarithms of all earnings values and then plotting a histogram showing the number of those resulting values that fall into particular ranges of values, each with width 0.2.

by_earnings_tb %>%
  mutate(logEarnings = log(Earnings)) %>%
  ggplot(mapping=aes(x = logEarnings)) +
  geom_histogram(binwidth = 0.2) +
  geom_vline(xintercept = lnorm_mu, color = "#E69F00") +
  geom_vline(xintercept = lnorm_mu - lnorm_sigma, color = "#56B4E9", linetype = "dashed") +
  geom_vline(xintercept = lnorm_mu + lnorm_sigma, color = "#56B4E9", linetype = "dashed") +
  xlab("log(Earnings)") +
  ylab("Number of Projects") +
  labs(
    title = "Log(Earnings) Distribution for Patreon Projects",
    subtitle = "All Projects with Nonzero Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))

The orange solid line in the plot above marks the value of $\mu$, the first parameter estimated from fitting a log-normal distribution to the Patreon earnings data, while the blue dashed lines mark the values $\mu - \sigma$ and $\mu + \sigma$, where $\sigma$ is the second parameter estimated from fitting a log-normal distribution to the data.

(The spikes in the counts on the left side of the distribution are presumably from particular levels of earnings where the data deviates from a log-normal distribution.)

Here the reason for using the symbols $\mu$ and $\sigma$ becomes apparent: they represent the mean and standard deviation respectively of the normal distribution corresponding to the log-normal distribution.

I can confirm that by calculating the sample mean and sample standard deviation of the logarithms of earnings:

mean_log <- mean(log(by_earnings_tb$Earnings))
sd_log <- sd(log(by_earnings_tb$Earnings))

The value of the sample mean of the logged earnings is 3.29, compared to the value 3.33 for $\mu$, while the value of the sample standard deviation is 1.84, compared to the value 1.84 for $\sigma$.
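Another way to check normality, beyond comparing means and standard deviations, is to compare quantiles of the logged values against the quantiles of a normal distribution with the same mean and standard deviation. The sketch below does this in base R on simulated data; the parameter values are illustrative and the vector x is not the Patreon data.

```r
set.seed(7)
# Synthetic stand-in for earnings, log-normal by construction.
x <- rlnorm(20000, meanlog = 3.3, sdlog = 1.8)
log_x <- log(x)

# Quantiles of the logged values at a few probability levels.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
sample_q <- quantile(log_x, probs, names = FALSE)

# Quantiles of a normal distribution with the same mean and standard deviation.
normal_q <- qnorm(probs, mean = mean(log_x), sd = sd(log_x))

# Largest absolute gap between the two sets of quantiles.
max_gap <- max(abs(sample_q - normal_q))

round(data.frame(probs, sample_q, normal_q), 2)
```

If the logged values were far from normal (for example, strongly skewed), the sample quantiles would drift away from the normal quantiles at one or both ends.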

Conclusions and speculations

Based on the above analysis, I conclude the following:

First, the distribution of Patreon earnings from monthly charges is highly unequal, and the likelihood of any individual project making a significant amount of money is very low.

However, for the month in question the “Patreonville” and “rest of Patreonia” subsets discussed above (containing 10,000 and 100,000 projects respectively) together accounted for almost two-thirds of the earnings from monthly charges (63%), while the “rest of Patreonia” subset alone accounted for over a quarter (26%).

Assuming that Patreon derives a relatively fixed and equal percentage of the revenue that each project makes, that means that for the month in question Patreon derived the majority of its own revenue from projects with relatively low earnings. That in turn means that serving such low-earnings projects might have been profitable for Patreon, as long as the marginal cost to serve a new project was as close to zero as possible.

Second, for the month in question the distribution of Patreon earnings from monthly charges does not follow a power law, but rather can be best modeled using a log-normal distribution.

This fact gives rise to an interesting speculation, or rather two related speculations:

As I noted above, taking the logarithm of the values of a log-normal distribution produces a normal distribution. But this can be run in reverse: if $X$ is a random variable that is normally-distributed, then $\exp(X)$ is a random variable that is log-normally distributed.

The first speculation is that there is an $X$-factor that determines the quality/value/appeal of an individual project—be it the talent of the creators associated with the project, the format or subject of the work being produced, how long the project has been active, or a combination of all these and even more—and that this $X$-factor is normally-distributed.

(Indeed, if the $X$-factor is a combination of a relatively large number of relatively independent subfactors, each normally distributed or nearly so, then we would expect it to be normally-distributed itself, or nearly so—since the sum of independent normally-distributed random variables is itself a normally-distributed random variable.)

The second speculation is that there is something in the structure of Patreon—and perhaps other similar services—that takes this normally-distributed $X$-factor and rewards projects with earnings that are proportional not to $X$ but to $\exp(X)$. It would thereby convert the original normal distribution of quality/appeal/value/whatever into a log-normal distribution of earnings, with a few projects receiving outsized earnings compared to other projects whose quality/appeal/value/whatever was not that much different from the high-earning projects.
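The two speculations can be illustrated with a small simulation. In the sketch below each project's $X$-factor is the sum of several independent normally distributed subfactors, and earnings are set equal to $\exp(X)$. The number of subfactors and their means and standard deviations are entirely hypothetical, chosen only to produce values on a scale similar to those discussed above.

```r
set.seed(99)
n_projects <- 50000
n_subfactors <- 8    # hypothetical number of independent subfactors

# Each row is a project; each column is one normally distributed subfactor.
subfactors <- matrix(rnorm(n_projects * n_subfactors, mean = 0.4, sd = 0.65),
                     nrow = n_projects)

# The X-factor is the sum of the subfactors, so it is itself normal.
x_factor <- rowSums(subfactors)

# Hypothetical reward rule: earnings proportional to exp(X).
earnings <- exp(x_factor)

# Share of total earnings going to the top 1% of projects.
top_n <- ceiling(0.01 * n_projects)
top_share <- sum(sort(earnings, decreasing = TRUE)[1:top_n]) / sum(earnings)

# Taking logs recovers the normally distributed X-factor.
c(mean_log = mean(log(earnings)), sd_log = sd(log(earnings)), top_share = top_share)
```

Even though no project's $X$-factor is far from the others in absolute terms, the exponential reward rule concentrates a large share of total earnings in a small fraction of projects.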

On that bit of speculation I’ll end this analysis. As the saying goes, “further research is needed.”

Appendix

Caveats

This analysis is subject to the following caveats, among others:

The Graphtreon dataset does not contain Patreon projects that do not publicly report their number of patrons. If the likelihood of a project doing this is not uniform across all projects, this may skew the results, since the dataset would not necessarily be a representative sample of all Patreon projects.
As noted above, the Graphtreon dataset contains many projects that do not publicly report earnings. This is more common among projects with the most patrons, and may skew the results for the highest-ranked projects by earnings.

References

Patreon project data was obtained from Graphtreon LLC as a basic CSV export for the month of December 2022, https://graphtreon.com/data-services.

The standard reference for assessing whether empirical data fits a power-law distribution is Aaron Clauset, Cosma Rohilla Shalizi, and M.E.J. Newman, “Power-law distributions in empirical data,” arXiv:0706.1062 [physics.data-an].

The poweRlaw R package for assessing whether a distribution fits a power law is described in C.S. Gillespie, “Fitting Heavy Tailed Distributions: The poweRlaw Package,” Journal of Statistical Software, 64(2), 1–16, http://www.jstatsoft.org/v64/i02/.

Country-level Gini coefficients are from the table “UN, World Bank and CIA list – income ratios and Gini indices” in the Wikipedia article “List of countries by income equality.” An explanation of how Gini coefficients are calculated can be found in my blog post “Income Inequality in Howard County, Part 1.”

Environment

I used the following R environment in doing the analysis above:

sessionInfo()

## R version 4.2.1 (2022-06-23)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## attached base packages:
## [1] tools     stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] poweRlaw_0.70.6   DescTools_0.99.47 forcats_0.5.2     stringr_1.5.0    
##  [5] dplyr_1.0.10      purrr_1.0.1       readr_2.1.3       tidyr_1.2.1      
##  [9] tibble_3.1.8      ggplot2_3.4.0     tidyverse_1.3.2  
## 
## loaded via a namespace (and not attached):
##  [1] fs_1.5.2            lubridate_1.9.0     bit64_4.0.5        
##  [4] httr_1.4.4          backports_1.4.1     bslib_0.4.2        
##  [7] utf8_1.2.2          R6_2.5.1            DBI_1.1.3          
## [10] colorspace_2.0-3    withr_2.5.0         tidyselect_1.2.0   
## [13] Exact_3.2           bit_4.0.5           compiler_4.2.1     
## [16] cli_3.6.0           rvest_1.0.3         expm_0.999-7       
## [19] xml2_1.3.3          labeling_0.4.2      sass_0.4.4         
## [22] scales_1.2.1        mvtnorm_1.1-3       proxy_0.4-27       
## [25] digest_0.6.31       rmarkdown_2.19      pkgconfig_2.0.3    
## [28] htmltools_0.5.4     dbplyr_2.2.1        fastmap_1.1.0      
## [31] highr_0.10          rlang_1.0.6         readxl_1.4.1       
## [34] rstudioapi_0.14     jquerylib_0.1.4     generics_0.1.3     
## [37] farver_2.1.1        jsonlite_1.8.4      vroom_1.6.0        
## [40] googlesheets4_1.0.1 magrittr_2.0.3      Matrix_1.5-3       
## [43] Rcpp_1.0.9          munsell_0.5.0       fansi_1.0.3        
## [46] lifecycle_1.0.3     stringi_1.7.12      yaml_2.3.6         
## [49] MASS_7.3-58.1       rootSolve_1.8.2.3   grid_4.2.1         
## [52] parallel_4.2.1      crayon_1.5.2        lmom_2.9           
## [55] lattice_0.20-45     haven_2.5.1         hms_1.1.2          
## [58] knitr_1.41          pillar_1.8.1        boot_1.3-28.1      
## [61] gld_2.6.6           reprex_2.0.2        glue_1.6.2         
## [64] evaluate_0.19       data.table_1.14.6   modelr_0.1.10      
## [67] vctrs_0.5.1         tzdb_0.3.0          cellranger_1.1.0   
## [70] gtable_0.3.1        assertthat_0.2.1    cachem_1.0.6       
## [73] xfun_0.36           broom_1.0.2         pracma_2.4.2       
## [76] e1071_1.7-12        class_7.3-20        googledrive_2.0.0  
## [79] gargle_1.2.1        timechange_0.2.0    ellipsis_0.3.2

Source code

The source code for this analysis can be found in the public code repository https://gitlab.com/frankhecker/misc-analysis in the patreon subdirectory.

This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.