The Patreon service, part of the so-called “creator economy” (Chayka 2021), markets itself to artists, musicians, writers, podcasters, and others as a place where “creators retain creative freedom while getting the salary they deserve” (Patreon, n.d.). Patreon advertises itself as having over 8 million monthly active “patrons” (people who pay creators for various services) and more than 250,000 creators on the platform, and as having paid out over $3.5 billion to creators since 2013, when the service was founded.
In addition to such summary statistics, Patreon (2022) has also published survey results breaking down various characteristics of its creators, such as where they live, what types of content they produce, and (in particular) what proportion of their “creative income” comes from Patreon. However, neither Patreon nor (to my knowledge) anyone else has published an in-depth analysis of the distribution of earnings among the platform’s projects.
Fortunately, an independent organization makes available monthly reports listing all Patreon creators reporting their number of patrons, along with earnings data for those projects that make that information publicly available (Graphtreon, n.d.). I use one month’s worth of the Graphtreon data (for December 2022) to explore the distribution of earnings across Patreon projects, with a focus on projects that charge patrons by the month (as do the vast majority of projects reporting earnings). This analysis revises and expands upon previous analyses (Hecker 2023a; 2023b; 2023c).
My goals in this analysis are as follows:
To do the analysis I use the R statistical software language (R Foundation, n.d.), the Tidyverse R package (Hadley and RStudio 2023), and other R packages as discussed below.
I load the following R libraries, for the purposes listed:
library("tidyverse")
library("tools")
library("DescTools")
library("poweRlaw")
library("gglorenz")
I use a local copy of the Graphtreon-collected Patreon data for December 2022. This dataset contains an entry for every Patreon project for which the number of patrons is publicly reported. Because the Graphtreon data is proprietary, I store it in a separate directory and do not make it available as part of this analysis.
I check the MD5 hash value for the file, and stop if the contents are not what is expected.
stopifnot(md5sum("../../graphtreon/graphtreonBasicExport_Dec2022.csv") == "98ff63f7d6aa3f2d1b2acaf40425ac9b")
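The check above can be illustrated without the proprietary file. The following is a minimal sketch on a temporary file; `check_md5()` is a hypothetical helper introduced here for illustration, not part of the original analysis, and it relies only on `md5sum()` from the `tools` package.

```r
library(tools)  # provides md5sum()

# Hypothetical helper: stop with a readable message if a file's MD5 hash
# does not match the expected value.
check_md5 <- function(path, expected) {
  actual <- unname(md5sum(path))  # md5sum() returns NA for a missing file
  if (is.na(actual)) {
    stop("File not found: ", path)
  }
  if (!identical(actual, expected)) {
    stop("MD5 mismatch for ", path, ": got ", actual)
  }
  invisible(TRUE)
}

# Demonstration on a temporary file rather than the Graphtreon export.
demo_file <- tempfile(fileext = ".csv")
writeLines(c("Name,Patrons", "Example,10"), demo_file)
demo_hash <- unname(md5sum(demo_file))

check_md5(demo_file, demo_hash)  # passes silently

# Changing the file changes the hash, so the same check now fails.
writeLines(c("Name,Patrons", "Example,11"), demo_file)
changed <- tryCatch(
  check_md5(demo_file, demo_hash),
  error = function(e) "mismatch"
)
```

Compared with a bare `stopifnot()`, a helper like this reports the hash it actually found, which can make a stale or truncated download easier to diagnose.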
I load the raw Patreon data from Graphtreon into the table patreon_tb:
patreon_tb <- read_csv(
"../../graphtreon/graphtreonBasicExport_Dec2022.csv",
col_types = "cccidcicicc"
)
This table contains the following fields:
Name. The name of the Patreon project.
Creation Name. The purpose of the project (e.g., “Creating video games”).
Category. The general type of content produced (e.g., “Podcasts”).
Patrons. The number of patrons of the project.
Earnings. Reported earnings for the month. This value is missing for projects that do not publicly report their earnings.
Pay Per. The basis on which patrons are charged. For projects charging by the month (the vast majority, as shown below) this has the value “month”. Other examples include “Podcast,” “Blog Post,” and many others. (There does not appear to be any standardization of terms.)
Is Nsfw. This has the value 1 if the content is NSFW, 0 otherwise.
Launched. The date and time that the project was launched (e.g., “2018-09-13 20:16:39”).
Twitter Followers. The number of Twitter followers for the project. This value is missing for projects for which the number of Twitter followers cannot be determined.
Patreon. The Patreon URL for the project.
Graphtreon. The Graphtreon URL for the project.
As discussed below, I focus primarily on the Patrons and Earnings fields.
I do some basic exploratory data analysis, starting with the total amount of data in the dataset.
total_projects <- length(patreon_tb$Patrons)
There are a total of 217,861 projects listed in the Graphtreon data for the month in question. Note the word “projects” here, not “creators”: Patreon is organized by projects, and it’s possible that a given person may have more than one project active. It’s also possible that a given project may be associated with multiple people.
I suspect that the vast majority of Patreon projects are associated with one creator, and that the vast majority of people have only one project in which they participate. Unfortunately there’s no way of telling from the data at hand how true this is. I’ll therefore be careful in the terms I use, and will generally refer to “projects,” not “creators.”
Moving on to the actual data fields, as previously discussed there are three numeric variables of interest in the Graphtreon data:
The Graphtreon data should include only projects that reported a nonzero number of patrons, so I do an initial check to see if that is true:
no_reported_patrons <- sum(is.na(patreon_tb$Patrons))
zero_patrons <- sum(!is.na(patreon_tb$Patrons) & patreon_tb$Patrons <= 0)
For the month in question there were 0 projects in the dataset that did not report their number of patrons, and 0 projects that reported having zero patrons. So it is in fact the case that for this dataset the number of patrons is always present and is always one or more.
As noted above, my primary focus is on earnings, since making money is presumably why people start Patreon projects in the first place, and the promise of making money is the main selling point of the so-called “creator economy.”
I therefore focus in particular on projects that have the following characteristics:
How important are projects not in the latter two categories? I calculate the number of projects that reported zero earnings, or have nonzero earnings but don’t charge by the month:
unreported_earnings <- sum(is.na(patreon_tb$Earnings))
zero_earnings <- sum(!is.na(patreon_tb$Earnings) & patreon_tb$Earnings == 0)
nonmonthly_earnings <- patreon_tb %>%
filter(!is.na(Earnings) & Earnings > 0) %>%
filter(is.na(`Pay Per`) | `Pay Per` != "month") %>%
summarize(n()) %>%
as.integer()
monthly_earnings <- patreon_tb %>%
filter(!is.na(Earnings) & Earnings > 0) %>%
filter(!is.na(`Pay Per`) & `Pay Per` == "month") %>%
summarize(n()) %>%
as.integer()
There were only 265 projects that reported zero earnings (as opposed to not publicly reporting earnings at all). Given the relatively small size of this group, I ignore it in the analysis.
There were only 5,369 projects that reported nonzero earnings and did not charge by the month. Again, given the relatively small size of this group, I ignore it as well.
This leaves 128,933 projects reporting nonzero earnings from monthly charges, representing 59% of all projects in the Graphtreon dataset.
Can an analysis of projects reporting nonzero earnings from monthly charges be reasonably extrapolated to those projects not reporting earnings at all?
One key question is whether top-ranked projects are less likely to report earnings than lower-ranked projects. To gauge the extent to which this is true, I sort all projects in descending order by number of patrons, group them into batches of 1,000 projects each, then look at how many projects in each batch do not report their earnings at all.
patreon_tb %>%
arrange(desc(Patrons)) %>%
mutate(Patrons_Rank = row_number()) %>%
mutate(Patrons_Rank_Group = ceiling(Patrons_Rank / 1000)) %>%
select(Patrons_Rank_Group, Earnings) %>%
group_by(Patrons_Rank_Group) %>%
summarize(Not_Reporting = mean(is.na(Earnings))) %>% # fraction of each batch's actual size; the last batch has fewer than 1,000 projects
ggplot(aes(x = Patrons_Rank_Group, y = Not_Reporting)) +
geom_point() +
scale_y_continuous(breaks = c(0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8), limits = c(0, 0.8)) +
xlab("Rank by Number of Patrons (000)") +
ylab("Fraction Not Reporting Earnings") +
labs(
title = "Patreon Projects Not Reporting Monthly Earnings",
subtitle = "Fraction Not Reporting Monthly Earnings, By Rank in Number of Patrons",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_gray() +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
Almost three quarters of the top-ranked projects (by number of patrons) do not report their earnings, while just over a quarter of the lowest-ranked projects do not report earnings.
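One detail matters when projects are grouped into batches of a fixed nominal size: the final batch is usually smaller (217,861 projects leave only 861 in the last batch), so the fraction not reporting should be taken over each batch's actual size. The following is a minimal base R sketch on synthetic data, not the Graphtreon data; all variable names are illustrative.

```r
# Synthetic example: 2,500 projects, so the last batch holds only 500.
set.seed(1)
patrons  <- sort(sample(1:5000, 2500, replace = TRUE), decreasing = TRUE)
earnings <- ifelse(runif(2500) < 0.4, NA, 100)  # about 40% unreported

# Batch number for each project, in descending order of patrons.
rank_group <- ceiling(seq_along(patrons) / 1000)

# mean(is.na(x)) divides by the actual batch size, so a partial final
# batch is not understated the way a fixed divisor of 1000 would make it.
frac_unreported <- tapply(is.na(earnings), rank_group, mean)
batch_sizes     <- table(rank_group)
```

Here `batch_sizes` shows two full batches of 1,000 and a final batch of 500, and `frac_unreported` stays close to 0.4 in all three.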
Are there other differences between projects reporting nonzero earnings from monthly charges and projects not reporting earnings at all? To do further analysis I create a table of those projects that reported nonzero earnings from monthly charges (earnings_tb) and a table of those projects that did not report their earnings (unreported_tb). I also arrange the first table in rank order by earnings.
earnings_tb <- patreon_tb %>%
filter(!is.na(Earnings) & Earnings > 0 & !is.na(`Pay Per`) & `Pay Per` == "month") %>%
arrange(desc(Earnings))
earnings_tb <- earnings_tb %>%
mutate(Earnings_Rank = 1:nrow(earnings_tb))
unreported_tb <- patreon_tb %>%
filter(is.na(Earnings))
I then calculate summary statistics for the number of patrons per project in both sets of projects.
max_patrons_unreported <- max(unreported_tb$Patrons)
mean_patrons_unreported <- mean(unreported_tb$Patrons)
sd_patrons_unreported <- sd(unreported_tb$Patrons)
median_patrons_unreported <- median(unreported_tb$Patrons)
max_patrons_earnings <- max(earnings_tb$Patrons)
mean_patrons_earnings <- mean(earnings_tb$Patrons)
sd_patrons_earnings <- sd(earnings_tb$Patrons)
median_patrons_earnings <- median(earnings_tb$Patrons)
For the month in question the average number of patrons across all projects reporting nonzero earnings from monthly charges was 34 (with a standard deviation of 306) while the median number of patrons per project for this set of projects was 4. (The median being an order of magnitude less than the mean is a reflection of the top-ranked projects having disproportionately more patrons.) The maximum number of patrons for projects in this set was 37,391.
For the month in question the average number of patrons across all projects not reporting their earnings was 98 (with a standard deviation of 568) while the median number of patrons per project was 10. The maximum number of patrons for projects in this set was 44,454.
It appears that projects not reporting their earnings tend to have a larger number of patrons (and thus perhaps higher earnings) than projects that do report their earnings. This is consistent with the finding above that projects with a larger number of patrons are more likely to not report their earnings.
Do projects reporting earnings differ in other ways from projects not reporting earnings? I calculate the relative percentages of projects in each group that feature NSFW content:
num_nsfw_earnings = sum(!is.na(earnings_tb$`Is Nsfw`) & earnings_tb$`Is Nsfw` == 1)
num_nsfw_unreported = sum(!is.na(unreported_tb$`Is Nsfw`) & unreported_tb$`Is Nsfw` == 1)
pct_nsfw_earnings <- (100. * num_nsfw_earnings) / monthly_earnings
pct_nsfw_unreported <- (100. * num_nsfw_unreported) / unreported_earnings
24% of projects with nonzero earnings from monthly charges feature NSFW content, vs. 28% of projects that do not report their earnings. So, again there is a difference between the two sets of projects, with projects not reporting earnings slightly more likely to feature NSFW content.
Do median and mean earnings differ between NSFW and SFW projects?
# which() drops rows where Is Nsfw is missing, so NA flags cannot turn the results into NA
nsfw_mean = mean(earnings_tb$Earnings[which(earnings_tb$`Is Nsfw` == 1)])
nsfw_median = median(earnings_tb$Earnings[which(earnings_tb$`Is Nsfw` == 1)])
sfw_mean = mean(earnings_tb$Earnings[which(earnings_tb$`Is Nsfw` == 0)])
sfw_median = median(earnings_tb$Earnings[which(earnings_tb$`Is Nsfw` == 0)])
The median monthly earnings for NSFW projects charging by the month was $30, while the mean earnings was $216.
The median monthly earnings for SFW projects charging by the month was $23, while the mean earnings was $169.
Thus the median NSFW project had higher earnings than the median SFW project.
I conclude that the subset of projects that did not report earnings is somewhat different from the subset of projects that reported nonzero earnings from monthly charges. The main factor appears to be that the projects with the greatest number of patrons are significantly more likely to not report their earnings. It’s also possible that projects not reporting earnings have somewhat higher earnings on average because they’re more likely to produce NSFW content.
The remainder of this analysis focuses on projects reporting nonzero earnings from monthly charges, with the caveat that the analysis may not generalize to all Patreon projects.
Now that I have my dataset of interest, I can continue my exploratory data analysis, this time by plotting earnings from monthly charges as a function of rank (i.e., from those projects earning the most to those earning the least).
earnings_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "Patreon Monthly Earnings vs. Earnings Rank",
subtitle = "Patreon Projects Reporting Nonzero Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
scale_x_continuous(
breaks = c(25000, 50000, 75000, 100000, 125000),
labels = scales::label_comma()
) +
scale_y_continuous(
labels = scales::label_dollar()
) +
theme_grey() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
This is an extremely skewed distribution: for the month in question only relatively few top-ranked projects had significant earnings from monthly charges.
An alternative way of plotting such a highly skewed distribution is to plot both the \(x\)- and \(y\)-axes as logarithms of the underlying values (a so-called “log-log” plot). (This requires all values to be greater than zero, since the logarithm of zero is undefined.) Here is such a plot for earnings vs. earnings rank:
earnings_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
coord_trans(x = "log10", y = "log10") +
scale_x_continuous(
breaks = c(1, 10, 100, 1000, 10000, 100000, 200000),
labels = scales::label_comma()
) +
scale_y_continuous(
breaks = c(10, 100, 1000, 10000, 100000, 300000),
labels = scales::label_dollar()
) +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "Patreon Monthly Earnings vs. Earnings Rank (Log-Log)",
subtitle = "Patreon Projects Reporting Nonzero Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_grey() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
Note that the drop-off in earnings is even more pronounced for projects with the very lowest earnings.
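One reason log-log plots are useful for rank-ordered data is that a power-law relationship between value and rank appears as a straight line, so departures from a line are easy to spot. The following minimal sketch uses synthetic data that follows an exact power law; the constants `C` and `alpha` are arbitrary assumptions, and the sketch makes no claim about how closely the Patreon data follows such a law.

```r
# Synthetic rank-size data following an exact power law (NOT Patreon data):
# earnings = C * rank^(-alpha)
C     <- 250000
alpha <- 0.8
rank  <- 1:10000
earnings <- C * rank^(-alpha)

# On log-log axes the relationship is linear:
# log10(earnings) = log10(C) - alpha * log10(rank)
fit <- lm(log10(earnings) ~ log10(rank))
slope     <- unname(coef(fit)[2])  # recovers -alpha
intercept <- unname(coef(fit)[1])  # recovers log10(C)
```

On real data the fitted slope would only summarize the roughly linear part of the plot; a steeper fall-off at the low end, like the one noted above, would show up as points dropping below the fitted line.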
As shown in the previous section, the distribution of earnings from monthly charges among Patreon projects was highly unequal for the month in question. I now compute some example statistics to characterize this inequality:
mean_earnings <- mean(earnings_tb$Earnings)
sd_earnings <- sd(earnings_tb$Earnings)
median_earnings <- median(earnings_tb$Earnings)
top_point_1_pct = round(0.001 * monthly_earnings)
top_1_pct = round(0.01 * monthly_earnings)
top_10_pct = round(0.1 * monthly_earnings)
top_25_pct = round(0.25 * monthly_earnings)
top_50_pct = round(0.5 * monthly_earnings)
total_earnings <- sum(earnings_tb$Earnings)
top_point_1_pct_share = sum(earnings_tb$Earnings[1:top_point_1_pct]) / total_earnings
top_1_pct_share = sum(earnings_tb$Earnings[1:top_1_pct]) / total_earnings
top_10_pct_share = sum(earnings_tb$Earnings[1:top_10_pct]) / total_earnings
top_25_pct_share = sum(earnings_tb$Earnings[1:top_25_pct]) / total_earnings
top_50_pct_share = sum(earnings_tb$Earnings[1:top_50_pct]) / total_earnings
frac_over_10 = sum(earnings_tb$Earnings > 10) / monthly_earnings
frac_over_100 = sum(earnings_tb$Earnings > 100) / monthly_earnings
frac_over_1000 = sum(earnings_tb$Earnings > 1000) / monthly_earnings
frac_over_10000 = sum(earnings_tb$Earnings > 10000) / monthly_earnings
For the month in question the average earnings from monthly charges per project was $180 (with a standard deviation of $1,410), while the median earnings per project was $25. The median being an order of magnitude less than the mean is a reflection of the top-ranked projects having disproportionately more earnings from monthly charges.
More specifically, for the month in question:
Turning now to the proportion of projects earning more than a certain amount in monthly charges for the month in question:
The inequality of earnings among Patreon projects can be shown graphically by the Lorenz curve (Lorenz 1905), which in this case shows the cumulative fraction of monthly earnings plotted against the cumulative fraction of Patreon projects. The curve can in turn be described using the Gini coefficient or Gini index (Farris 2010), a widely used measure of income inequality.
gini_earnings <- Gini(earnings_tb$Earnings)
earnings_tb %>%
ggplot(aes(Earnings)) +
stat_lorenz(desc = FALSE) +
coord_fixed() +
geom_abline(linetype = "dashed") +
theme_grey() +
scale_x_continuous(breaks = c(0.0, 0.25, 0.50, 0.75, 1.00)) +
scale_y_continuous(breaks = c(0.0, 0.25, 0.50, 0.75, 1.00)) +
labs(x = "Cumulative Fraction of Projects",
y = "Cumulative Fraction of Earnings",
title = "Patreon Monthly Earnings Inequality",
subtitle = "Earnings Share for Projects Reporting Monthly Earnings",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
annotate_ineq(earnings_tb$Earnings)
As noted above, the bottom 50% of projects capture a very small percentage of total monthly earnings, about 2%, and the bottom 75% don’t do much better, capturing only 9% of total monthly earnings.
If the distribution of earnings were completely equal among Patreon projects then the Lorenz curve would look like the dashed line in the graph: the bottom 25% of projects would have 25% of total monthly earnings, the bottom 50% of projects would have 50% of total monthly earnings, and so on.
The Gini coefficient has a geometric interpretation as follows: it is the fraction of the total area under the dashed line that is taken up by the region between the dashed line and the Lorenz curve. (For a simple example of how this can be calculated, see [Hecker 2008].)
If the Lorenz curve is close to the line representing equal shares then that region will be relatively small, and the Gini coefficient will be close to zero. On the other hand, if the Lorenz curve is far away from the line representing equal shares (as is the case above) then that region will be relatively large, and the Gini coefficient will be close to 1.
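The geometric interpretation can be made concrete with a short calculation. The following is a minimal sketch that computes the Gini coefficient from the Lorenz curve using the trapezoid rule; `gini_from_lorenz()` is a hypothetical helper for illustration, and library functions such as `DescTools::Gini()` may apply a small-sample correction that this sketch omits (the difference is negligible for a sample as large as the one analyzed here).

```r
# Gini coefficient from the Lorenz curve, using the trapezoid rule.
# gini_from_lorenz() is a hypothetical helper for illustration only.
gini_from_lorenz <- function(x) {
  x <- sort(x)
  n <- length(x)
  # Lorenz curve points: cumulative share of projects vs. share of earnings
  p <- c(0, seq_len(n) / n)
  L <- c(0, cumsum(x) / sum(x))
  # Area under the Lorenz curve (trapezoid rule)
  area_lorenz <- sum(diff(p) * (head(L, -1) + tail(L, -1)) / 2)
  # The area under the line of equality is 1/2, so
  # Gini = (1/2 - area_lorenz) / (1/2) = 1 - 2 * area_lorenz
  1 - 2 * area_lorenz
}

equal   <- gini_from_lorenz(c(10, 10, 10, 10))  # perfectly equal shares: 0
unequal <- gini_from_lorenz(c(0, 0, 0, 40))     # one project earns everything: 0.75
```

With only four projects the most unequal case gives 0.75 rather than 1, because the uncorrected coefficient has a maximum of (n - 1) / n; it approaches 1 as the number of projects grows.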
The Gini coefficient associated with earnings from monthly charges per project is 0.84, corresponding to a very unequal distribution of earnings from monthly charges, consistent with the other statistics.
By comparison, based on data from Wikipedia the country with the greatest income inequality in the world is South Africa, where an advanced urban economy coexists with vast swaths of poverty. South Africa’s Gini coefficient is 0.63 (World Bank 2023). As a further comparison, the Gini coefficient for the United States is 0.37, and the Gini coefficients for the various Scandinavian countries range from 0.26 to 0.29.
Finally, for the month in question the total earnings from monthly charges for all projects combined was $23,240,210. This number helped drive Patreon’s overall revenue and profits for 2022 (since Patreon makes money by taking fees from each project), but it has no relevance for any individual project.
To get a better feel for how earnings were distributed among Patreon projects for the month in question, I now look at the following subsets of the total sample of 128,933 projects reporting nonzero earnings from monthly charges, with each subset being 10 times larger than the last; for convenience in referring to them I give them names:
These subsets combined accounted for 86% of the projects reporting nonzero earnings from monthly charges.
For each subset I compute the following quantities and then do a log-log plot of earnings vs. rank:
I calculate the quantities above for the top 100 projects by reported earnings from monthly charges:
ph_tb <- earnings_tb %>%
filter(Earnings_Rank > 0 & Earnings_Rank <= 100)
mean_ph_earnings <- mean(ph_tb$Earnings)
sd_ph_earnings <- sd(ph_tb$Earnings)
median_ph_earnings <- median(ph_tb$Earnings)
min_ph_earnings <- min(ph_tb$Earnings)
max_ph_earnings <- max(ph_tb$Earnings)
gini_ph_earnings <- Gini(ph_tb$Earnings)
total_ph_earnings <- sum(ph_tb$Earnings)
pct_ph_earnings <- (100. * total_ph_earnings) / total_earnings
mean_ph_patrons <- mean(ph_tb$Patrons)
sd_ph_patrons <- sd(ph_tb$Patrons)
median_ph_patrons <- median(ph_tb$Patrons)
min_ph_patrons <- min(ph_tb$Patrons)
max_ph_patrons <- max(ph_tb$Patrons)
For the month in question the resulting values for earnings from monthly charges for the top 100 projects were as follows:
The resulting values for the number of patrons for the top 100 projects were as follows:
I next do a log-log plot of earnings vs. rank for the top 100 projects ranked by earnings from monthly charges:
ph_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
coord_trans(x = "log10", y = "log10") +
scale_x_continuous(
breaks = c(5, 10, 25, 50, 100),
labels = scales::label_comma()
) +
scale_y_continuous(
breaks = c(25000, 50000, 100000, 150000, 200000),
labels = scales::label_dollar()
) +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "“Patreon Heights” Monthly Earnings vs. Earnings Rank",
subtitle = "Top 100 Patreon Projects Reporting Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_gray() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
Even in this highest-earning subset of projects we see the phenomenon that earnings dropped off rapidly for lower-ranked projects.
If we extrapolate to an entire year, the median annual earnings for “Patreon Heights” would be approximately $294,000. If we compare “Patreon Heights” to real-world jurisdictions, the closest comparison is to an especially affluent neighborhood in an especially affluent county in the US, like Loudoun County, Virginia, which has the highest median household income of any US county at $153,506 and a Gini coefficient of 0.37 (US Census Bureau 2021b; 2021a).
Creators with projects in this subset are the crème de la crème, those who have built successful full-time businesses on Patreon and who are often held up as examples of the viability of the “creator economy.”
I calculate the quantities above for the next 1,000 projects by reported earnings from monthly charges:
pg_tb <- earnings_tb %>%
filter(Earnings_Rank > 100 & Earnings_Rank <= 1100)
mean_pg_earnings <- mean(pg_tb$Earnings)
sd_pg_earnings <- sd(pg_tb$Earnings)
median_pg_earnings <- median(pg_tb$Earnings)
min_pg_earnings <- min(pg_tb$Earnings)
max_pg_earnings <- max(pg_tb$Earnings)
gini_pg_earnings <- Gini(pg_tb$Earnings)
total_pg_earnings <- sum(pg_tb$Earnings)
pct_pg_earnings <- (100. * total_pg_earnings) / total_earnings
mean_pg_patrons <- mean(pg_tb$Patrons)
sd_pg_patrons <- sd(pg_tb$Patrons)
median_pg_patrons <- median(pg_tb$Patrons)
min_pg_patrons <- min(pg_tb$Patrons)
max_pg_patrons <- max(pg_tb$Patrons)
For the month in question the resulting values for earnings from monthly charges for the next 1,000 projects were as follows:
The resulting values for the number of patrons for the next 1,000 projects were as follows:
I next do a log-log plot of earnings vs. rank for the next 1,000 projects ranked by earnings from monthly charges:
pg_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
coord_trans(x = "log10", y = "log10") +
scale_x_continuous(
breaks = c(150, 250, 500, 750, 1000),
labels = scales::label_comma()
) +
scale_y_continuous(
breaks = c(5000, 7500, 10000, 15000),
labels = scales::label_dollar()
) +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "“Patreon Grove” Monthly Earnings vs. Earnings Rank",
subtitle = "Patreon Projects Ranked 101-1100 in Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_gray() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
This shows a drop-off in earnings similar to that in the first subset.
The estimated annual median earnings in “Patreon Grove” (approximately $50,000) is well under the current median household income in the US. A US county with a comparable median household income (at $50,045) is Angelina County, Texas (US Census Bureau 2021b).
US-based creators in this subset could earn a living as full-time Patreon creators if they live alone, are willing to accept a somewhat lower standard of living, supplement their income with a side job, or live somewhere with a lower cost of living. However, this level of income is well above the median in some developed countries.
I calculate the quantities above for the next 10,000 projects by earnings:
pv_tb <- earnings_tb %>%
filter(Earnings_Rank > 1100 & Earnings_Rank <= 11100)
mean_pv_earnings <- mean(pv_tb$Earnings)
sd_pv_earnings <- sd(pv_tb$Earnings)
median_pv_earnings <- median(pv_tb$Earnings)
min_pv_earnings <- min(pv_tb$Earnings)
max_pv_earnings <- max(pv_tb$Earnings)
gini_pv_earnings <- Gini(pv_tb$Earnings)
total_pv_earnings <- sum(pv_tb$Earnings)
pct_pv_earnings <- (100. * total_pv_earnings) / total_earnings
mean_pv_patrons <- mean(pv_tb$Patrons)
sd_pv_patrons <- sd(pv_tb$Patrons)
median_pv_patrons <- median(pv_tb$Patrons)
min_pv_patrons <- min(pv_tb$Patrons)
max_pv_patrons <- max(pv_tb$Patrons)
For the month in question the resulting values for earnings from monthly charges for the next 10,000 projects were as follows:
The resulting values for the number of patrons for the next 10,000 projects were as follows:
I next do a log-log plot of earnings vs. rank for the next 10,000 projects ranked by earnings from monthly charges:
pv_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
coord_trans(x = "log10", y = "log10") +
scale_x_continuous(
breaks = c(2500, 5000, 7500, 10000),
labels = scales::label_comma()
) +
scale_y_continuous(
breaks = c(500, 1000, 2000, 3000),
labels = scales::label_dollar()
) +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "“Patreonville” Monthly Earnings vs. Earnings Rank",
subtitle = "Patreon Projects Ranked 1,101-11,100 in Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_gray() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
This shows a drop-off in earnings similar to that in the first two subsets.
The estimated annual median earnings in “Patreonville,” about $7,900, is well below the US Federal poverty guideline of $14,580 for a single-person household (US Department of Health and Human Services 2023), and is lower than the median household income for any jurisdiction in the US including Puerto Rico. (As a comparison, the median household income in Mayagüez, Puerto Rico, is $15,941 [US Census Bureau 2021b].)
A typical US-based person with a project in this subset would need to have a day job or the support of a partner or family. However, there are other countries where this might constitute a good middle-class income.
I calculate the quantities above for the next 100,000 projects by earnings:
rop_tb <- earnings_tb %>%
filter(Earnings_Rank > 11100 & Earnings_Rank <= 111100)
mean_rop_earnings <- mean(rop_tb$Earnings)
sd_rop_earnings <- sd(rop_tb$Earnings)
median_rop_earnings <- median(rop_tb$Earnings)
min_rop_earnings <- min(rop_tb$Earnings)
max_rop_earnings <- max(rop_tb$Earnings)
gini_rop_earnings <- Gini(rop_tb$Earnings)
total_rop_earnings <- sum(rop_tb$Earnings)
pct_rop_earnings <- (100. * total_rop_earnings) / total_earnings
mean_rop_patrons <- mean(rop_tb$Patrons)
sd_rop_patrons <- sd(rop_tb$Patrons)
median_rop_patrons <- median(rop_tb$Patrons)
min_rop_patrons <- min(rop_tb$Patrons)
max_rop_patrons <- max(rop_tb$Patrons)
For the month in question the resulting values for earnings from monthly charges for the next 100,000 projects were as follows:
The resulting values for the number of patrons for the next 100,000 projects were as follows:
I next do a log-log plot of earnings vs. rank for the next 100,000 projects ranked by earnings from monthly charges:
rop_tb %>%
ggplot(mapping=aes(x = Earnings_Rank, y = Earnings)) +
geom_point() +
coord_trans(x = "log10", y = "log10") +
scale_x_continuous(
breaks = c(25000, 50000, 75000, 100000),
labels = scales::label_comma()
) +
scale_y_continuous(
breaks = c(10, 25, 50, 100, 250),
labels = scales::label_dollar()
) +
xlab("Earnings Rank") +
ylab("Earnings") +
labs(
title = "“Rest of Patreonia” Monthly Earnings vs. Earnings Rank",
subtitle = "Patreon Projects Ranked 11,101-111,100 in Earnings from Monthly Charges",
caption = "Data source: Graphtreon Basic CSV Export, December 2022"
) +
theme_gray() +
theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
theme(axis.title.x = element_text(margin = margin(t = 5))) +
theme(axis.title.y = element_text(margin = margin(r = 10))) +
theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
This shows a drop-off for lower-ranked projects, as did the plots for the first three subsets, but there is deviation from the straight-line behavior for the lowest-ranked projects in this subset, indicating an even more severe drop-off in that region.
The estimated annual median earnings in the rest of “Patreonia” is about $340. For a typical US-based person this would be a “hobby business.” Even people outside the US would find this to be at best a supplement to their regular income; it’s comparable to incomes in the poorest countries on Earth.
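The four subset calculations above repeat the same steps with different rank boundaries. As a sketch of how they could be consolidated, the hypothetical helper `summarize_rank_range()` below computes a reduced set of the same quantities for any rank range, using base R only. It is demonstrated on a small synthetic table, not the Graphtreon data, and it omits the standard deviation and Gini coefficient for brevity.

```r
# Hypothetical helper: summary statistics for the projects whose earnings
# rank falls in (lo, hi]. Assumes a data frame with Earnings, Patrons, and
# Earnings_Rank columns, ranked by descending Earnings as in earnings_tb.
summarize_rank_range <- function(tb, lo, hi) {
  sub <- tb[tb$Earnings_Rank > lo & tb$Earnings_Rank <= hi, ]
  data.frame(
    first_rank      = lo + 1,
    last_rank       = hi,
    n               = nrow(sub),
    mean_earnings   = mean(sub$Earnings),
    median_earnings = median(sub$Earnings),
    min_earnings    = min(sub$Earnings),
    max_earnings    = max(sub$Earnings),
    pct_of_total    = 100 * sum(sub$Earnings) / sum(tb$Earnings),
    median_patrons  = median(sub$Patrons)
  )
}

# Synthetic stand-in for earnings_tb (NOT the Graphtreon data).
toy_tb <- data.frame(
  Earnings = c(1000, 500, 250, 100, 50, 40, 30, 15, 10, 5),
  Patrons  = c(200, 90, 60, 25, 10, 9, 8, 5, 2, 1)
)
toy_tb$Earnings_Rank <- seq_len(nrow(toy_tb))

# Boundaries 0, 1, 3, 10 play the role of 0, 100, 1100, 11100, 111100.
bounds <- c(0, 1, 3, 10)
summary_tb <- do.call(
  rbind,
  Map(function(lo, hi) summarize_rank_range(toy_tb, lo, hi),
      head(bounds, -1), tail(bounds, -1))
)
```

The result has one row per subset, which makes it straightforward to compare medians and earnings shares across subsets side by side.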
One obvious way for a Patreon project to have higher earnings is to have more patrons. But that’s not the only way; in particular, a project could increase the amount of money it gets from each patron, for example, because relatively more patrons are in higher membership tiers. Which factor is more important for the Patreon projects in this dataset?
The technology pundit Kevin Kelly (2008) has claimed that the secret to success as a creator in the Internet age is to have “1,000 true fans”:
To be a successful creator you don’t need … millions of dollars or millions of customers, millions of clients or millions of fans. To make a living as a craftsperson, photographer, musician, designer, author, animator, app maker, entrepreneur, or inventor you need only thousands of true fans.
A true fan is defined as a fan that will buy anything you produce. …
If you keep the full $100 of each true fan, then you need only 1,000 of them to earn $100,000 per year. That’s a living for most folks. …
Kelly’s blog post spawned lots of follow-on blog posts, podcasts, YouTube videos, and even books. But with all the material promoting the idea, it’s not clear if anyone ever thought to systematically test Kelly’s claim that “1,000 true fans is an alternative path to success other than stardom. … It’s a much saner destiny to hope for. And you are much more likely to actually arrive there.”
So let’s test it in the context of Patreon. As Kelly notes, it’s not enough just to have 1,000 fans; they have to provide you $100 in annual profit per fan. How many projects meet this criterion?
true_fan_projects <- earnings_tb %>%
  filter(Patrons >= 1000 & EPP >= (100. / 12)) %>%
  summarize(n = n()) %>%
  as.integer()
There are only 31 Patreon projects strictly meeting this criterion (about 0.02% of all projects with nonzero monthly earnings) in terms of earnings. There are even fewer projects than that if we consider that Kelly was discussing $100 in profit per fan, since Patreon earnings do not account for all possible expenses incurred by creators.
Let’s change the criterion a bit. As Kelly notes, “If you are able to only earn $50 per year per true fan, then you need 2,000. (Likewise if you can sell $200 per year, you need only 500 true fans.)” So let’s count the number of projects that are earning the equivalent of $100,000 or more a year:
true_fan_projects_2 <- earnings_tb %>%
  filter(Earnings >= (100000. / 12)) %>%
  summarize(n = n()) %>%
  as.integer()
This improves the picture somewhat. There are 238 projects meeting this criterion, about 0.18% of all projects with nonzero monthly earnings. However, as noted above, this doesn’t account for any expenses incurred in the course of running a Patreon project.
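Kelly’s trade-off between the number of true fans and the revenue per fan is simple arithmetic, and can be sketched as follows. The variable names here are my own, chosen for illustration; the sketch does not touch the Patreon data.

```r
# Annual earnings target from Kelly's post
target_annual <- 100000

# Annual revenue per true fan: the three cases Kelly mentions
annual_per_fan <- c(50, 100, 200)

# Number of true fans needed to reach the target in each case
fans_needed <- target_annual / annual_per_fan

# Equivalent monthly earnings per patron (EPP), and the monthly
# earnings threshold used in the filter() calls above
monthly_epp <- annual_per_fan / 12
monthly_target <- target_annual / 12

data.frame(annual_per_fan, monthly_epp, fans_needed)
```

The monthly thresholds are the same quantities that appear as `100. / 12` and `100000. / 12` in the code above.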
The graph below shows how hard it is to achieve the combination of number of patrons and earnings per patron (represented by the green line) that equates to $100,000 a year in earnings. Only a few projects have earnings per patron that are above the line and thus have achieved “1,000 True Fans” status; almost all projects are below it.
earnings_tb %>%
  mutate(True_Fan_EPP = (100000 / 12) / Patrons) %>%
  ggplot(mapping = aes(x = Patrons, y = EPP)) +
  geom_point(alpha = 0.1) +
  geom_line(
    mapping = aes(x = Patrons, y = True_Fan_EPP),
    color = "#009E73"
  ) +
  coord_trans(x = "log10", y = "log10") +
  scale_x_continuous(
    breaks = c(1, 2, 5, 10, 25, 100, 250, 1000, 2500, 10000, 25000, 100000, 200000),
    labels = scales::label_comma()
  ) +
  scale_y_continuous(
    breaks = c(1, 10, 100, 1000),
    labels = scales::label_dollar()
  ) +
  xlab("Number of Patrons") +
  ylab("Earnings Per Patron") +
  labs(
    title = "Monthly EPP vs. EPP Needed for $100,000 Annual Earnings",
    subtitle = "Patreon Projects Reporting Nonzero Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.text.x = element_text(angle = 60, hjust = 1)) +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
It is very common for people to talk about services like Patreon, Substack, Spotify, etc., as being characterized by a power-law (or Pareto) distribution. This is typically shorthand for the fact that on such services only a few creators realize significant earnings, with earnings rapidly dropping off once you get beyond those in the top rankings.
However, just because the distribution of earnings exhibits rapid drop-off (as in the first graph shown above) or looks like a straight line on a log-log plot (as in other graphs above), it doesn’t necessarily follow that the distribution is truly a power-law distribution (Clauset, Shalizi, and Newman 2009). In this section I do some tests to assess whether Patreon earnings for the month in question follow a power-law distribution or not.
I now digress a bit to discuss the mathematics behind the power-law and log-normal distributions. (This also gives me a chance to play around with the \(\LaTeX\) support in R Markdown.)
A power-law distribution is a particular type of probability distribution. Assume that earnings from a service like Patreon are in whole dollars only; i.e., the earnings can take only discrete values (e.g., $1, $19, $117, $1,729, etc.). If such earnings followed a discrete power-law distribution then the probability \(p(x)\) of earning exactly \(x\) dollars would drop off based on the value of \(x\) raised to the negative power of a scaling factor \(\alpha\):
\[p(x) = Pr(X = x) = Cx^{-\alpha}\]
Here \(X\) is the observed earnings and \(C\) is a normalization constant to make the probabilities across all possible values of earnings sum to 1. As \(x\) increases \(p(x)\) decreases, and for very large values of \(x\) approaches zero.
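To make the role of the normalization constant concrete, here is a minimal numerical sketch for the discrete case with whole-dollar earnings starting at $1. The value of \(\alpha\) is an arbitrary illustrative choice (not an estimate from the Patreon data), and the infinite sum is truncated at one million terms.

```r
# Illustrative scaling factor (an assumption, not fitted to any data)
alpha <- 2.5

# Whole-dollar earnings values; the infinite sum is truncated here
x <- 1:1e6

# Normalization constant C chosen so the probabilities sum to 1
C <- 1 / sum(x^(-alpha))

# p(x) = C * x^(-alpha) for every whole-dollar value
p <- C * x^(-alpha)

# Probabilities of earning exactly $1, $19, $117, and $1,729
p[c(1, 19, 117, 1729)]
```

The probabilities fall off quickly as \(x\) grows, which is the behavior described above.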
The above is a simplification, in two aspects. First, there must be some minimum value \(x_\textrm{min} > 0\) below which the power law behavior does not hold, since \(x^{-\alpha}\) is not defined for \(x = 0\).
Second, in practice Patreon earnings can have a fractional part; for example, a project might earn $43.57 per month. Thus they are arguably better analyzed as potentially having a continuous power-law distribution. Such a distribution has a different definition of \(p(x)\)—but still one that depends on the scaling factor \(\alpha\) used as a negative power of \(x\).
Since \(x\) could in theory be any real number, it doesn’t make sense to speak of the probability of the observed earnings \(X\) being exactly equal to \(x\). Instead we look at the probability that \(X\) could be found in some small interval of width \(dx\) starting at \(x\), expressed in terms of a probability density function \(p(x)\):
\[p(x) dx = Pr(x \le X < x + dx) = Cx^{-\alpha}dx\]
The probability density function \(p(x)\) can then be expressed as
\[p(x) = \frac{\alpha -1}{x_\textrm{min}} \left( \frac{x}{x_\textrm{min}} \right)^{-\alpha}\]
If \(x_\textrm{min} = 1\) (see below) then this reduces to \(p(x) = (\alpha -1)x^{-\alpha}\).
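As a quick check on the density above, the following sketch integrates it numerically and compares the tail probability with the standard closed form \(Pr(X \gt x) = (x / x_\textrm{min})^{-(\alpha - 1)}\). The function name `pl_pdf` and the value of \(\alpha\) are my own illustrative choices.

```r
# Continuous power-law density, as in the formula above
pl_pdf <- function(x, alpha, x_min = 1) {
  (alpha - 1) / x_min * (x / x_min)^(-alpha)
}

alpha <- 2.5  # illustrative value only, not fitted to any data

# The density should integrate to 1 over [x_min, Inf)
total_prob <- integrate(pl_pdf, lower = 1, upper = Inf, alpha = alpha)$value

# Probability of a value above 10, by numerical integration ...
tail_prob <- integrate(pl_pdf, lower = 10, upper = Inf, alpha = alpha)$value

# ... and from the closed form (x / x_min)^(-(alpha - 1)) with x_min = 1
closed_form <- 10^(-(alpha - 1))

c(total_prob = total_prob, tail_prob = tail_prob, closed_form = closed_form)
```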
The log-normal distribution (as its name might imply) is related to the normal distribution (sometimes referred to as the Gaussian distribution). More specifically, per Wikipedia:
a log-normal (or lognormal) distribution is a continuous probability distribution of a random variable whose logarithm is normally distributed. Thus, if the random variable \(X\) is log-normally distributed, then \(Y = \ln(X)\) has a normal distribution.
The probability density function for a log-normal distribution is
\[p(x) = \frac{1}{x\sigma\sqrt{2\pi}}\textrm{exp}\left( -\frac{(\ln(x)-\mu)^2}{2\sigma^2}\right)\]
where \(\textrm{exp}(x) = e^x\) and \(\mu\) and \(\sigma\) are the parameters determining the exact form of the distribution. More specifically, if \(X\) is a random variable described by a log-normal distribution, then \(\ln(X)\) is a normally-distributed random variable, and \(\mu\) and \(\sigma\) are the mean and standard deviation of that normal distribution. (See below for further discussion of this.)
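The density formula above can be checked against R’s built-in dlnorm() function. This is only a sketch: the helper name `lnorm_pdf` and the parameter values are illustrative assumptions.

```r
# Log-normal density written out exactly as in the formula above
lnorm_pdf <- function(x, mu, sigma) {
  1 / (x * sigma * sqrt(2 * pi)) * exp(-(log(x) - mu)^2 / (2 * sigma^2))
}

# Illustrative parameter values (assumptions, not fitted values)
mu <- 3
sigma <- 2

x <- c(0.5, 1, 10, 100, 1000)
formula_values <- lnorm_pdf(x, mu, sigma)
builtin_values <- dlnorm(x, meanlog = mu, sdlog = sigma)

# TRUE if the hand-written formula agrees with dlnorm()
isTRUE(all.equal(formula_values, builtin_values))
```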
I attempt to fit a power-law distribution to the entire sample dataset of monthly earnings for all 128,933 Patreon projects that reported nonzero earnings from monthly charges. Since earnings can be fractional, I attempt to fit a continuous power-law distribution. I also attempt to fit a continuous exponential distribution and a continuous log-normal distribution, to see if either of those provide a better fit than a power-law distribution.
The first step is to create models for all three distributions, using as input the entire sample dataset of projects with nonzero earnings from monthly charges for the month in question.
m_pl <- conpl(earnings_tb$Earnings)
m_exp <- conexp(earnings_tb$Earnings)
m_lnorm <- conlnorm(earnings_tb$Earnings)
I now need to estimate parameters for each of the models. In particular, I need an estimate for \(x_\textrm{min}\), the cut-off point below which the models do not apply. There are two ways to do this. The first and better way is to use the estimate_xmin() function with each model (power-law, exponential, and log-normal) to try to find the best value of \(x_\textrm{min}\), one that will provide the best model fit.
Unfortunately, this is very time-consuming to do for a dataset with over 100,000 entries with values up to almost $200,000. Doing this for just one distribution took several hours on a fairly new laptop.
The alternate approach is simply to specify an arbitrary value of \(x_\textrm{min}\). For example, in the case of Patreon using the value \(x_\textrm{min} = 1\) makes intuitive sense, since $1 per month is the lowest membership tier for a lot of projects, and only 6,536 projects (5% of those reporting earnings from monthly charges) earn less than $1 a month. This will likely not give us the best fit, but the model will cover more of the overall dataset.
Setting \(x_\textrm{min}\) to an arbitrary value also simplifies comparing the different models, since they must have the same \(x_\textrm{min}\) value in order to do the comparison in a more rigorous way than simply inspecting curves on plots.
Therefore I next estimate the parameters for the three models using the arbitrary value \(x_\textrm{min} = 1\). I also need to specify a larger value for \(x_\textrm{max}\) since the estimate_xmin() function by default doesn’t look at data values higher than 100,000, and the earnings data includes values up to almost $200,000.
The parameters returned by estimate_xmin() are then plugged back into the models.
m_pl_est <- estimate_xmin(m_pl, xmins = 1, xmax = 200000)
m_pl$setXmin(m_pl_est)
m_exp_est <- estimate_xmin(m_exp, xmins = 1, xmax = 200000)
m_exp$setXmin(m_exp_est)
m_lnorm_est <- estimate_xmin(m_lnorm, xmins = 1, xmax = 200000)
m_lnorm$setXmin(m_lnorm_est)
I can now plot the so-called complementary cumulative distribution function (“ccdf”) of the data, along with the curves of best fit from the three models. The ccdf for a value \(x\) gives the probability that an observed value \(X\) will be greater than \(x\): \(Pr(X \gt x)\). On the other hand, the cumulative distribution function gives the probability that \(X\) is less than or equal to \(x\), or \(Pr(X \le x)\). We thus have \(Pr(X \gt x) = 1 - Pr(X \le x)\).
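The relationship between the two functions can be sketched with base R’s ecdf() function on a tiny made-up sample (the values below are purely illustrative, not Patreon earnings):

```r
# A small made-up sample of "earnings"
x <- c(1, 2, 2, 5, 10, 50)

# Empirical cumulative distribution function: Pr(X <= q)
cdf <- ecdf(x)

# Empirical complementary cdf: Pr(X > q) = 1 - Pr(X <= q)
ccdf <- function(q) 1 - cdf(q)

# Fraction of the sample strictly greater than 2, 10, and 50
ccdf(c(2, 10, 50))
```

Three of the six values exceed 2 and one exceeds 10, so the ccdf is 1/2 at 2, 1/6 at 10, and 0 at the sample maximum.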
The poweRlaw R package provides plotting methods for its models, so in the interest of simplicity I use the plot() and lines() functions to create the plot rather than ggplot(). The plot() function plots the ccdf of the underlying earnings data. The lines() function then adds the fitted curves for the power-law distribution (green), the exponential distribution (blue), and the log-normal distribution (orange).
plot(m_pl, xlab = "Earnings ($)", ylab = "CCDF of Earnings", main = "Patreon Monthly Earnings and Fitted Distributions", sub = "Patreon Projects Reporting Nonzero Earnings from Monthly Charges", col = "#000000")
lines(m_pl, col = "#009E73", lwd = 2)
lines(m_exp, col = "#56B4E9", lwd = 2)
lines(m_lnorm, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","exponential", "log-normal"), fill = c("#009E73","#56B4E9", "#E69F00"))
Based on the above plot, it appears that the log-normal distribution is a much better fit to the data than either the power-law or exponential distributions. However, the fit begins to break down for earnings above $1,000 per month; above that point the probability of earning more than a given amount appears to be somewhat greater than that given by the log-normal distribution. (Above that point we are also dealing with very small probabilities and a lower percentage of projects reporting their earnings.)
I can confirm that the log-normal distribution is a better fit than the power-law and exponential distributions by using the compare_distributions() function.
lnorm_vs_pl_one_sided <- compare_distributions(m_lnorm, m_pl)$p_one_sided
lnorm_vs_exp_one_sided <- compare_distributions(m_lnorm, m_exp)$p_one_sided
The one-sided p-value tests whether the first distribution is a better fit than the second. In this case the one-sided p-value is 0 when comparing the log-normal distribution to the power-law distribution and 0 when comparing the log-normal distribution to the exponential distribution. These p-values indicate that the log-normal distribution is clearly a better fit than either of the other distributions.
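To give a feel for what such a comparison involves, here is a rough base-R sketch of a likelihood-ratio comparison in the style of Vuong’s test, run on synthetic data. It is my own simplified illustration under stated assumptions, not the code used by the poweRlaw package: the data are simulated log-normal draws, \(x_\textrm{min}\) is fixed at 1, and all names are hypothetical.

```r
set.seed(42)
x_min <- 1

# Synthetic "earnings": log-normal draws, keeping values >= x_min
x <- rlnorm(5000, meanlog = 3, sdlog = 1.8)
x <- x[x >= x_min]
n <- length(x)

# Continuous power law: maximum-likelihood estimate of alpha,
# then the log-likelihood of each observation
alpha_hat <- 1 + n / sum(log(x / x_min))
ll_pl <- log((alpha_hat - 1) / x_min) - alpha_hat * log(x / x_min)

# Log-normal truncated at x_min: negative log-likelihood, minimized
# numerically (sigma is optimized on the log scale to stay positive)
neg_ll_lnorm <- function(par) {
  mu <- par[1]
  sigma <- exp(par[2])
  -sum(dlnorm(x, mu, sigma, log = TRUE) -
         plnorm(x_min, mu, sigma, lower.tail = FALSE, log.p = TRUE))
}
fit <- optim(c(mean(log(x)), log(sd(log(x)))), neg_ll_lnorm)
mu_hat <- fit$par[1]
sigma_hat <- exp(fit$par[2])
ll_lnorm <- dlnorm(x, mu_hat, sigma_hat, log = TRUE) -
  plnorm(x_min, mu_hat, sigma_hat, lower.tail = FALSE, log.p = TRUE)

# Pointwise log-likelihood ratios and the normalized test statistic
d <- ll_lnorm - ll_pl
v <- sqrt(n) * mean(d) / sd(d)

# One-sided p-value: small values favor the first (log-normal) model
p_one_sided <- pnorm(v, lower.tail = FALSE)
c(statistic = v, p_one_sided = p_one_sided)
```

With data that really are log-normal, the statistic is large and positive and the one-sided p-value is very close to zero.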
I now turn my attention to those projects reporting $1,000 or more in monthly earnings; recall that these projects constitute about 3% of all projects.
This time I create models only for the power-law and log-normal distributions, since the fit for the exponential distribution was so poor. I use as input a reduced dataset of projects with earnings of $1,000 or more from monthly charges for the month in question.
top_earnings_tb <- earnings_tb %>%
  filter(Earnings >= 1000)
m_pl_top <- conpl(top_earnings_tb$Earnings)
m_lnorm_top <- conlnorm(top_earnings_tb$Earnings)
I next estimate the parameters for the two models using the arbitrary value \(x_\textrm{min} = 1000\).
m_pl_top_est <- estimate_xmin(m_pl_top, xmins = 1000, xmax = 200000)
m_pl_top$setXmin(m_pl_top_est)
m_lnorm_top_est <- estimate_xmin(m_lnorm_top, xmins = 1000, xmax = 200000)
m_lnorm_top$setXmin(m_lnorm_top_est)
I can now plot the complementary cumulative distribution function of the data, along with the curves of best fit from the two models.
plot(m_pl_top, xlab = "Earnings ($)", ylab = "CCDF of Earnings", main = "Patreon Monthly Earnings and Fitted Distributions", sub = "Patreon Projects Reporting Monthly Earnings of $1,000 or More", col = "#000000")
lines(m_pl_top, col = "#009E73", lwd = 2)
lines(m_lnorm_top, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","log-normal"), fill = c("#009E73","#E69F00"))
In this case there is less of a visual difference between a power-law distribution and a log-normal distribution. Again I confirm that the log-normal distribution is a better fit than the power-law distribution by using the compare_distributions() function.
lnorm_vs_pl_top_one_sided <- compare_distributions(m_lnorm_top, m_pl_top)$p_one_sided
The one-sided p-value is 0.008302 when comparing the log-normal distribution to the power-law distribution, indicating that the log-normal distribution is a better fit.
However, note that the log-normal distribution for projects with $1,000 or more in monthly earnings does not have the same parameters as the log-normal distribution for all projects with monthly earnings of $1 or more, as discussed below.
As discussed above, a log-normal distribution is characterized by two parameters, \(\mu\) and \(\sigma\). In our case \(\mu\) and \(\sigma\) are given by the fitted model:
lnorm_mu <- m_lnorm$pars[1]
lnorm_sigma <- m_lnorm$pars[2]
The value of \(\mu\) is approximately 3.33 and the value of \(\sigma\) is approximately 1.84.
(As noted above, the log-normal distribution I fit to projects with $1,000 or more in monthly earnings has different \(\mu\) and \(\sigma\): -0.42 and 2.62 respectively.)
I can use the values as input to the plnorm() function to estimate the probability of a Patreon project earning more than a certain amount per month in monthly charges:
prob_over_10 <- plnorm(10, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_100 <- plnorm(100, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_1000 <- plnorm(1000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
prob_over_10000 <- plnorm(10000, meanlog = lnorm_mu, sdlog = lnorm_sigma, lower.tail = FALSE)
The estimated probabilities are as follows, with the observed probabilities in parentheses:
As is apparent from the values above, a log-normal distribution does a reasonably good job of fitting the observed Patreon data for the month in question.
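Since the underlying data is not distributed with this analysis, the calculation can be approximated using the rounded parameter values reported above (\(\mu \approx 3.33\), \(\sigma \approx 1.84\)). This is only a sketch: because the parameters are rounded, the results will differ slightly from those obtained with the fitted model, and the variable names are my own.

```r
# Rounded parameter values as reported above (not the exact fitted values)
mu_rounded <- 3.33
sigma_rounded <- 1.84

# Monthly earnings thresholds in dollars
thresholds <- c(10, 100, 1000, 10000)

# Estimated probability of earning more than each threshold
prob_over <- plnorm(
  thresholds,
  meanlog = mu_rounded,
  sdlog = sigma_rounded,
  lower.tail = FALSE
)

data.frame(threshold = thresholds, prob_over = prob_over)
```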
Recall from our discussion above that if a random variable \(X\) is log-normally distributed, then \(Y = \ln(X)\) has a normal distribution. I have reason to believe that Patreon earnings from monthly charges for the month in question are log-normally distributed. I should therefore expect that the logarithm of those earnings is normally distributed.
I explore that expectation by taking the logarithms of all earnings values and then plotting a histogram showing the number of those resulting values that fall into particular ranges of values, each with width 0.2.
earnings_tb %>%
  mutate(logEarnings = log(Earnings)) %>%
  ggplot(mapping = aes(x = logEarnings)) +
  geom_histogram(binwidth = 0.2) +
  geom_vline(xintercept = lnorm_mu, color = "#E69F00") +
  geom_vline(xintercept = lnorm_mu - lnorm_sigma, color = "#56B4E9", linetype = "dashed") +
  geom_vline(xintercept = lnorm_mu + lnorm_sigma, color = "#56B4E9", linetype = "dashed") +
  xlab("log(Monthly Earnings)") +
  ylab("Number of Projects") +
  labs(
    title = "Log(Monthly Earnings) Distribution for Patreon Projects",
    subtitle = "Patreon Projects Reporting Nonzero Earnings from Monthly Charges",
    caption = "Data source: Graphtreon Basic CSV Export, December 2022"
  ) +
  theme_gray() +
  theme(axis.title.x = element_text(margin = margin(t = 5))) +
  theme(axis.title.y = element_text(margin = margin(r = 10))) +
  theme(plot.caption = element_text(margin = margin(t = 15), hjust = 0))
The orange solid line in the plot above marks the value of \(\mu\), the first parameter estimated from fitting a log-normal distribution to the Patreon earnings data, while the blue dashed lines mark the values \(\mu - \sigma\) and \(\mu + \sigma\), where \(\sigma\) is the second parameter estimated from fitting a log-normal distribution to the data.
(The spikes in the counts on the left side of the distribution are presumably from particular levels of earnings where the data deviates from a log-normal distribution.)
Again the reason for using the symbols \(\mu\) and \(\sigma\) becomes apparent: they represent the mean and standard deviation respectively of the normal distribution corresponding to the log-normal distribution.
I can confirm that by calculating the sample mean and sample standard deviation of the logarithms of earnings:
mean_log <- mean(log(earnings_tb$Earnings))
sd_log <- sd(log(earnings_tb$Earnings))
The value of the sample mean of the logged earnings is 3.29, compared to the value 3.33 for \(\mu\), while the value of the sample standard deviation is 1.84, compared to the value 1.84 for \(\sigma\).
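The same relationship can be illustrated on simulated data, where the true parameters are known. In this sketch the sample is drawn with rlnorm() using the rounded values of \(\mu\) and \(\sigma\) reported above; the sample size and seed are arbitrary choices of mine.

```r
set.seed(1)

# Simulated log-normal "earnings" with known parameters
sim_earnings <- rlnorm(100000, meanlog = 3.33, sdlog = 1.84)

# Sample mean and standard deviation of the logged values
sim_mean_log <- mean(log(sim_earnings))
sim_sd_log <- sd(log(sim_earnings))

c(mean_log = sim_mean_log, sd_log = sim_sd_log)
```

The two values should land close to 3.33 and 1.84, the parameters used to generate the sample.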
Given that Patreon project earnings appear to be log-normally distributed, I would expect that the number of patrons per project would be log-normally distributed as well, given the relatively close correlation between number of patrons and monthly earnings. Is this in fact the case?
I attempt to fit power-law, log-normal, and exponential distributions to the entire dataset of all 217,861 Patreon projects for the month in question, including those that did not report their earnings. Since the number of patrons is always an integer value, I attempt to fit discrete distributions.
I create models for the three distributions, using as input the entire Graphtreon dataset of projects for the month in question.
m_pl_p <- displ(patreon_tb$Patrons)
m_exp_p <- disexp(patreon_tb$Patrons)
m_lnorm_p <- dislnorm(patreon_tb$Patrons)
I estimate the parameters for the three models using \(x_\textrm{min} = 1\) (again specifying a larger value for \(x_\textrm{max}\)) and plug the parameters returned by estimate_xmin() back into the models.
m_pl_p_est <- estimate_xmin(m_pl_p, xmins = 1, xmax = 50000)
m_pl_p$setXmin(m_pl_p_est)
m_exp_p_est <- estimate_xmin(m_exp_p, xmins = 1, xmax = 50000)
m_exp_p$setXmin(m_exp_p_est)
m_lnorm_p_est <- estimate_xmin(m_lnorm_p, xmins = 1, xmax = 50000)
m_lnorm_p$setXmin(m_lnorm_p_est)
I then plot the ccdf of the number of patrons per project, along with the curves of best fit from the three models.
plot(m_pl_p, xlab = "Number of Patrons", ylab = "CCDF of Number of Patrons", main = "Number of Patrons and Fitted Distributions", sub = "Patreon Projects Reporting Nonzero Number of Patrons", col = "#000000")
lines(m_pl_p, col = "#009E73", lwd = 2)
lines(m_exp_p, col = "#56B4E9", lwd = 2)
lines(m_lnorm_p, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","exponential", "log-normal"), fill = c("#009E73","#56B4E9", "#E69F00"))
As with monthly earnings, it appears that the log-normal distribution is a much better fit to the data than either the power-law or exponential distributions. However, the fit begins to break down for projects with more than a few thousand patrons.
I confirm that the log-normal distribution is a better fit than the power-law and exponential distributions by using the compare_distributions() function.
lnorm_vs_pl_one_sided_p <- compare_distributions(m_lnorm_p, m_pl_p)$p_one_sided
lnorm_vs_exp_one_sided_p <- compare_distributions(m_lnorm_p, m_exp_p)$p_one_sided
The one-sided p-value is 0 when comparing the log-normal distribution to the power-law distribution and 0 when comparing the log-normal distribution to the exponential distribution. As with monthly earnings, the log-normal distribution is clearly a better fit than either of the other distributions.
I now turn my attention to those projects reporting 1,000 or more patrons. This time I create models only for the power-law and log-normal distributions, since the fit for the exponential distribution was so poor. I use as input a reduced dataset of projects reporting 1,000 or more patrons for the month in question.
top_patrons_tb <- patreon_tb %>%
  filter(Patrons >= 1000)
m_pl_p_top <- displ(top_patrons_tb$Patrons)
m_lnorm_p_top <- dislnorm(top_patrons_tb$Patrons)
I next estimate the parameters for the two models using the arbitrary value \(x_\textrm{min} = 1000\).
m_pl_p_top_est <- estimate_xmin(m_pl_p_top, xmins = 1000, xmax = 50000)
m_pl_p_top$setXmin(m_pl_p_top_est)
m_lnorm_p_top_est <- estimate_xmin(m_lnorm_p_top, xmins = 1000, xmax = 50000)
m_lnorm_p_top$setXmin(m_lnorm_p_top_est)
I can now plot the complementary cumulative distribution function of the data, along with the curves of best fit from the two models.
plot(m_pl_p_top, xlab = "Number of Patrons", ylab = "CCDF of Number of Patrons", main = "Patreon Number of Patrons and Fitted Distributions", sub = "Patreon Projects Reporting 1,000 or More Patrons", col = "#000000")
lines(m_pl_p_top, col = "#009E73", lwd = 2)
lines(m_lnorm_p_top, col = "#E69F00", lwd = 2)
legend("bottomleft", c("power-law","log-normal"), fill = c("#009E73","#E69F00"))
There is a clear visual difference between a power-law distribution and a log-normal distribution. Again I confirm that the log-normal distribution is a better fit than the power-law distribution by using the compare_distributions() function.
lnorm_vs_pl_p_top_one_sided <- compare_distributions(m_lnorm_p_top, m_pl_p_top)$p_one_sided
The one-sided p-value is 7.27e-05 when comparing the log-normal distribution to the power-law distribution, indicating that the log-normal distribution is a better fit.
Based on the above analysis, I conclude the following:
First, the distribution of Patreon earnings from monthly charges is highly unequal, and the likelihood of any individual project making a significant amount of money is very low.
That in turn implies that the much-promoted idea of acquiring “1,000 true fans” is an almost unobtainable fantasy for the vast majority of creators on Patreon. The number of projects meeting the original criterion (1,000 fans each producing $100 in annual profits) is a minuscule fraction of a percent. Even if we loosen the criterion to $100,000 in annual earnings (not profits) derived from any number of fans, the number of projects meeting the criterion is still a fraction of a percent.
Next, for the month in question the “Patreonville” and “rest of Patreonia” subsets discussed above (containing 10,000 and 100,000 projects respectively) together accounted for over three-fifths of the earnings from monthly charges (62%), while the “rest of Patreonia” subset alone accounted for over a quarter (26%).
Assuming that Patreon takes in as fees an approximately equal percentage of the revenue that each project makes, that means that, at least for the subset of projects publicly reporting their monthly earnings, for the month in question Patreon derived the majority of its own revenue from projects with relatively low earnings. That in turn implies that serving such low-earnings projects could be profitable for Patreon, as long as the marginal cost to serve a new project is close to zero.
Finally, for the month in question the distribution of Patreon earnings from monthly charges does not follow a power law, but rather can be best modeled using a log-normal distribution.
This gives rise to a speculation about what causes the distribution of Patreon earnings to have a log-normal distribution, prompted by my reading of Sornette, Wheatley, and Cauwels (2019). The general idea is that success in a particular endeavor (as measured by such things as productivity, monetary rewards, popularity, etc.) is not the result of a single factor but of several relatively independent factors that multiply together.
Thus, for example, success as an artist on Patreon might depend on raw artistic ability, productivity (how many drawings they can produce in a given time), the potential size of their audience (for example, based on the type of art they create), their skill at self-promotion, and so on.
Because these factors multiply together, relatively small variations between people can be magnified to produce relatively large differences in success. For example, if there are four factors equally influencing overall success (as in the previous paragraph) and one person is 20% better at each of them than another person (i.e., their performance on each factor is 1.2 times that of the other person), then overall they will be about twice as successful (\(1.2 \cdot 1.2 \cdot 1.2 \cdot 1.2 \approx 2.1\)).
The more factors there are determining overall success, and the more variation in relevant factors influencing success among the people pursuing it, the greater the differences in overall success: a few people will be very successful, but most people will have little or no success.
How does this relate to the log-normal distribution? Sornette, Wheatley, and Cauwels put it as follows (2019, 5n9):
The logarithm of the product [of different factors influencing success] is the sum of the logarithms of the different factors. If the factors are independent, then—to a good approximation, and if the central limit theorem is applicable—their sum will be normally distributed, and hence so will be the logarithm of the productivity. Convergence to the Log-normal can be slow …. But, if the individual factors are themselves Log-normal, then the product is immediately Log-normal.
The Central Limit Theorem states that, under a fairly broad range of conditions, the sum of many relatively independent variables will be normally distributed. But if the logarithm of a measure of success is normally distributed, then that means that the measure of success itself follows a log-normal distribution—which is exactly what we see in the case of Patreon earnings.
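The multiplicative argument can be illustrated with a small simulation. This is a sketch under assumptions of my own: each simulated person has 12 independent factors, each drawn uniformly between 0.5 and 1.5, and “success” is simply their product. None of these choices comes from the Patreon data or from Sornette, Wheatley, and Cauwels.

```r
set.seed(123)

n_people <- 10000   # number of simulated creators (arbitrary)
n_factors <- 12     # number of independent factors (arbitrary)

# Each row holds one person's factors, drawn uniformly from [0.5, 1.5]
factors <- matrix(
  runif(n_people * n_factors, min = 0.5, max = 1.5),
  nrow = n_people
)

# Overall success is the product of a person's factors
success <- apply(factors, 1, prod)
log_success <- log(success)

# Simple sample skewness: near 0 for a symmetric (e.g., normal) shape
skewness <- function(v) mean((v - mean(v))^3) / sd(v)^3

skew_raw <- skewness(success)      # strongly right-skewed
skew_log <- skewness(log_success)  # much closer to symmetric

# The four-factor example from the text: 20% better on each factor
advantage <- 1.2^4

c(skew_raw = skew_raw, skew_log = skew_log, advantage = advantage)
```

The raw success values have a long right tail, while their logarithms are roughly symmetric, as would be expected if success is approximately log-normally distributed.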
Sornette, Wheatley, and Cauwels also contend that overall success is strongly influenced by luck, and that the effects of luck can lead to true power-law behavior (or nearly so) at both the bottom and top ends of the success spectrum (2019, 4–9).
The analysis above is inconclusive on this point; the distribution of earnings at the top end still appears to be log-normal. And in any case, the available dataset may not be able to provide a definitive answer, since the percentage of projects not reporting earnings is significantly higher at the top end.
Chayka, Kyle. 2021. “What the ‘Creator Economy’ Promises—and What It Actually Does.” New Yorker. July 17, 2021. https://www.newyorker.com/culture/infinite-scroll/what-the-creator-economy-promises-and-what-it-actually-does.
Chen, J. J., and Hernando Cortina. 2020. “gglorenz: Plotting Lorenz Curve with the Blessing of ‘ggplot2’.” https://CRAN.R-project.org/package=gglorenz.
Clauset, Aaron, Cosma Rohilla Shalizi, and M. E. J. Newman. 2009. “Power-Law Distributions in Empirical Data.” [arXiv:0706.1062v2] [physics.data-an]. https://doi.org/10.48550/arXiv.0706.1062.
Farris, Frank A. 2010. “The Gini Index and Measures of Equitability.” American Mathematical Monthly, December 2010: 851-864. https://scholarcommons.scu.edu/math_compsci/14/.
Gillespie, Colin S. 2015. “Fitting Heavy Tailed Distributions: The poweRlaw Package.” Journal of Statistical Software, 64(2), 1–16. http://www.jstatsoft.org/v64/i02/.
Graphtreon. n.d. “Data Services.” Accessed April 3, 2023. https://graphtreon.com/data-services.
Hadley, Wickham, and RStudio. 2023. “tidyverse: Easily Install and Load the ‘Tidyverse’.” https://CRAN.R-project.org/package=tidyverse.
Hecker, Frank. 2008. “Income Inequality in Howard County, Part 1.” FrankHecker.com (blog). November 16, 2008. https://frankhecker.com/2008/11/16/income-inequality-in-howard-county-part-1/.
———. 2023a. “Distribution of Earnings Among Patreon Projects Charging by the Month.” RPubs. January 21, 2023. https://rpubs.com/frankhecker/993611.
———. 2023b. “Distribution of the Number of Patrons Per Patreon Project.” RPubs. January 22, 2023. https://rpubs.com/frankhecker/994383.
———. 2023c. “Earnings Per Patron Among Patreon Projects Charging by the Month.” RPubs. February 5, 2023. https://rpubs.com/frankhecker/999354.
Kelly, Kevin. 2008. “1,000 True Fans.” The Technium (blog). March 4, 2008. https://kk.org/thetechnium/1000-true-fans/.
Lorenz, M. O. 1905. “Methods of Measuring the Concentration of Wealth.” Publications of the American Statistical Association, Vol. 9, No. 70 (Jun., 1905), 209-219. https://doi.org/10.2307/2276207.
Patreon. n.d. “The Patreon Story.” Accessed April 3, 2023. https://www.patreon.com/about.
Patreon. 2022. “The First-Ever Patreon Creator Census.” Patreon Blog. May 4, 2022. https://blog.patreon.com/the-first-ever-patreon-creator-census.
R Foundation. n.d. “What is R?” Accessed April 3, 2023. https://www.r-project.org/about.html.
Signorell, Andri, et al. 2023. “DescTools: Tools for Descriptive Statistics.” Accessed April 3, 2023. https://CRAN.R-project.org/package=DescTools.
Sornette, Didier, Spencer Wheatley, and Peter Cauwels. 2019. “The Fair Reward Problem: The Illusion of Success and How to Solve It.” [arXiv:1902.04940v2] [econ.GN]. https://doi.org/10.48550/arXiv.1902.04940.
US Census Bureau. 2021a. American Community Survey 2021, Table B19083, “Gini Index of Income Inequality.” Accessed April 7, 2023. https://data.census.gov/table?q=B19083:+GINI+INDEX+OF+INCOME+INEQUALITY&g=010XX00US$0500000.
US Census Bureau. 2021b. American Community Survey 2021, Table B19013, “Median Household Income in the Past 12 Months (in 2021 Inflation-Adjusted Dollars).” Accessed April 7, 2023. https://data.census.gov/table?q=B19013:+MEDIAN+HOUSEHOLD+INCOME+IN+THE+PAST+12+MONTHS+(IN+2021+INFLATION-ADJUSTED+DOLLARS)&g=010XX00US$0500000.
US Department of Health and Human Services. 2023. “HHS Poverty Guidelines for 2023.” Accessed April 7, 2023. https://aspe.hhs.gov/topics/poverty-economic-mobility/poverty-guidelines.
World Bank. 2023. World Bank Open Data, “Gini Index.” Accessed April 7, 2023. https://data.worldbank.org/indicator/SI.POV.GINI?most_recent_value_desc=true.
This analysis is subject to the following caveats, among others:
Here are some possible analyses that might be worth doing in future:
I used the following R environment in doing the analysis above:
sessionInfo()
## R version 4.2.3 (2023-03-15)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Big Sur ... 10.16
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.2/Resources/lib/libRlapack.dylib
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] tools stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] gglorenz_0.0.2 poweRlaw_0.70.6 DescTools_0.99.48 lubridate_1.9.2
## [5] forcats_1.0.0 stringr_1.5.0 dplyr_1.1.1 purrr_1.0.1
## [9] readr_2.1.4 tidyr_1.3.0 tibble_3.2.1 ggplot2_3.4.2
## [13] tidyverse_2.0.0
##
## loaded via a namespace (and not attached):
## [1] Rcpp_1.0.10 mvtnorm_1.1-3 lattice_0.21-8 class_7.3-21
## [5] digest_0.6.31 utf8_1.2.3 R6_2.5.1 cellranger_1.1.0
## [9] pracma_2.4.2 evaluate_0.20 rootSolve_1.8.2.3 e1071_1.7-13
## [13] highr_0.10 httr_1.4.5 pillar_1.9.0 rlang_1.1.0
## [17] Exact_3.2 readxl_1.4.2 rstudioapi_0.14 data.table_1.14.8
## [21] jquerylib_0.1.4 Matrix_1.5-4 ineq_0.2-13 rmarkdown_2.21
## [25] labeling_0.4.2 bit_4.0.5 munsell_0.5.0 proxy_0.4-27
## [29] compiler_4.2.3 xfun_0.38 pkgconfig_2.0.3 htmltools_0.5.5
## [33] tidyselect_1.2.0 lmom_2.9 expm_0.999-7 fansi_1.0.4
## [37] crayon_1.5.2 tzdb_0.3.0 withr_2.5.0 MASS_7.3-58.3
## [41] grid_4.2.3 jsonlite_1.8.4 gtable_0.3.3 lifecycle_1.0.3
## [45] magrittr_2.0.3 scales_1.2.1 gld_2.6.6 vroom_1.6.1
## [49] cli_3.6.1 stringi_1.7.12 cachem_1.0.7 farver_2.1.1
## [53] bslib_0.4.2 generics_0.1.3 vctrs_0.6.1 boot_1.3-28.1
## [57] bit64_4.0.5 glue_1.6.2 hms_1.1.3 parallel_4.2.3
## [61] fastmap_1.1.1 yaml_2.3.7 timechange_0.2.0 colorspace_2.1-0
## [65] knitr_1.42 sass_0.4.5
The source code for this analysis can be found in the public code repository https://gitlab.com/frankhecker/misc-analysis in the patreon subdirectory.
This document and its source code are available for unrestricted use, distribution and modification under the terms of the Creative Commons CC0 1.0 Universal (CC0 1.0) Public Domain Dedication. Stated more simply, you’re free to do whatever you’d like with it.