The digital transformation has overturned the physical restrictions that shaped the creative industries of the 20th century, substantially lowering the entry barriers and supply limits previously dictated by shelf-space constraints (Waldfogel 2017). Theories describing this new landscape of near-infinite supply have been shaped by the long tail phenomenon popularized by Anderson (2004, 2006), which suggests that the aggregate value of low-selling niche titles (whose emergence was made possible by the democratization of the means of production through digitization) can rival and even exceed that of mainstream blockbusters. While the supply side of this shift is well documented and widely studied, the underlying individual demand-side dynamics remain largely opaque. This study aims to bridge this gap by examining how individual listening habits (individual music tailoring) contribute to the broader aggregate trends.
Many studies following Anderson’s (2006) exploration of the long tail phenomenon have disputed the significance of this hypothesis (Elberse 2008 and Liebowitz and Zentner 2025, among others), finding a coexisting Superstar effect, in which a small fraction of titles captures an even larger share of total plays or sales, potentially depressing the sales of niche titles. By examining a 15-year period that covers the rise of streaming services, the fall of piracy and the maturation of social music platforms, RQ1 tests whether the market has truly diversified or has instead succumbed to a blockbuster strategy.
Traditional market-level analyses often overlook whether the emergence of long tails results from a collective shift of consumption habits towards more diversified individual patterns or from clusters of homogeneous, niche-focused consumers. According to a hypothesis presented by Anderson (2006), the democratization of access allows consumers to tailor their consumption to the content that best fits their tastes.
The analysis presented in this study, based on the LFM-2b dataset (Schedl et al. 2022), illustrates an evolution of this tailoring process (see section Individual Tails), ultimately revealing a gradual shift towards more exploratory behavior, where users increasingly engage with the niche segments of their own libraries.
The analysis begins by loading the data. I use the file that stores each user’s playcounts aggregated by year; because of its large size, it is read with the “data.table” package.
The analysis in this project is based on LFM-2b, a dataset made available by Schedl et al. (2022) and currently accessible solely through the Wayback Machine, as the source platform revoked the license for publishing large datasets derived from its site. LFM-2b contains the listening records of over 120,000 users of the music platform last.fm, amounting to more than 2 billion listening events (interactions between a distinct user and a specific song, enriched with its metadata) involving more than 50 million distinct tracks by over 5 million artists. Additionally, LFM-2b contains demographic information about users (age, gender, country) and track-level metadata such as tags, microgenres and vector embeddings of lyrics.
This file contains 744,785,644 observations, each a tuple of <user_id, track_id, timestamp, playcount_yearly>.
library(data.table) # fread() and the by-group syntax used below
year <- fread("/Users/gosia/Desktop/tails/lastfm_data/yearly.tsv", sep = "\t")
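To gain a deeper understanding of the data, the summary statistics are displayed in the table below.
For each year, these descriptive statistics include the number of active users (n_users), the average number of unique tracks per user (avg_tracks_per_user), the average playcount per user (avg_user_playcount), the total playcount (total_playcount) and the number of distinct tracks played (n_tracks).
These metrics already shed light on the nature of consumption and on temporal changes in music listening habits.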
# using package data.table for efficiency
# user statistics
user_yearly_stats <- year[, .(
user_unique_tracks = uniqueN(track_id),
user_total_playcount = sum(as.numeric(playcount_yearly))),
by = .(user_id, year = year(timestamp))]
tracks <- year[,.(
n_tracks = uniqueN(track_id)),
by =.(year = year(timestamp))]
# summary statistics
summary_stats <- user_yearly_stats[, .(
n_users = .N,
avg_tracks_per_user = mean(user_unique_tracks),
avg_user_playcount = mean(user_total_playcount),
total_playcount = sum(user_total_playcount)),
by = year][order(year)]
summary_stats <- merge(summary_stats, tracks, by = "year")[order(year)]
print(summary_stats)
## Key: <year>
## year n_users avg_tracks_per_user avg_user_playcount total_playcount
## <int> <int> <num> <num> <num>
## 1: 1970 2 1.0000 1.000 2
## 2: 2005 713 1019.4951 3015.400 2149980
## 3: 2006 2394 1122.1792 3352.736 8026451
## 4: 2007 5609 1183.9863 3492.770 19590949
## 5: 2008 10889 1178.3385 3255.910 35453602
## 6: 2009 21231 1081.7030 2915.909 61907659
## 7: 2010 34152 1092.6612 2949.087 100717236
## 8: 2011 61138 1014.4626 2715.807 166039028
## 9: 2012 114797 1078.4056 2785.362 319751175
## 10: 2013 84836 1221.7860 3223.327 273454128
## 11: 2014 59861 1506.1195 4408.282 263884146
## 12: 2015 43787 1560.1864 4424.597 193739838
## 13: 2016 33123 1633.9497 4393.827 145536732
## 14: 2017 27057 1837.5453 4856.936 131414118
## 15: 2018 22708 2059.4292 5592.733 126999777
## 16: 2019 18316 2645.2807 7378.372 135142263
## 17: 2020 15258 957.9756 1989.631 30357786
## 18: 2023 1 1.0000 1.000 1
## 19: 2033 1 1.0000 1.000 1
## n_tracks
## <int>
## 1: 2
## 2: 390610
## 3: 1045089
## 4: 2086446
## 5: 3316272
## 6: 4917837
## 7: 6894332
## 8: 9690020
## 9: 14987437
## 10: 14390889
## 11: 14276578
## 12: 12210916
## 13: 10636571
## 14: 9935709
## 15: 9497939
## 16: 9425045
## 17: 4082530
## 18: 1
## 19: 1
Also immediately noticeable are the outlying records dated to 1970, 2023 and 2033 (four rows in total, as the row indices printed below show), which most likely result from a UNIX time error. LFM-2b covers listening events between 14 February 2005 and 20 March 2020; all records outside this period are treated as faulty and excluded from further analysis.
# excluding errors from summary_stats
summary_stats <- summary_stats[!summary_stats$year %in% c(1970, 2023, 2033)]
# excluding 1970, 2023, 2033 from the dataset (year)
year[, timestamp := year(timestamp)] # converting timestamp into only year
# identifying row indexes of errors
(error <- which(year$timestamp %in% c(1970, 2023, 2033)))
## [1] 48185922 228129312 297776742 586436779
# too large to remove through copying - overwriting NAs
year[error, `:=`(user_id = NA, track_id = NA, timestamp = NA, playcount_yearly = NA)]
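As a minimal sketch of why a UNIX time error is a plausible explanation, the snippet below shows that a timestamp stored as zero seconds resolves to the year 1970, and then flags years outside the documented coverage window by range rather than by listing them. The zero-timestamp cause is an assumption, not something the dataset documents, and it does not explain the 2023 and 2033 records; `obs_years` is a toy vector introduced here for illustration.

```r
# Assumption: a missing timestamp was stored as 0 seconds since the UNIX epoch
zero_ts <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
format(zero_ts, "%Y")  # year component of the zero timestamp

# Toy vector of years (hypothetical, not read from the dataset)
obs_years <- c(1970, 2005, 2012, 2020, 2023, 2033)

# Flag everything outside the documented coverage window 2005-2020
out_of_range <- obs_years < 2005 | obs_years > 2020
obs_years[out_of_range]
```

A range-based filter of this kind would also catch faulty years that were not spotted by eye in the summary table.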
The illustration below shows the changes in the yearly user base. The number of users in the dataset grew steadily until 2012, nearly doubling between 2011 and 2012. The 2012 peak of 114,797 users was followed by gradual attrition, which was most abrupt in 2013, 2014 and 2015, with losses of approximately 30,000, 25,000 and 16,000 users, respectively. A further loss of about 10,700 users followed in 2016, after which the decline stabilized at roughly 3,000 to 6,000 users per year.
library(ggplot2)
ggplot(summary_stats, aes(x = year, y = n_users)) +
geom_col(fill = "steelblue", alpha = 0.7) +
labs(title = "Number of Users Over Time",
x = "Year",
y = "Users") +
theme_minimal()
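The year-over-year changes in the user base discussed above can be checked directly from the summary table. The sketch below copies the n_users column for 2005-2020 by hand (the vector name `n_users` is introduced here for illustration) and differences it.

```r
# Yearly user counts copied from the summary table (2005-2020)
n_users <- c(713, 2394, 5609, 10889, 21231, 34152, 61138, 114797,
             84836, 59861, 43787, 33123, 27057, 22708, 18316, 15258)
names(n_users) <- 2005:2020

# diff() returns each year's change relative to the previous year
yearly_change <- diff(n_users)
print(yearly_change)
```

Negative entries mark years in which the user base shrank; the first negative value appears in 2013.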
The plot below visualizes the number of tracks that were being listened to in each year (blue bars and the y-axis on the left), as well as the average number of tracks played by each user (red line and the right y-axis).
# tracks over time + avg number of tracks per user
coeff <- max(summary_stats$n_tracks)/max(summary_stats$avg_tracks_per_user)
ggplot(summary_stats, aes(x = year)) +
geom_col(aes(y = n_tracks), fill = "steelblue", alpha = 0.7) +
geom_line(aes(y = avg_tracks_per_user * coeff), color = "darkred", linewidth = 0.7) +
geom_point(aes(y = avg_tracks_per_user * coeff), color = "darkred", size = 0.8) +
scale_y_continuous(
name = "Total Tracks",
sec.axis = sec_axis(~./coeff, name = "Avg Tracks per User")
) +
labs(title = "Total Tracks vs Avg Tracks per User", x = "Year") +
theme_minimal()
The plot below illustrates the global trends of music consumption in the LFM-2b dataset. Global playcounts rose gradually until 2012, the year in which the user base peaked, and yearly consumption has declined since then. This decline in aggregated music consumption can be attributed mostly to user attrition, seeing as the red line, which shows average playcounts per user, was on the rise after 2012, apart from a period of stagnation in 2014-2016. The year 2020 can be treated as an anomaly due to its limited coverage, because data collection ended in March 2020.
# total playcount vs avg playcount per user
coeff2 <- max(summary_stats$total_playcount)/max(summary_stats$avg_user_playcount)
ggplot(summary_stats, aes(x = year)) +
geom_col(aes(y = total_playcount), fill = "steelblue", alpha = 0.7) +
geom_line(aes(y = avg_user_playcount * coeff2), color = "darkred", linewidth = 0.7) +
geom_point(aes(y = avg_user_playcount * coeff2), color = "darkred", size = 0.8) +
scale_y_continuous(
name = "Aggregated Playcounts",
sec.axis = sec_axis(~./coeff2, name = "Avg Playcounts per User")
) +
labs(title = "Aggregated Playcounts vs Avg Playcounts per User", x = "Year") +
theme_minimal()
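The attribution of the aggregate decline to user attrition can be made explicit with a simple decomposition. Since total playcount equals the number of users times the average playcount per user, the log change in the total splits into a user term and a per-user term. The sketch below uses the 2012 and 2019 rows of the summary table; all variable names are introduced here for illustration.

```r
# Values copied from the summary table (rows for 2012 and 2019)
n_users_2012 <- 114797; avg_plays_2012 <- 2785.362; total_2012 <- 319751175
n_users_2019 <- 18316;  avg_plays_2019 <- 7378.372; total_2019 <- 135142263

# Identity check: total = users x average (up to rounding of the printed mean)
c(reconstructed = n_users_2012 * avg_plays_2012, reported = total_2012)

# Log decomposition of the change between 2012 and 2019
d_total <- log(total_2019 / total_2012)
d_users <- log(n_users_2019 / n_users_2012)
d_avg   <- log(avg_plays_2019 / avg_plays_2012)
round(c(total = d_total, users = d_users, avg_per_user = d_avg), 3)
```

The user term is negative and larger in magnitude than the positive per-user term, which is consistent with the reading that attrition, not reduced listening, drives the fall in total playcounts.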
Brynjolfsson et al. (2003) made an early claim in support of the long tail hypothesis for the Amazon book market. Having no direct data on the number of books sold, they inferred the shape of the demand curve from the assumption of a power law relationship between an item’s sales rank and its unit sales, meaning that the logarithm of the sales rank and the logarithm of unit sales are linearly related. The data at our disposal is more granular, providing insight not only into aggregate sales but also into individual consumption.
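Written out, the power law assumption takes the following form. The symbols are notation introduced here for illustration: \(s_r\) denotes the unit sales (or plays) of the item with rank \(r\), \(A > 0\) is a scale constant and \(\alpha > 0\) is the power law exponent.

\[ s_r = A \, r^{-\alpha} \quad \Longleftrightarrow \quad \log s_r = \log A - \alpha \log r \]

Under this assumption a log-log plot of sales against rank is a straight line with slope \(-\alpha\); systematic curvature in such a plot is therefore evidence against a pure power law.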
A quick look at the summary of the yearly aggregated track-level playcounts reveals a large dispersion in total playcounts: the first quartile is one playcount, the median is two playcounts per track and the third quartile is six, while the mean is 15.76 and the maximum number of playcounts accumulated by a single track in one year is 179,749. This already points towards a substantial long tail in aggregated music consumption.
# preparing data - yearly playcounts for songs
tracks_yearly_playcounts <- year[, .(
total_playcounts = sum(as.numeric(playcount_yearly))),
by = .(track_id, timestamp)]
summary(tracks_yearly_playcounts)
## track_id timestamp total_playcounts
## Min. : 0 Min. :2005 Min. : 1.00
## 1st Qu.:13438200 1st Qu.:2012 1st Qu.: 1.00
## Median :25675888 Median :2014 Median : 2.00
## Mean :25675706 Mean :2014 Mean : 15.76
## 3rd Qu.:38018248 3rd Qu.:2017 3rd Qu.: 6.00
## Max. :50813372 Max. :2020 Max. :179749.00
## NA's :1 NA's :1 NA's :1
library(DBI) # dbConnect(), dbGetQuery(), dbExecute()
con <- dbConnect(duckdb::duckdb(), dbdir = "tails.duckdb")
duckdb::duckdb_register(con, "tracks_yearly_playcounts", tracks_yearly_playcounts)
This section tests the previously assumed power law (Zipf) distribution. The interactive log-log plots below make it possible to identify departures from that distribution.
The log-log plots lead to a conclusion similar to Liebowitz and Zentner’s (2024) critique of the power law relationship between unit sales and sales rank assumed by Brynjolfsson et al. (2003). The relationship between the logarithm of rank and the logarithm of playcounts appears concave rather than linear, which suggests that the long tail is less significant than previous studies implied (that is, smaller variety-induced welfare gains).
library(plotly)
years <- 2005:2020
fig_zipf <- plot_ly()
for (yr in years) {
query <- sprintf("
WITH ranked_data AS (
SELECT
total_playcounts,
ROW_NUMBER() OVER(ORDER BY total_playcounts DESC) as rank
FROM tracks_yearly_playcounts
WHERE timestamp = %d
)
SELECT
log10(rank) as log_rank,
log10(total_playcounts) as log_plays
FROM ranked_data
WHERE rank <= 1000000 -- limit to the top one million ranks for the log-log plot
AND (
rank <= 1000 OR
(rank <= 10000 AND rank %% 10 = 0) OR
(rank <= 100000 AND rank %% 100 = 0) OR
(rank %% 1000 = 0)
)", yr)
dt_log <- dbGetQuery(con, query)
fit <- lm(log_plays ~ log_rank, data = dt_log)
dt_log$fit_line <- predict(fit)
is_visible <- (yr == 2005)
fig_zipf <- fig_zipf %>% add_markers(
data = dt_log, x = ~log_rank, y = ~log_plays,
name = paste("Data", yr), visible = is_visible,
marker = list(color = "midnightblue", opacity = 0.3, size = 4)
)
fig_zipf <- fig_zipf %>% add_lines(
data = dt_log, x = ~log_rank, y = ~fit_line,
name = paste("Slope", yr), visible = is_visible,
line = list(color = "red", dash = "dash", width = 2)
)
}
buttons <- lapply(seq_along(years), function(i) {
vis_vector <- rep(FALSE, length(years) * 2)
vis_vector[c(2*i - 1, 2*i)] <- TRUE
list(method = "restyle",
args = list("visible", vis_vector),
label = as.character(years[i]))
})
fig_zipf <- fig_zipf %>% layout(
title = "<b>Log-Log Frequency Plot </b>",
xaxis = list(title = "Log10(Rank)"),
yaxis = list(title = "Log10(Playcounts)"),
updatemenus = list(list(buttons = buttons, direction = "down", x = 0.1, y = 1.15)),
showlegend = FALSE
) %>% toWebGL()
fig_zipf
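The concavity read off the log-log plots can also be checked numerically. One possible approach, which is not the method used in this report, is to add a squared log-rank term to the linear fit and inspect its sign. The sketch below does this on synthetic data in which the curvature is built in on purpose; every name and parameter value is an assumption made for illustration.

```r
set.seed(1)

# Synthetic ranks and log play counts with deliberate concavity
rnk <- 1:5000
log_rank <- log10(rnk)
log_plays <- 5 - 0.6 * log_rank - 0.15 * log_rank^2 +
  rnorm(length(rnk), sd = 0.05)

# Pure power law (straight line) versus a fit with a squared term
fit_linear <- lm(log_plays ~ log_rank)
fit_quad   <- lm(log_plays ~ log_rank + I(log_rank^2))

# A negative coefficient on the squared term indicates a concave curve
coef(fit_quad)

# F-test of the squared term against the straight-line model
anova(fit_linear, fit_quad)
```

Applied to the real data, the same comparison could be run on each year's `dt_log` in place of the synthetic vectors.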
The interactive plots below reproduce an analysis similar to the one in the introduction to Anderson’s (2006) study of long tails. The first plot shows the playcounts of the top 25,000 tracks ordered by their aggregated yearly popularity, the second shows ranks 25,000 to 100,000, and the last shows the far end of the tail, ranks 100,000 to 800,000. These figures illustrate that, while a relatively small number of items accounts for a disproportionately large fraction of total consumption, the tail is nevertheless heavy. Individually, the items in the further sections of the tail are not popular, but in aggregate they represent a substantial fraction of the market.
years <- 2005:2020
fig1 <- plot_ly() # Head
fig2 <- plot_ly() # Middle
fig3 <- plot_ly() # Far Tail
for (yr in years) {
# data from duckdb - with thinning
query <- sprintf("WITH ranked_data AS (
SELECT total_playcounts, ROW_NUMBER() OVER(ORDER BY total_playcounts DESC) as rank
FROM tracks_yearly_playcounts WHERE timestamp = %d
)
SELECT rank, total_playcounts,
CASE WHEN rank <= 25000 THEN 1 WHEN rank <= 100000 THEN 2 ELSE 3 END as segment
FROM ranked_data
WHERE rank <= 800000 AND (
(rank <= 25000 AND rank %% 10 = 0) OR
(rank > 25000 AND rank <= 100000 AND rank %% 50 = 0) OR
(rank > 100000 AND rank %% 250 = 0))", yr)
dt_yr <- dbGetQuery(con, query)
is_visible <- (yr == 2005)
fig1 <- fig1 %>% add_lines(data = dt_yr[dt_yr$segment == 1,], x = ~rank, y = ~total_playcounts,
name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
fig2 <- fig2 %>% add_lines(data = dt_yr[dt_yr$segment == 2,], x = ~rank, y = ~total_playcounts,
name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
fig3 <- fig3 %>% add_lines(data = dt_yr[dt_yr$segment == 3,], x = ~rank, y = ~total_playcounts,
name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
}
# merging plots into one subplot
final_fig <- subplot(fig1, fig2, fig3, nrows = 1, margin = 0.05, shareX = FALSE, shareY = FALSE) %>%
toWebGL()
# dropdown buttons
buttons <- lapply(seq_along(years), function(i) {
vis_vector <- rep(FALSE, length(years) * 3)
vis_vector[i] <- TRUE
vis_vector[i + length(years)] <- TRUE
vis_vector[i + 2 * length(years)] <- TRUE
list(method = "restyle",
args = list("visible", vis_vector),
label = as.character(years[i]))
})
final_fig <- final_fig %>% layout(
title = list(text = "<b>Long Tail Analysis: Head, Middle, and Far Tail Trends</b>", y = 0.95),
xaxis = list(title = "Rank (Head)"),
xaxis2 = list(title = "Rank (Middle)"),
xaxis3 = list(title = "Rank (Far Tail)"),
yaxis = list(title = "Plays"),
updatemenus = list(list(buttons = buttons, direction = "down", x = 0.05, y = 1.1)),
margin = list(t = 100, b = 50, l = 50, r = 20),
showlegend = FALSE
)
final_fig
To verify whether the Pareto principle (20% of all products account for 80% of all sales revenue) still holds in the data after digitization, the relative cumulative share curve is analyzed. The dashed gray lines mark the Pareto point, and the cumulative distributions approach it to varying degrees: 2006, 2007 and 2020 (an anomaly) pass through the 20%-80% point, while in all other years, save for 2005, the top 20% of titles account for an even larger proportion of all consumption. Contrary to Anderson’s (2004, 2006) findings, a rather small proportion of top tracks seems to be responsible for the majority of aggregated consumption. This does not, however, refute the long tail hypothesis, as a significant portion of the market still lies outside even the top 50% of tracks.
# calculate the share of titles vs. the share of plays
dbExecute(con, "CREATE OR REPLACE TABLE concentration_data AS
WITH yearly_stats AS (
SELECT
timestamp,
total_playcounts,
SUM(total_playcounts) OVER(PARTITION BY timestamp) AS sum_total,
COUNT(*) OVER(PARTITION BY timestamp) AS n_total,
ROW_NUMBER() OVER(PARTITION BY timestamp ORDER BY total_playcounts DESC) AS rank
FROM tracks_yearly_playcounts
WHERE timestamp BETWEEN 2005 AND 2020
)
SELECT
timestamp,
rank,
CAST(rank AS DOUBLE) / n_total AS title_share,
SUM(total_playcounts) OVER(
PARTITION BY timestamp
ORDER BY rank -- same ordering as the rank column, so tied tracks are handled consistently
ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
) / sum_total AS play_share
FROM yearly_stats;")
## [1] 127784220
# retrieving
plot_query <- "SELECT * FROM concentration_data
WHERE title_share <= 0.5
AND (rank <= 1000 OR rank % 100 = 0)"
concentration_plot_df <- dbGetQuery(con, plot_query)
library(scales) # percent(), label_percent(), percent_format()
ggplot(concentration_plot_df, aes(x = title_share, y = play_share, color = as.factor(timestamp))) +
geom_line(linewidth = 0.5) +
# Pareto
geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray", linewidth = 0.3) +
geom_vline(xintercept = 0.20, linetype = "dashed", color = "gray", linewidth = 0.3) +
coord_cartesian(xlim = c(0, 0.5), ylim = c(0, 1)) +
scale_x_continuous(labels = percent) +
scale_y_continuous(labels = percent) +
labs(title = "Cumulative Share Concentration (2005-2020)",
x = "Top % of Titles", y = "Cumulative % of Plays") +
theme_minimal()
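The Pareto check amounts to sorting items by popularity and reading off the cumulative share at a given fraction of titles. A minimal base R sketch on a toy catalogue follows; the function `top_share` and the vector `toy_plays` are hypothetical names introduced here, not part of the report's pipeline.

```r
# Share of total plays captured by the top `top_frac` of titles
top_share <- function(plays, top_frac = 0.2) {
  sorted <- sort(plays, decreasing = TRUE)
  n_top <- max(1, floor(top_frac * length(sorted)))
  sum(sorted[seq_len(n_top)]) / sum(sorted)
}

# Toy catalogue: 2 hits and 8 niche titles, 1000 plays in total
toy_plays <- c(400, 400, 25, 25, 25, 25, 25, 25, 25, 25)

top_share(toy_plays, 0.2)  # 800 of 1000 plays go to the top 2 titles
top_share(toy_plays, 0.5)  # 875 of 1000 plays go to the top 5 titles
```

This toy catalogue sits exactly on the 20%-80% point; a value above 0.8 at `top_frac = 0.2` would correspond to a curve passing above the Pareto point in the plot.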
Building on Anderson’s characterization of the long tail as the part of the market that exists thanks to digitization, the share of playcounts is analyzed by absolute rank. Anderson (2006) and Goel et al. (2010) put the supply of an average brick-and-mortar music shop at 55,000 and 50,000 tracks, respectively. The benefits of the supply expansion attributed to digitization are plain to see: the long tail, defined by an absolute cutoff of 50,000 tracks, exceeds 40% of the playcount share in all years except 2005.
plot_query_absolute <- "SELECT timestamp, rank, play_share
FROM concentration_data
WHERE rank <= 60000
AND (rank <= 1000 OR rank % 50 = 0)"
concentration_abs_df <- dbGetQuery(con, plot_query_absolute)
ggplot(concentration_abs_df, aes(x = rank / 1000, y = play_share, color = as.factor(timestamp))) +
geom_line(linewidth = 0.5) +
coord_cartesian(xlim = c(0, 60), ylim = c(0, 1)) +
scale_x_continuous(breaks = seq(0, 60, 20)) +
scale_y_continuous(labels = label_percent(), breaks = seq(0, 1, 0.2)) +
labs(title = "Share of Playcounts Using Absolute Cutoffs",
x = "Rank of Title (1,000s)",
y = "Percent of Playcounts",
color = "Year"
) +
theme_minimal() +
theme(
legend.position = "bottom",
panel.grid.minor = element_blank())
The Herfindahl-Hirschman Index (HHI) is widely used in industrial economics to measure market concentration, capturing the competition between firms through their relative sizes within an industry. It is calculated as the sum of the squared market shares of all players. It can also be borrowed to describe the concentration, or dispersion, of consumed items (TV shows: Dalla Torre et al. 2025; household consumption: Neiman and Vavra 2023).
The index is calculated using the following formula:
\[ HHI = \sum_{i=1}^{N} (MS_i)^2 \]
where \(MS_i\) is the market share of item \(i\) and \(N\) is the number of items (here, the share of track \(i\) in a given year’s total playcounts).
The results of the HHI calculation indicate a strong dispersion in the data: every year falls far below the threshold of 0.15 for the decimal HHI (1,500 HHI points) that is conventionally taken to mark a concentrated market. The inverse of the index, reported below as effective_n_tracks, gives the number of equally popular tracks that would produce the same HHI.
hhi_query <- "WITH yearly_totals AS (
SELECT
timestamp AS year,
SUM(total_playcounts) AS total_market_plays
FROM tracks_yearly_playcounts
GROUP BY timestamp
),
track_shares AS (
SELECT
t.timestamp AS year,
POWER(CAST(t.total_playcounts AS DOUBLE) / y.total_market_plays, 2) AS squared_share
FROM tracks_yearly_playcounts t
JOIN yearly_totals y ON t.timestamp = y.year
)
SELECT
year,
SUM(squared_share) AS hhi_decimal,
SUM(squared_share) * 10000 AS hhi_points,
1 / SUM(squared_share) AS effective_n_tracks
FROM track_shares
GROUP BY year
ORDER BY year;"
hhi_results <- dbGetQuery(con, hhi_query)
print(hhi_results, row.names = FALSE, digits = 6)
## year hhi_decimal hhi_points effective_n_tracks
## 2005 0.00002149102 0.2149102 46531.1
## 2006 0.00001304861 0.1304861 76636.5
## 2007 0.00001071718 0.1071718 93308.1
## 2008 0.00000830928 0.0830928 120347.4
## 2009 0.00000805122 0.0805122 124204.8
## 2010 0.00000737729 0.0737729 135551.2
## 2011 0.00000736464 0.0736464 135784.0
## 2012 0.00000804319 0.0804319 124328.7
## 2013 0.00000758308 0.0758308 131872.5
## 2014 0.00000656951 0.0656951 152218.3
## 2015 0.00000610178 0.0610178 163886.6
## 2016 0.00000587314 0.0587314 170266.6
## 2017 0.00000569391 0.0569391 175626.1
## 2018 0.00000538620 0.0538620 185659.7
## 2019 0.00000607047 0.0607047 164731.8
## 2020 0.00001125963 0.1125963 88812.9
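To make the scale of these values easier to interpret, the sketch below computes the decimal HHI and its inverse for two toy markets. The function `hhi` and both vectors are hypothetical and introduced here for illustration only.

```r
# Decimal HHI: sum of squared shares
hhi <- function(plays) {
  shares <- plays / sum(plays)
  sum(shares^2)
}

equal_market  <- rep(10, 4)       # four equally popular tracks
skewed_market <- c(97, 1, 1, 1)   # one dominant track

hhi(equal_market)       # 4 * 0.25^2 = 0.25
1 / hhi(equal_market)   # effective number of tracks = 4
hhi(skewed_market)      # 0.97^2 + 3 * 0.01^2 = 0.9412
```

With N equally popular items the index equals 1/N, so the yearly values in the table above correspond to markets that behave like tens or hundreds of thousands of equally popular tracks.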
A crucial implication of the long tail hypothesis that remains largely unstudied is the diversification of demand for cultural goods at the individual level. Anderson (2006) states that, as access to ever larger catalogues of cultural goods is democratized, consumers will select the content that best fits their individual tastes, possibly increasing the significance of niche products in the tails and leading to a dispersion of sales (in this case, plays). Bourreau et al. (2021) outline the mechanisms affecting the diversity of individual consumption, emphasizing the dual role of new digital tools. On the one hand, published rankings (Billboard Hot 100, Spotify Charts, etc.) steer users towards more popular titles, possibly amplifying the Superstar effect. On the other hand, recommendation algorithms deliver more adequate and personalized search results, effectively lowering search costs.
The long tail hypothesis is thus consistent with two distinct behavioral theories: (1) tail consumption is driven by a minority of niche-focused users, while the majority consume mostly popular titles; (2) most users are somewhat eclectic, combining popular and niche items in their own consumption.
Goel et al. (2010) find support for the latter hypothesis, stating that the observed eccentricity is higher than that implied by mass-behavior theories, such as McPhee’s (1963) double jeopardy of niche products (which are generally not known and, supposedly, not liked by those who do know them), yet still lower than random models would imply. The findings of Goel et al. (2010) suggest that music markets are characterized by the highest level of eccentricity among individuals.
To verify this hypothesis and examine the significance of the long tail at the individual level, a semi-random sample of 10,000 users is drawn from LFM-2b. The sample is random among users with non-missing demographic details, so the pool is restricted to 42,683 of the 120,322 users. The yearly consumption of these users is then analyzed in the same way as the aggregate: a ranking of tracks is computed for each user, and the tail of individual consumption is defined as all songs ranked below the user’s top 1,000.
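Subsequently, the relationship between the aggregate and individual tails is analyzed. The analysis presented below caps the head of each individual’s distribution at their top 1,000 tracks and uses the following metrics: the global tail share (the share of a user’s yearly plays that goes to tracks in the global tail, that is, tracks ranked below the top 10% of that year’s catalogue) and the global tail discovery ratio (the share of a user’s global-tail plays that comes from tracks in their personal tail).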
# merging individual rankings with global
dbExecute(con, "CREATE OR REPLACE TABLE individual_global_overlap AS
WITH user_track_data AS (
SELECT
m.user_id, m.track_id, m.playcount_yearly,
YEAR(CAST(m.timestamp AS DATE)) as year,
ROW_NUMBER() OVER(PARTITION BY m.user_id, YEAR(CAST(m.timestamp AS DATE))
ORDER BY m.playcount_yearly DESC) as personal_rank
FROM data_yearly m
WHERE m.user_id IN (SELECT user_id FROM target_ids)
)
SELECT
u.*,
r.global_rank,
r.catalog_size,
-- is the track in the global tail > 10%
(r.global_rank > (0.1 * r.catalog_size)) as is_global_tail
FROM user_track_data u
LEFT JOIN global_ranks r ON u.track_id = r.track_id AND u.year = r.year;")
## [1] 104324124
# calculating discovery as the rate of personal tail within the global tail
dbExecute(con, "CREATE OR REPLACE TABLE discovery_metrics AS
SELECT
user_id,
year,
SUM(CASE WHEN is_global_tail AND personal_rank > 1000 THEN playcount_yearly ELSE 0 END) * 1.0 /
NULLIF(SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END), 0) AS global_tail_discovery_ratio,
-- share of the global tail in personal consumption
SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END) * 1.0 / SUM(playcount_yearly) AS global_tail_share
FROM individual_global_overlap
GROUP BY 1, 2;")
## [1] 61642
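The two metrics computed in the SQL above can be traced by hand on a toy example. The sketch below uses a single hypothetical user-year with six tracks and a head size of 2 instead of 1,000; the data frame `toy`, its values and the `is_global_tail` flags are all assumptions made for illustration.

```r
# Toy data for one user-year; is_global_tail is assumed to be known already
toy <- data.frame(
  track_id       = 1:6,
  plays          = c(50, 30, 10, 5, 3, 2),
  is_global_tail = c(FALSE, TRUE, FALSE, TRUE, TRUE, TRUE)
)
head_size <- 2  # hypothetical cutoff standing in for the top 1000

# Personal rank by descending play count (ties broken by order of appearance)
toy$personal_rank <- rank(-toy$plays, ties.method = "first")

# Plays on global-tail tracks, overall and within the personal tail only
tail_plays <- sum(toy$plays[toy$is_global_tail])
tail_plays_in_personal_tail <-
  sum(toy$plays[toy$is_global_tail & toy$personal_rank > head_size])

global_tail_share <- tail_plays / sum(toy$plays)
discovery_ratio   <- tail_plays_in_personal_tail / tail_plays
c(global_tail_share = global_tail_share, discovery_ratio = discovery_ratio)
```

Here 40 of 100 plays go to global-tail tracks, and 10 of those 40 come from tracks outside the user's head, so a low discovery ratio means that niche tracks sit among the user's favourites rather than in their own tail.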
The average global tail share over the years is shown in the plot below. It suggests that the share of the global tail in individual consumption declined until 2012 and then rose gradually in each subsequent year. The values never fell below 15%, implying consistently significant niche consumption on average.
global_tail_share_summary <- dbGetQuery(con, "SELECT
year,
AVG(global_tail_share) as mean_global_tail_share,
AVG(global_tail_discovery_ratio) as mean_discovery_ratio
FROM discovery_metrics
WHERE year >= 2005 AND year <=2020
GROUP BY year
ORDER BY year;")
ggplot(global_tail_share_summary, aes(x = year, y = mean_global_tail_share)) +
geom_line(color = "darkgreen", linewidth = 1) +
geom_point(color = "darkorange", size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
labs(x = "Year", y = "Global Tail Share (%)", title = "Global Tail Share Over the Years", subtitle = "Average share of global tail in individual consumption") +
theme_minimal()
The plot below illustrates the annual changes in the discovery ratio (the share of a user’s global-tail plays that falls within their personal long tail). The ratio follows an upward tendency and reaches relatively high values, which indicates that a substantial share of global long tail items is consumed within users’ own long tails, being explored and discovered rather than consumed intensively.
ggplot(global_tail_share_summary, aes(x = year, y = mean_discovery_ratio)) +
geom_line(color = "purple", linewidth = 1.2) +
geom_point(color = "black", size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
labs(
title = "Discovery Ratio",
subtitle = "Share of individual tail within the global tail consumption",
x = "Year",
y = "Discovery Ratio (%)"
) +
theme_minimal()
The analysis in this section employs a different definition of the head and the tail. The top 100 approach is rooted in the composition of global charts, which most often encompass the top 100 songs, as well as in newer practices of analyzing personal online listening behavior, such as Spotify Wrapped, which each year delivers users a list of their top 100 tracks.
# yearly statistics for the selected users
dbExecute(con, "
CREATE OR REPLACE TABLE user_personal_metrics AS
WITH user_track_ranks AS (
SELECT
user_id,
track_id,
playcount_yearly,
YEAR(CAST(timestamp AS DATE)) as year,
ROW_NUMBER() OVER(PARTITION BY user_id, YEAR(CAST(timestamp AS DATE))
ORDER BY playcount_yearly DESC) as personal_rank
FROM data_yearly
WHERE user_id IN (SELECT user_id FROM target_ids)
)
SELECT
user_id,
year,
-- Personal Long Tail share (everything except the top 100)
SUM(CASE WHEN personal_rank > 100 THEN playcount_yearly ELSE 0 END) * 1.0 /
SUM(playcount_yearly) AS personal_tail_share,
SUM(playcount_yearly) AS total_user_plays
FROM user_track_ranks
GROUP BY 1, 2;
")
## [1] 61642
# aggregating yearly average
indiv_tails_summary_100 <- dbGetQuery(con, "
SELECT year, AVG(personal_tail_share) as mean_personal_tail_share
FROM user_personal_metrics
GROUP BY year ORDER BY year;")
cat("\n--- MEAN INDIVIDUAL TAIL SHARE BY YEAR ---\n")
##
## --- MEAN INDIVIDUAL TAIL SHARE BY YEAR ---
print(indiv_tails_summary_100, row.names = FALSE)
## year mean_personal_tail_share
## 1970 0.0000000
## 2005 0.5175638
## 2006 0.5397971
## 2007 0.5498866
## 2008 0.5677136
## 2009 0.5541268
## 2010 0.5582624
## 2011 0.5652173
## 2012 0.5973070
## 2013 0.5559723
## 2014 0.5482123
## 2015 0.5378088
## 2016 0.5482287
## 2017 0.5551845
## 2018 0.5657919
## 2019 0.6110878
## 2020 0.5311157
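The personal tail share depends directly on where the head is cut. The base R sketch below applies the same definition to a toy vector of yearly play counts with two different head sizes; the function `personal_tail_share` and the vector `toy_user` are hypothetical and introduced here for illustration.

```r
# Share of a user's plays that falls outside their top `head_size` tracks
personal_tail_share <- function(plays, head_size) {
  sorted <- sort(plays, decreasing = TRUE)
  in_tail <- seq_along(sorted) > head_size
  sum(sorted[in_tail]) / sum(sorted)
}

toy_user <- c(40, 30, 10, 10, 5, 5)  # toy yearly play counts, 100 in total

personal_tail_share(toy_user, head_size = 2)   # (10 + 10 + 5 + 5) / 100
personal_tail_share(toy_user, head_size = 10)  # library fits in the head, so 0
```

The second call shows why a user-year with no more tracks than the head size contributes a tail share of zero to the yearly average, as in the 1970 row of the table above.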
The plot below illustrates the share of individual tails, when the head of the tail is defined as the top 100 tracks.
library(dplyr) # filter() and the pipe
plot_data_100 <- indiv_tails_summary_100 %>%
filter(year >= 2005)
ggplot(plot_data_100, aes(x = year, y = mean_personal_tail_share)) +
geom_line(color = "darkblue", linewidth = 1) +
geom_point(color = "darkred", size = 3) +
scale_y_continuous(labels = percent_format(),
limits = c(0, max(plot_data_100$mean_personal_tail_share) * 1.1)) +
scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
labs(x = "Year", y = "Avg Individual Tail Share (%)", title = "Individual Tails Over the Years") +
theme_minimal()
# merging individual rankings with global
dbExecute(con, "CREATE OR REPLACE TABLE individual_global_overlap AS
WITH user_track_data AS (
SELECT
m.user_id, m.track_id, m.playcount_yearly,
YEAR(CAST(m.timestamp AS DATE)) as year,
ROW_NUMBER() OVER(PARTITION BY m.user_id, YEAR(CAST(m.timestamp AS DATE))
ORDER BY m.playcount_yearly DESC) as personal_rank
FROM data_yearly m
WHERE m.user_id IN (SELECT user_id FROM target_ids)
)
SELECT
u.*,
r.global_rank,
r.catalog_size,
-- is the track in the global tail > 10%
(r.global_rank > (0.1 * r.catalog_size)) as is_global_tail
FROM user_track_data u
LEFT JOIN global_ranks r ON u.track_id = r.track_id AND u.year = r.year;")
## [1] 104324124
# calculating discovery as the rate of personal tail within the global tail
dbExecute(con, "CREATE OR REPLACE TABLE discovery_metrics AS
SELECT
user_id,
year,
SUM(CASE WHEN is_global_tail AND personal_rank > 100 THEN playcount_yearly ELSE 0 END) * 1.0 /
NULLIF(SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END), 0) AS global_tail_discovery_ratio,
-- share of the global tail in personal consumption
SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END) * 1.0 / SUM(playcount_yearly) AS global_tail_share
FROM individual_global_overlap
GROUP BY 1, 2;")
## [1] 61642
The plot below illustrates the annual changes in the discovery ratio, implying that most global long tail items are consumed within users’ own long tails, being explored and discovered rather than consumed intensively. The discovery ratio mostly follows an upward trend, although a marked stagnation is visible in 2012-2018. Combined with the fact that the tail share was rising during those years, this would imply that the users who stayed in the dataset after the mass exodus that followed 2012 are the more focused niche users, whose long tail consumption is mostly contained in their top tracks. The substantial values of the discovery ratio support the second hypothesis: most users eclectically combine popular and niche items. The large difference between the discovery ratio computed with the tail defined as tracks outside the user’s top 100 and the one computed with tracks outside the top 1,000 indicates that aggregate long tail consumption is mostly contained in the middle of users’ personal distributions (between the 100th and the 1,000th track).
global_tail_share_summary_100 <- dbGetQuery(con, "SELECT
year,
AVG(global_tail_share) as mean_global_tail_share,
AVG(global_tail_discovery_ratio) as mean_discovery_ratio
FROM discovery_metrics
WHERE year >= 2005 AND year <=2020
GROUP BY year
ORDER BY year;")
ggplot(global_tail_share_summary_100, aes(x = year, y = mean_discovery_ratio)) +
geom_line(color = "purple", linewidth = 1.2) +
geom_point(color = "black", size = 3) +
scale_y_continuous(labels = scales::percent_format()) +
scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
labs(
title = "Discovery Ratio",
subtitle = "Share of individual tail within the global tail consumption",
x = "Year",
y = "Discovery Ratio (%)"
) +
theme_minimal()
To identify the drivers of long tail consumption at the individual level, an Ordinary Least Squares (OLS) regression was estimated with the Global Tail Share (the proportion of a user's listening devoted to the bottom 90% of the global catalog) as the dependent variable. The model accounts for demographic characteristics (age and gender), consumption intensity (the logarithm of total annual playcounts) and temporal period effects (year fixed effects).
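Written out, the specification described above takes roughly the following form. The symbols are notation introduced here for illustration, not the author's own: $S_{u,t}$ is the global tail share of user $u$ in year $t$, and the indicator sums run over the gender categories and years other than the omitted reference levels.

```latex
\[
S_{u,t} = \beta_0 + \beta_1\,\mathrm{age}_u
        + \sum_{g} \gamma_g\,\mathbf{1}[\mathrm{gender}_u = g]
        + \beta_2 \ln(\mathrm{plays}_{u,t})
        + \sum_{\tau} \delta_\tau\,\mathbf{1}[t = \tau]
        + \varepsilon_{u,t}
\]
```

Under this reading, $\beta_1$ is the change in the tail share associated with one additional year of age, $\gamma_g$ is the difference from the reference gender category, $\beta_2$ is the change associated with a one-unit increase in the natural logarithm of annual plays, and each $\delta_\tau$ is a year effect measured relative to the omitted baseline year.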
The regression results indicate that long tail consumption is significantly associated with a user’s demographic profile. The estimation code and its output are as follows:
# importing demographics and connecting them to the data
path_demo <- "/Users/gosia/Desktop/sampled_users.tsv"
dbExecute(con, sprintf("CREATE OR REPLACE TABLE demographics AS SELECT * FROM read_csv_auto('%s');", path_demo))
## [1] 10000
df_final <- dbGetQuery(con, "
SELECT s.*, d.age, d.gender, d.country
FROM discovery_metrics s
JOIN demographics d ON s.user_id = d.user_id;")
# linear regression with demographic information
model_demo <- lm(global_tail_share ~ age + gender + log(total_annual_plays) + as.factor(year),
data = dbGetQuery(con, "SELECT s.*, d.age, d.gender, u.total_user_plays as total_annual_plays
FROM discovery_metrics s
JOIN demographics d ON s.user_id = d.user_id
JOIN user_personal_metrics u ON s.user_id = u.user_id AND s.year = u.year"))
cat("\n--- DEMOGRAPHIC MODEL (GLOBAL TAIL SHARE) ---\n")
##
## --- DEMOGRAPHIC MODEL (GLOBAL TAIL SHARE) ---
print(summary(model_demo))
##
## Call:
## lm(formula = global_tail_share ~ age + gender + log(total_annual_plays) +
## as.factor(year), data = dbGetQuery(con, "SELECT s.*, d.age, d.gender, u.total_user_plays as total_annual_plays \n FROM discovery_metrics s \n JOIN demographics d ON s.user_id = d.user_id\n JOIN user_personal_metrics u ON s.user_id = u.user_id AND s.year = u.year"))
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -0.64031 -0.11517 -0.04046  0.07858  0.82189
##
## Coefficients:
##                            Estimate Std. Error t value             Pr(>|t|)
## (Intercept)              0.84363423 0.17729397   4.758           0.00000196 ***
## age                      0.00470197 0.00009449  49.759 < 0.0000000000000002 ***
## genderm                  0.03848193 0.00172407  22.320 < 0.0000000000000002 ***
## gendern                  0.03881655 0.00363793  10.670 < 0.0000000000000002 ***
## log(total_annual_plays) -0.01353101 0.00036869 -36.700 < 0.0000000000000002 ***
## as.factor(year)2005     -0.40166966 0.17802392  -2.256             0.024057 *
## as.factor(year)2006     -0.50903072 0.17753987  -2.867             0.004143 **
## as.factor(year)2007     -0.56764036 0.17740367  -3.200             0.001376 **
## as.factor(year)2008     -0.61749390 0.17735438  -3.482             0.000499 ***
## as.factor(year)2009     -0.64965674 0.17733074  -3.664             0.000249 ***
## as.factor(year)2010     -0.67434958 0.17731704  -3.803             0.000143 ***
## as.factor(year)2011     -0.69482576 0.17730880  -3.919           0.00008911 ***
## as.factor(year)2012     -0.71616291 0.17730552  -4.039           0.00005371 ***
## as.factor(year)2013     -0.70590311 0.17730475  -3.981           0.00006862 ***
## as.factor(year)2014     -0.69725612 0.17730706  -3.932           0.00008417 ***
## as.factor(year)2015     -0.67673528 0.17730977  -3.817             0.000135 ***
## as.factor(year)2016     -0.66268674 0.17731347  -3.737             0.000186 ***
## as.factor(year)2017     -0.65499628 0.17731684  -3.694             0.000221 ***
## as.factor(year)2018     -0.64627542 0.17732066  -3.645             0.000268 ***
## as.factor(year)2019     -0.64192560 0.17732834  -3.620             0.000295 ***
## as.factor(year)2020     -0.56990586 0.17732948  -3.214             0.001310 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1772 on 61621 degrees of freedom
## Multiple R-squared: 0.1257, Adjusted R-squared: 0.1254
## F-statistic: 442.9 on 20 and 61621 DF, p-value: < 0.00000000000000022
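To make the magnitudes concrete, the estimates above can be translated into changes in the global tail share, holding the other regressors fixed. This is a worked reading of the reported coefficients rather than an additional result; it assumes the natural logarithm (the default for `log()` in R) and treats the estimates as associations, not causal effects.

```latex
\begin{align*}
\text{10 additional years of age:} \quad & 10 \times 0.00470197 \approx 0.047 \\
\text{doubling annual plays:} \quad & -0.01353101 \times \ln 2 \approx -0.0094 \\
\text{tenfold increase in annual plays:} \quad & -0.01353101 \times \ln 10 \approx -0.031
\end{align*}
```

In percentage-point terms, a user ten years older is associated with a tail share about 4.7 points higher, a doubling of annual plays with a tail share about 0.9 points lower, and the `genderm` category with a tail share about 3.8 points above the omitted reference gender category.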
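The model supports the view that the digital era has given individuals the tools to tailor their consumption to their eclectic tastes, but this process is heavily moderated by age and gender: the coefficients on age and on both reported gender categories are positive and highly significant. Furthermore, the most active consumers are not necessarily the most diverse. The coefficient on log annual plays is negative, so heavier listeners devote a smaller share of their listening to the global tail, which suggests that the pull of the superstar effect remains a dominant force across the digital landscape. The model explains about 13% of the variance in the global tail share (R-squared of 0.1257), so most of the variation across users is left unexplained by these factors.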
Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.
Bourreau, M., Moreau, F., & Wikström, P. (2022). Does digitization lead to the homogenization of cultural content? Economic Inquiry, 60(1), 427-453.
Brynjolfsson, E., Hu, Y. J., & Smith, M. D. (2003). Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers. Management Science, 49(11), 1580–1596. http://www.jstor.org/stable/4134002
Dalla Torre, P., Fantozzi, P., & Naldi, M. (2025). Unraveling the long tail phenomenon. The Matter of Intellectual Property, 62.
Davies, C., Page, B., Driesener, C., Anesbury, Z., Yang, S., & Bruwer, J. (2022). The power of nostalgia: Age and preference for popular music. Marketing Letters, 33(4), 681–692.
Elberse, A. (2008). Should you invest in the long tail? Harvard Business Review, 86(7/8), 88.
Glevarec, H., Nowak, R., & Mahut, D. (2020). Tastes of our time: Analysing age cohort effects in the contemporary distribution of music tastes. Cultural Trends, 29(3), 182–198.
Goel, S., Broder, A., Gabrilovich, E., & Pang, B. (2010, February). Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (pp. 201-210).
Lesota, O., Melchiorre, A., Rekabsaz, N., Brandl, S., Kowald, D., Lex, E., & Schedl, M. (2021, September). Analyzing item popularity bias of music recommender systems: are different genders equally affected? In Proceedings of the 15th ACM conference on recommender systems (pp. 601-606).
Liebowitz, S., Ward, M., & Zentner, A. (2025). Only a “longish” tail. Production and Operations Management, 34(8), 2331-2347.
McPhee, W. N. (1963). Formal theories of mass behavior.
Neiman, B., & Vavra, J. (2023). The rise of niche consumption. American Economic Journal: Macroeconomics, 15(3), 224-264.
Schedl, M., Brandl, S., Lesota, O., Parada-Cabaleiro, E., Penz, D., & Rekabsaz, N. (2022, March). LFM-2b: A dataset of enriched music listening events for recommender systems research and fairness analysis. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval (pp. 337-341).
Waldfogel, J. (2017). How digitization has created a golden age of music, movies, books, and television. Journal of Economic Perspectives, 31(3), 195-214.