Music TAILoring: How Individual Habits Shape the Aggregate Long Tail in the Digital Era

Introduction

The digital transformation has overturned the physical restrictions of the 20th-century creative industries, substantially lowering the entry barriers and supply restrictions previously dictated by shelf-space constraints (Waldfogel 2017). Theories describing this new landscape of near-infinite supply have been shaped by the long tail phenomenon popularized by Anderson (2004, 2006), which suggests that the aggregate value of low-selling niche titles (whose emergence was made possible by the democratization of the means of production through digitization) can rival and even exceed that of mainstream blockbusters. While the supply side of this shift is well documented and widely studied, the underlying individual demand-side dynamics remain largely opaque. This study aims to bridge this gap by examining how individual listening habits (individual music tailoring) contribute to the broader aggregate trends.

Research Questions

RQ1: Has the share of long tail consumption grown over time?

Many studies following Anderson's (2006) exploration of the long tail phenomenon have disputed the significance of this hypothesis (Elberse 2008, Liebowitz and Zentner 2025, among others), finding a coexisting Superstar effect, in which a small fraction of titles captures an even larger share of total plays or sales, potentially depressing the sales of niche titles. By examining a 15-year period covering the rise of streaming services, the fall of piracy and the maturation of social music platforms, RQ1 tests whether the market has truly diversified or has instead succumbed to a blockbuster strategy.

RQ2: Are long tails significant at the individual level?

Traditional market-level analyses often overlook whether the emergence of the long tail results from a collective shift towards more diversified individual consumption patterns or from homogeneous clusters of consumers, each focused on its own niche. According to a hypothesis presented by Anderson (2006), the democratization of access allows consumers to tailor their consumption to the content that best fits their tastes.

The analysis presented in this study, based on the LFM-2b dataset (Schedl et al. 2022), illustrates an evolution of this tailoring process (see section Individual Tails), ultimately revealing a gradual shift towards more exploratory behavior, where users increasingly engage with the niche segments of their own libraries.

RQ3: Who are the listeners behind the long tail distributions?

Descriptive Statistics

The analysis begins by loading the data: the file that stores users’ aggregated yearly playcounts (read with the package “data.table” because of its large size).

The analysis in this project is based on LFM-2b, a dataset made available by Schedl et al. (2022) and currently accessible solely through the Wayback Machine, as the source platform revoked the license for publishing large datasets derived from its site. LFM-2b contains listening records of over 120,000 users of the music platform last.fm, amounting to more than 2 billion listening events (interactions between a distinct user and a specific song, enriched by its metadata) involving more than 50 million distinct tracks by over 5 million artists. Additionally, LFM-2b contains demographic information about users (age, gender, country) and additional track-level metadata, such as tags, microgenres and vector embeddings of lyrics.

This file contains 744,785,644 observations, each a tuple of <user_id, track_id, timestamp, playcount_yearly>.

library(data.table) # provides fread(), uniqueN() and year()
year <- fread("/Users/gosia/Desktop/tails/lastfm_data/yearly.tsv", sep = "\t")

To gain a deeper understanding of the data, the summary statistics are displayed in the table below.

These descriptive statistics include:

the number of active users,
the number of unique tracks,
the total playcount (aggregated sum of all listening events taking place in a given year),
the average number of tracks per user,
the average user playcount.

These metrics already shed light on the nature of consumption and on temporal changes in music listening habits.

# using package data.table for efficiency
# user statistics
user_yearly_stats <- year[, .(
    user_unique_tracks = uniqueN(track_id),
    user_total_playcount = sum(as.numeric(playcount_yearly))), 
  by = .(user_id, year = year(timestamp))]

tracks <- year[,.(
  n_tracks = uniqueN(track_id)),
  by =.(year = year(timestamp))]

# summary statistics
summary_stats <- user_yearly_stats[, .(
    n_users = .N,
    avg_tracks_per_user = mean(user_unique_tracks),
    avg_user_playcount = mean(user_total_playcount),
    total_playcount = sum(user_total_playcount)), 
  by = year][order(year)]

summary_stats <- merge(summary_stats, tracks, by = "year")[order(year)]

print(summary_stats)

## Key: <year>
##      year n_users avg_tracks_per_user avg_user_playcount total_playcount
##     <int>   <int>               <num>              <num>           <num>
##  1:  1970       2              1.0000              1.000               2
##  2:  2005     713           1019.4951           3015.400         2149980
##  3:  2006    2394           1122.1792           3352.736         8026451
##  4:  2007    5609           1183.9863           3492.770        19590949
##  5:  2008   10889           1178.3385           3255.910        35453602
##  6:  2009   21231           1081.7030           2915.909        61907659
##  7:  2010   34152           1092.6612           2949.087       100717236
##  8:  2011   61138           1014.4626           2715.807       166039028
##  9:  2012  114797           1078.4056           2785.362       319751175
## 10:  2013   84836           1221.7860           3223.327       273454128
## 11:  2014   59861           1506.1195           4408.282       263884146
## 12:  2015   43787           1560.1864           4424.597       193739838
## 13:  2016   33123           1633.9497           4393.827       145536732
## 14:  2017   27057           1837.5453           4856.936       131414118
## 15:  2018   22708           2059.4292           5592.733       126999777
## 16:  2019   18316           2645.2807           7378.372       135142263
## 17:  2020   15258            957.9756           1989.631        30357786
## 18:  2023       1              1.0000              1.000               1
## 19:  2033       1              1.0000              1.000               1
##     n_tracks
##        <int>
##  1:        2
##  2:   390610
##  3:  1045089
##  4:  2086446
##  5:  3316272
##  6:  4917837
##  7:  6894332
##  8:  9690020
##  9: 14987437
## 10: 14390889
## 11: 14276578
## 12: 12210916
## 13: 10636571
## 14:  9935709
## 15:  9497939
## 16:  9425045
## 17:  4082530
## 18:        1
## 19:        1

Another thing that quickly draws attention is the set of outlying observations with listening events dated to 1970, 2023 and 2033 (four records in total), most likely resulting from a UNIX time error. LFM-2b covers listening events between 14 February 2005 and 20 March 2020; all records outside this period are treated as faulty and eliminated from further analysis.
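One plausible reading of the 1970 records, offered here as an assumption rather than something the dataset documentation confirms, is a timestamp stored as zero and then converted from UNIX time. A minimal base-R sketch of that mechanism, together with a hypothetical helper `in_coverage()` that checks a timestamp against the coverage window stated above:

```r
# A UNIX timestamp counts seconds since 1970-01-01 00:00:00 UTC,
# so a value of 0 (e.g. a missing timestamp stored as zero) lands in 1970.
epoch_zero <- as.POSIXct(0, origin = "1970-01-01", tz = "UTC")
year_of_zero <- as.integer(format(epoch_zero, "%Y"))
print(year_of_zero)

# Hypothetical helper (not part of the original analysis): TRUE when a
# timestamp falls inside the LFM-2b coverage window, 2005-02-14 to 2020-03-20.
in_coverage <- function(ts) {
  ts >= as.POSIXct("2005-02-14", tz = "UTC") &
    ts < as.POSIXct("2020-03-21", tz = "UTC")
}
print(in_coverage(epoch_zero))
```

The 2023 and 2033 records cannot be explained by a zero timestamp; the sketch only covers the 1970 case.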

# excluding errors from summary_stats
summary_stats <- summary_stats[!summary_stats$year %in% c(1970, 2023, 2033)]

# excluding 1970, 2023, 2033 from the dataset (year)
year[, timestamp := year(timestamp)] # converting timestamp into only year

# identifying row indexes of errors
(error <- which(year$timestamp %in% c(1970, 2023, 2033)))

## [1]  48185922 228129312 297776742 586436779

# too large to remove through copying - overwriting NAs
year[error, `:=`(user_id = NA, track_id = NA, timestamp = NA, playcount_yearly = NA)]

Visualization

The illustration below shows the changes in the yearly user base. The number of users present in the dataset grew gradually until a rapid rise in 2012. The 2012 peak of 114,797 users was followed by a steady attrition of the user base, most abrupt in 2013, 2014 and 2015, with yearly losses of approximately 30,000, 25,000 and 16,000 users, respectively. The decline slowed afterwards, from roughly 10,700 users in 2016 to between 3,000 and 6,000 users per year from 2017 onward.

library(ggplot2) # plotting functions used throughout
ggplot(summary_stats, aes(x = year, y = n_users)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  labs(title = "Number of Users Over Time",
       x = "Year",
       y = "Users") +
  theme_minimal()
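The year-over-year changes in the user base can be read directly off the summary table. The short sketch below recomputes them from the `n_users` values printed above (copied by hand for 2012-2020, so it does not depend on the full dataset):

```r
# n_users values copied from the summary table above (2012-2020)
n_users <- c("2012" = 114797, "2013" = 84836, "2014" = 59861,
             "2015" = 43787, "2016" = 33123, "2017" = 27057,
             "2018" = 22708, "2019" = 18316, "2020" = 15258)

# diff() returns each year's value minus the previous year's value,
# so negative numbers are net losses of active users
yearly_change <- diff(n_users)
print(yearly_change)
```

With the full `summary_stats` table in memory, the same quantity could be obtained with `diff(summary_stats$n_users)`.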

The plot below visualizes the number of tracks that were being listened to in each year (blue bars and the y-axis on the left), as well as the average number of tracks played by each user (red line and the right y-axis).

# tracks over time + avg number of tracks per user
coeff <- max(summary_stats$n_tracks)/max(summary_stats$avg_tracks_per_user)
ggplot(summary_stats, aes(x = year)) +
  geom_col(aes(y = n_tracks), fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = avg_tracks_per_user * coeff), color = "darkred", size = 0.7) +
  geom_point(aes(y = avg_tracks_per_user * coeff), color = "darkred", size = 0.8) +
  scale_y_continuous(
    name = "Total Tracks",
    sec.axis = sec_axis(~./coeff, name = "Avg Tracks per User")
  ) +
  labs(title = "Total Tracks vs Avg Tracks per User", x = "Year") +
  theme_minimal()

The plot below illustrates the global trends of music consumption in the LFM-2b dataset. Global playcounts rose gradually until 2012, the year the user base peaked, and yearly consumption has declined since then. This decline in aggregate music consumption can be attributed mostly to user attrition, seeing as the red line, which shows average playcounts per user, has been rising since 2012, apart from a period of stagnation in 2014-2016. The year 2020 can be treated as an anomaly due to its limited coverage: the data collection ended in March 2020.

# total playcount vs avg playcount per user
coeff2 <- max(summary_stats$total_playcount)/max(summary_stats$avg_user_playcount)
ggplot(summary_stats, aes(x = year)) +
  geom_col(aes(y = total_playcount), fill = "steelblue", alpha = 0.7) +
  geom_line(aes(y = avg_user_playcount * coeff2), color = "darkred", size = 0.7) +
  geom_point(aes(y = avg_user_playcount * coeff2), color = "darkred", size = 0.8) +
  scale_y_continuous(
    name = "Aggregated Playcounts",
    sec.axis = sec_axis(~./coeff2, name = "Avg Playcounts per User")
  ) +
  labs(title = "Aggregated Playcounts vs Avg Playcounts per User", x = "Year") +
  theme_minimal()

Long Tail on Aggregate Level

Brynjolfsson et al. (2003), in an early claim supporting the long tail hypothesis in the Amazon book market, had no direct data on the number of books sold and therefore inferred the shape of the demand curve from the assumption of a power law relationship between an item's sales rank and its unit sales (meaning that the logarithm of sales rank and the logarithm of unit sales follow a straight line). The data at our disposal is more granular, providing insight not only into aggregate unit sales but also into individual consumption.
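To make the power law assumption concrete, the following sketch generates synthetic Zipf-like demand and shows that an ordinary least squares fit on the log-log scale recovers the exponent. The constants `C` and `alpha` are illustrative choices, not estimates from LFM-2b or from Brynjolfsson et al. (2003).

```r
# Synthetic power-law demand: plays = C * rank^(-alpha)
# (C and alpha are made-up illustrative values)
rank <- 1:10000
alpha <- 0.8
C <- 1e5
plays <- C * rank^(-alpha)

# Taking logs gives log10(plays) = log10(C) - alpha * log10(rank),
# i.e. a straight line whose slope is -alpha
fit <- lm(log10(plays) ~ log10(rank))
slope <- unname(coef(fit)[2])
print(slope)
```

Real rank data rarely lie exactly on such a line, which is what the log-log plots in the Ranking Distribution section examine.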

A quick look at the summary of the yearly aggregated track-level playcounts reveals a large dispersion in the total playcounts variable: the first quartile is still at one playcount, the median rises to two playcounts per track and the third quartile equals six, while the mean is 15.76 and the maximum number of playcounts accumulated by one track in one year amounts to 179,749. This finding already points towards a significant long tail in aggregated music consumption.

# preparing data - yearly playcounts for songs
tracks_yearly_playcounts <- year[, .(
    total_playcounts = sum(as.numeric(playcount_yearly))), 
  by = .(track_id, timestamp)]

summary(tracks_yearly_playcounts)

##     track_id          timestamp    total_playcounts   
##  Min.   :       0   Min.   :2005   Min.   :     1.00  
##  1st Qu.:13438200   1st Qu.:2012   1st Qu.:     1.00  
##  Median :25675888   Median :2014   Median :     2.00  
##  Mean   :25675706   Mean   :2014   Mean   :    15.76  
##  3rd Qu.:38018248   3rd Qu.:2017   3rd Qu.:     6.00  
##  Max.   :50813372   Max.   :2020   Max.   :179749.00  
##  NA's   :1          NA's   :1      NA's   :1

library(DBI) # provides dbConnect(), dbGetQuery() and dbExecute()
con <- dbConnect(duckdb::duckdb(), dbdir = "tails.duckdb")
duckdb::duckdb_register(con, "tracks_yearly_playcounts", tracks_yearly_playcounts)

Ranking Distribution

This section tests the power law (Zipf) distribution assumed above. The interactive log-log plots below make it possible to identify departures from that distribution.

The analysis of the log-log plots yields a conclusion similar to Liebowitz and Zentner's (2024) critique of Brynjolfsson et al.'s (2003) assumption of a power-law relationship between unit sales and sales rank. The relationship between the logarithm of rank and the logarithm of sales seems to follow a concave shape, suggesting that the long tail is less significant than previous studies found (smaller variety-induced welfare gains).

library(plotly) # plot_ly(), add_markers(), add_lines(), layout() and %>%
years <- 2005:2020
fig_zipf <- plot_ly()

for (yr in years) {
  query <- sprintf("
    WITH ranked_data AS (
        SELECT 
            total_playcounts,
            ROW_NUMBER() OVER(ORDER BY total_playcounts DESC) as rank
        FROM tracks_yearly_playcounts
        WHERE timestamp = %d
    )
    SELECT 
        log10(rank) as log_rank,
        log10(total_playcounts) as log_plays
    FROM ranked_data
    WHERE rank <= 1000000 -- limit to the top one million ranks for the log-log plot
    AND (
      rank <= 1000 OR 
      (rank <= 10000 AND rank %% 10 = 0) OR
      (rank <= 100000 AND rank %% 100 = 0) OR
      (rank %% 1000 = 0)
    )", yr)
  
  dt_log <- dbGetQuery(con, query)
  
  fit <- lm(log_plays ~ log_rank, data = dt_log)
  dt_log$fit_line <- predict(fit)
  
  is_visible <- (yr == 2005)
  
  fig_zipf <- fig_zipf %>% add_markers(
    data = dt_log, x = ~log_rank, y = ~log_plays,
    name = paste("Data", yr), visible = is_visible,
    marker = list(color = "midnightblue", opacity = 0.3, size = 4)
  )
  
  fig_zipf <- fig_zipf %>% add_lines(
    data = dt_log, x = ~log_rank, y = ~fit_line,
    name = paste("Slope", yr), visible = is_visible,
    line = list(color = "red", dash = "dash", width = 2)
  )
}

buttons <- lapply(seq_along(years), function(i) {
  vis_vector <- rep(FALSE, length(years) * 2)
  vis_vector[c(2*i - 1, 2*i)] <- TRUE 
  
  list(method = "restyle",
       args = list("visible", vis_vector),
       label = as.character(years[i]))
})

fig_zipf <- fig_zipf %>% layout(
  title = "<b>Log-Log Frequency Plot </b>",
  xaxis = list(title = "Log10(Rank)"),
  yaxis = list(title = "Log10(Playcounts)"),
  updatemenus = list(list(buttons = buttons, direction = "down", x = 0.1, y = 1.15)),
  showlegend = FALSE
) %>% toWebGL()

fig_zipf
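The concavity discussed above is judged visually from the plots. One way it could be quantified, sketched here on synthetic data rather than on LFM-2b, is to add a quadratic term to the log-log regression: a negative coefficient on the squared term indicates a concave curve, whereas a pure power law would give a coefficient of zero. The coefficients used to generate the toy curve are arbitrary.

```r
# Synthetic, deliberately concave log-log curve (arbitrary coefficients)
log_rank <- log10(1:10000)
log_plays <- 5 - 0.5 * log_rank - 0.1 * log_rank^2

# Pure power-law fit (straight line) versus a fit with a curvature term
fit_linear <- lm(log_plays ~ log_rank)
fit_quad <- lm(log_plays ~ log_rank + I(log_rank^2))

# A negative coefficient on the squared term signals concavity
curvature <- unname(coef(fit_quad)["I(log_rank^2)"])
print(curvature)
```

In the loop above, the same idea would amount to fitting `lm(log_plays ~ log_rank + I(log_rank^2), data = dt_log)` for each year and inspecting the sign of the last coefficient.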

Ranking Plot - From Heads to Tails

The interactive plots below reproduce an analysis similar to that in the introduction of Anderson’s (2006) study of the long tail. The first plot shows the playcounts of the top 25k tracks, ordered by their aggregated yearly popularity; the second covers ranks 25k to 100k; and the last shows the far end of the tail, ranks 100k to 800k. These figures illustrate that while a relatively small number of items accounts for a disproportionately large fraction of total consumption, the tail is nevertheless heavy. Individually, the items in the further sections of the tail are not popular, but in aggregate they represent a substantial fraction of the market.

years <- 2005:2020

fig1 <- plot_ly() # Head
fig2 <- plot_ly() # Middle
fig3 <- plot_ly() # Far Tail

for (yr in years) {
  # data from duckdb - with thinning
  query <- sprintf("WITH ranked_data AS (
        SELECT total_playcounts, ROW_NUMBER() OVER(ORDER BY total_playcounts DESC) as rank
        FROM tracks_yearly_playcounts WHERE timestamp = %d
    )
    SELECT rank, total_playcounts,
           CASE WHEN rank <= 25000 THEN 1 WHEN rank <= 100000 THEN 2 ELSE 3 END as segment
    FROM ranked_data
    WHERE rank <= 800000 AND (
      (rank <= 25000 AND rank %% 10 = 0) OR
      (rank > 25000 AND rank <= 100000 AND rank %% 50 = 0) OR
      (rank > 100000 AND rank %% 250 = 0))", yr)
  
  dt_yr <- dbGetQuery(con, query)
  is_visible <- (yr == 2005)
  fig1 <- fig1 %>% add_lines(data = dt_yr[dt_yr$segment == 1,], x = ~rank, y = ~total_playcounts, 
                             name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
  
  fig2 <- fig2 %>% add_lines(data = dt_yr[dt_yr$segment == 2,], x = ~rank, y = ~total_playcounts, 
                             name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
  
  fig3 <- fig3 %>% add_lines(data = dt_yr[dt_yr$segment == 3,], x = ~rank, y = ~total_playcounts, 
                             name = as.character(yr), visible = is_visible, line = list(color = 'steelblue'))
}

# merging plots into one subplot
final_fig <- subplot(fig1, fig2, fig3, nrows = 1, margin = 0.05, shareX = FALSE, shareY = FALSE) %>%
  toWebGL() 

# dropdown buttons
buttons <- lapply(seq_along(years), function(i) {

  vis_vector <- rep(FALSE, length(years) * 3)
  vis_vector[i] <- TRUE 
  vis_vector[i + length(years)] <- TRUE 
  vis_vector[i + 2 * length(years)] <- TRUE 
  list(method = "restyle",
       args = list("visible", vis_vector),
       label = as.character(years[i]))
})

final_fig <- final_fig %>% layout(
  title = list(text = "<b>Long Tail Analysis: Head, Middle, and Far Tail Trends</b>", y = 0.95),
  xaxis = list(title = "Rank (Head)"),
  xaxis2 = list(title = "Rank (Middle)"),
  xaxis3 = list(title = "Rank (Far Tail)"),
  yaxis = list(title = "Plays"),
  updatemenus = list(list(buttons = buttons, direction = "down", x = 0.05, y = 1.1)),
  margin = list(t = 100, b = 50, l = 50, r = 20),
  showlegend = FALSE
)

final_fig

Pareto Principle Verification

To verify whether the Pareto principle (20% of all products are responsible for 80% of all revenue from sales) can still be observed in the data post-digitization, the relative cumulative share curve is analyzed. The dashed gray lines mark the Pareto point, which the cumulative distributions meet to varying degrees: the curves for 2006, 2007 and 2020 (an anomaly) pass through the 20%-80% point, while for all other years, save for 2005, the top 20% of titles account for an even larger proportion of all consumption. Contrary to Anderson’s (2004, 2006) findings, a rather small proportion of the top tracks seems to be responsible for the majority of aggregated consumption. This, however, does not refute the long tail hypothesis, as a significant portion of the market is still not covered by even the top 50% of tracks.

# calculate the share of titles vs. the share of plays 
dbExecute(con, "CREATE OR REPLACE TABLE concentration_data AS
WITH yearly_stats AS (
    SELECT 
        timestamp,
        total_playcounts,
        SUM(total_playcounts) OVER(PARTITION BY timestamp) AS sum_total,
        COUNT(*) OVER(PARTITION BY timestamp) AS n_total,
        ROW_NUMBER() OVER(PARTITION BY timestamp ORDER BY total_playcounts DESC) AS rank
    FROM tracks_yearly_playcounts
    WHERE timestamp BETWEEN 2005 AND 2020
)
SELECT 
    timestamp,
    rank,
    CAST(rank AS DOUBLE) / n_total AS title_share,
    SUM(total_playcounts) OVER(
        PARTITION BY timestamp 
        ORDER BY total_playcounts DESC 
        ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
    ) / sum_total AS play_share
FROM yearly_stats;")

## [1] 127784220

# retrieving
plot_query <- "SELECT * FROM concentration_data 
WHERE title_share <= 0.5 
  AND (rank <= 1000 OR rank % 100 = 0)"
concentration_plot_df <- dbGetQuery(con, plot_query)

library(scales) # percent() and label_percent() axis labellers
ggplot(concentration_plot_df, aes(x = title_share, y = play_share, color = as.factor(timestamp))) +
  geom_line(size = 0.5) +
  # Pareto 
  geom_hline(yintercept = 0.80, linetype = "dashed", color = "gray", size = 0.3) +
  geom_vline(xintercept = 0.20, linetype = "dashed", color = "gray", size = 0.3) +
  coord_cartesian(xlim = c(0, 0.5), ylim = c(0, 1)) +
  scale_x_continuous(labels = percent) +
  scale_y_continuous(labels = percent) +
  labs(title = "Cumulative Share Concentration (2005-2020)",
       x = "Top % of Titles", y = "Cumulative % of Plays") +
  theme_minimal()
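The cumulative share curve computed in SQL above can be illustrated with a few lines of base R. The playcounts below are made-up numbers for ten tracks, chosen only to show how `title_share` and `play_share` are derived and how the top-20% share is read off.

```r
# Toy yearly playcounts for ten tracks (made-up numbers)
plays <- c(500, 200, 100, 60, 40, 30, 25, 20, 15, 10)

# Rank tracks from most to least played
sorted <- sort(plays, decreasing = TRUE)

# x-axis: share of titles covered up to each rank
title_share <- seq_along(sorted) / length(sorted)
# y-axis: cumulative share of plays captured up to each rank
play_share <- cumsum(sorted) / sum(sorted)

# Share of plays captured by the top 20% of titles
top20_share <- play_share[which(title_share >= 0.2)[1]]
print(top20_share)
```

In this toy catalogue the top 20% of titles falls short of the 80% Pareto mark, so its curve would pass below the intersection of the dashed lines.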

Top Charts - the Absolute Cut-Off

Building on Anderson’s characterization of the long tail as that which exists thanks to digitization, the share of plays captured up to a given absolute rank is analyzed. Anderson (2006) and Goel et al. (2010) put the supply of an average brick-and-mortar music shop at 55 thousand and 50 thousand tracks, respectively. The benefits of the supply expansion attributed to digitization seem plain to see: the long tail defined by an absolute cutoff of 50k exceeds 40% of the playcount share in all years except 2005.

plot_query_absolute <- "SELECT timestamp, rank, play_share 
FROM concentration_data 
WHERE rank <= 60000 
  AND (rank <= 1000 OR rank % 50 = 0)"

concentration_abs_df <- dbGetQuery(con, plot_query_absolute)

ggplot(concentration_abs_df, aes(x = rank / 1000, y = play_share, color = as.factor(timestamp))) +
  geom_line(size = 0.5) +
  coord_cartesian(xlim = c(0, 60), ylim = c(0, 1)) +
  scale_x_continuous(breaks = seq(0, 60, 20)) + 
  scale_y_continuous(labels = label_percent(), breaks = seq(0, 1, 0.2)) +
  labs(title = "Share of Playcounts Using Absolute Cutoffs",
    x = "Rank of Title (1,000s)",
    y = "Percent of Playcounts",
    color = "Year"
  ) +
  theme_minimal() +
  theme(
    legend.position = "bottom",
    panel.grid.minor = element_blank())

Hirschman-Herfindahl Index

HHI is an index widely used in industrial economics to illustrate competition between firms by measuring their sizes relative to their industry. It is calculated as the sum of the squared market shares of all players. However, it can also be borrowed to illustrate the concentration, or dispersion, of consumed items (TV shows: Dalla Torre et al. 2025, household consumption: Neiman and Vavra 2023).

The Hirschman-Herfindahl Index is calculated using the following formula:

\[ HHI = \sum_{i=1}^{N} (MS_i)^2 \]

Where \(MS_i\) is the market share of item \(i\) and \(N\) is the number of items.

The HHI results indicate a strong dispersion in the data: every year falls far below 0.15 on the decimal scale (1,500 HHI points), a threshold commonly taken to mark the onset of market concentration.
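As a worked illustration of the formula, the sketch below computes the HHI and its reciprocal (the effective number of tracks) for toy playcounts. The helper `hhi()` and the numbers are hypothetical and serve only to show the arithmetic.

```r
# Hypothetical helper: HHI from a vector of playcounts
hhi <- function(plays) {
  shares <- plays / sum(plays)  # market share MS_i of each item
  sum(shares^2)                 # sum of squared shares
}

# Three tracks with shares 0.5, 0.3 and 0.2:
# HHI = 0.25 + 0.09 + 0.04 = 0.38
toy_plays <- c(50, 30, 20)
hhi_toy <- hhi(toy_plays)
effective_n <- 1 / hhi_toy
print(c(hhi = hhi_toy, effective_n = effective_n))

# With N equally popular tracks the HHI equals 1/N,
# so the effective number of tracks equals N
hhi_equal <- hhi(rep(1, 4))
print(hhi_equal)
```

The reciprocal reads as the number of equally popular tracks that would produce the same concentration, which is how the `effective_n_tracks` column in the query below can be interpreted.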

hhi_query <- "WITH yearly_totals AS (
    SELECT 
        timestamp AS year,
        SUM(total_playcounts) AS total_market_plays
    FROM tracks_yearly_playcounts
    GROUP BY timestamp
),
track_shares AS (
    SELECT 
        t.timestamp AS year,
        POWER(CAST(t.total_playcounts AS DOUBLE) / y.total_market_plays, 2) AS squared_share
    FROM tracks_yearly_playcounts t
    JOIN yearly_totals y ON t.timestamp = y.year
)
SELECT 
    year,
    SUM(squared_share) AS hhi_decimal,
    SUM(squared_share) * 10000 AS hhi_points,
    1 / SUM(squared_share) AS effective_n_tracks
FROM track_shares
GROUP BY year
ORDER BY year;"

hhi_results <- dbGetQuery(con, hhi_query)

print(hhi_results, row.names = FALSE, digits = 6)

##  year   hhi_decimal hhi_points effective_n_tracks
##  2005 0.00002149102  0.2149102            46531.1
##  2006 0.00001304861  0.1304861            76636.5
##  2007 0.00001071718  0.1071718            93308.1
##  2008 0.00000830928  0.0830928           120347.4
##  2009 0.00000805122  0.0805122           124204.8
##  2010 0.00000737729  0.0737729           135551.2
##  2011 0.00000736464  0.0736464           135784.0
##  2012 0.00000804319  0.0804319           124328.7
##  2013 0.00000758308  0.0758308           131872.5
##  2014 0.00000656951  0.0656951           152218.3
##  2015 0.00000610178  0.0610178           163886.6
##  2016 0.00000587314  0.0587314           170266.6
##  2017 0.00000569391  0.0569391           175626.1
##  2018 0.00000538620  0.0538620           185659.7
##  2019 0.00000607047  0.0607047           164731.8
##  2020 0.00001125963  0.1125963            88812.9

Long Tail Share Over the Years

Following Anderson’s (2006) approach, which defines the long tail as that which exists thanks to digitization, an absolute cutoff is employed, capping the head at 50k tracks (Goel et al. 2010) and thereby reflecting the strict supply limits previously imposed by brick-and-mortar stores. Additionally, 1% and 10% relative cutoffs are analyzed, because the supply of tracks changes each year: while empirically the supply rises invariably, the number of tracks in this dataset displays a declining tendency since 2012 (the year from which LFM-2b loses users). The 1% cutoff often oscillates around 100k tracks.
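The difference between the absolute and relative definitions can be sketched on toy data. The helper `tail_share()` is hypothetical and the playcounts are made up; the head sizes are scaled down so that a ten-track catalogue has a tail at all.

```r
# Hypothetical helper: share of plays outside the top `head_size` tracks
tail_share <- function(plays, head_size) {
  sorted <- sort(plays, decreasing = TRUE)
  if (head_size >= length(sorted)) return(0)  # no tail left
  sum(sorted[(head_size + 1):length(sorted)]) / sum(sorted)
}

# Toy yearly playcounts for ten tracks (made-up numbers)
plays <- c(500, 200, 100, 60, 40, 30, 25, 20, 15, 10)

# Absolute cutoff: the head is a fixed number of tracks (2 here, 50k in the text)
abs_share <- tail_share(plays, head_size = 2)

# Relative cutoff: the head is the top 10% of the catalogue
rel_share <- tail_share(plays, head_size = floor(0.1 * length(plays)))

print(c(absolute = abs_share, relative = rel_share))
```

Because the relative head grows and shrinks with the catalogue while the absolute head does not, the two definitions can move in opposite directions when the number of tracks changes from year to year.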

# tail shares 
tail_trends <- dbGetQuery(con, "WITH yearly_metrics AS (
    SELECT 
        timestamp,
        total_playcounts,
        ROW_NUMBER() OVER(PARTITION BY timestamp ORDER BY total_playcounts DESC) as rank,
        COUNT(*) OVER(PARTITION BY timestamp) as catalog_size,
        SUM(total_playcounts) OVER(PARTITION BY timestamp) as total_plays
    FROM tracks_yearly_playcounts
)
SELECT 
    timestamp AS year,
    SUM(CASE WHEN rank > 50000 THEN total_playcounts ELSE 0 END) / total_plays AS absolute_tail_share,
    SUM(CASE WHEN rank > (0.01 * catalog_size) THEN total_playcounts ELSE 0 END) / total_plays AS relative_tail_001_share,
    SUM(CASE WHEN rank > (0.1 * catalog_size) THEN total_playcounts ELSE 0 END) / total_plays AS relative_tail_01_share
FROM yearly_metrics
GROUP BY timestamp, total_plays, catalog_size
ORDER BY timestamp;")

library(tidyr) # pivot_longer(); also re-exports the %>% pipe
tail_trends_long <- tail_trends %>%
  pivot_longer(cols = ends_with("share"), names_to = "Definition", values_to = "Share")

# dual-trend
ggplot(tail_trends_long, aes(x = year, y = Share, color = Definition, group = Definition)) +
  geom_line(size = 1.2) +
  geom_point(size = 3) +
  scale_y_continuous(labels = scales::percent, limits = c(0, NA)) +
  scale_color_manual(
    values = c("absolute_tail_share" = "darkblue", "relative_tail_001_share" = "darkorange", "relative_tail_01_share" = "darkred"),
    labels = c("Absolute Tail (Rank > 50k)", "Relative Tail (Bottom 99%)", "Relative Tail (Bottom 90%)")
  ) +
  labs(title = "Evolution of the Long Tail Importance (2005-2020)",
    subtitle = "Comparing Absolute vs. Relative (Market Share) Definitions",
    x = "Year",
    y = "Share of Total Playcounts",
    color = "Tail Definition") +
  theme_minimal() +
  theme(legend.position = "bottom")

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_line()`).

## Warning: Removed 3 rows containing missing values or values outside the scale range
## (`geom_point()`).

The long tail’s significance appears to be highly dependent on how it is measured. The tail share reaches its highest values under the absolute cutoff definition, which captures the consumption of products that likely could not have been stocked in a physical brick-and-mortar store due to limited shelf space; its high share suggests that a significant portion of music consumption in the digital era is dedicated to titles that were previously gatekept. The values have been nearly constant since 2010, at approximately 60% of total playcounts. In contrast, when the long tail is defined by a relative cutoff, its importance appears much lower and shows a downward trend after 2012, which could potentially be attributed to the falling number of tracks in the data or to the rising importance of the superstar effect with the implementation of recommender systems.

The absolute cutoff capped at 50k would suggest that consumers are successfully using digital tools to tailor their consumption towards niches. The importance of the tail thus seems to be largely a matter of definition; however, the divergence could also be an artifact of LFM-2b, attributable to the attrition of users after 2012. The ever-growing supply of tracks justifies the use of a relative tail cutoff, capped at 10%.

share_columns <- c("absolute_tail_share", "relative_tail_001_share", "relative_tail_01_share")

regression_results <- list()

for(col in share_columns) {
  formula_str <- paste(col, "~ year")
  fit <- lm(as.formula(formula_str), data = tail_trends)
  s_fit <- summary(fit)
  regression_results[[col]] <- data.frame(
    Definition = col,
    Average_Annual_Slope = coef(fit)["year"],
    P_Value = s_fit$coefficients["year", 4],
    R_Squared = s_fit$r.squared
  )
}
final_regression_table <- do.call(rbind, regression_results)

cat("\n--- REGRESSION RESULTS: SHARE ~ YEAR (2005-2020) ---\n")

## 
## --- REGRESSION RESULTS: SHARE ~ YEAR (2005-2020) ---

print(final_regression_table, row.names = FALSE)

##               Definition Average_Annual_Slope       P_Value R_Squared
##      absolute_tail_share          0.013999225 0.00007211493 0.6871761
##  relative_tail_001_share         -0.008988679 0.06362876774 0.2246611
##   relative_tail_01_share         -0.007674340 0.04218017644 0.2630828

Individual Tails

A crucial implication of the long tail hypothesis that remains largely unstudied is the diversification of the demand side of cultural goods at the individual level. Anderson (2006) states that, with the democratization of access to growing catalogues of cultural goods, consumers will select the content that best fits their individual tastes, possibly increasing the significance of niche products belonging to the tails and leading to a dispersion of sales (in this case, plays). Borreau et al. (2021) outline the mechanisms affecting the diversity of individual consumption, emphasizing the dual role of new digital tools: they steer users towards more popular titles through the publication of various rankings (Billboard Hot 100, Spotify Charts, etc.), possibly amplifying the Superstar effect, while also facilitating more adequate and personalized search results through recommendation algorithms, which effectively lower search costs.

The long tail hypothesis is thus consistent with two distinct behavioral theories:

H1: the majority of consumers follow the crowds, consuming mostly popular goods, while only a minority is interested in niche content.
H2: individual tastes are themselves diversified, and most consumers engage with both popular and niche products.

Goel et al. (2010) find support for the latter hypothesis, stating that the observed eccentricity, while higher than that implied by mass-behavior theories such as McPhee's (1963) double jeopardy of niche products (which are generally not known and, supposedly, not liked by those who do know them), is still lower than random models would imply. The findings of Goel et al. (2010) would suggest that music markets are characterized by the highest level of eccentricity among individuals.

To verify this hypothesis and examine the significance of the long tail at the individual level, a semi-random sample of 10,000 users is drawn from LFM-2b: the sample is random, but the pool is restricted to the 42,683 users with non-missing demographic details rather than the full 120,322 users. The yearly consumption of these users is then analyzed in a similar way to that of the aggregate: a ranking of tracks is computed for each user, and the tail of individual consumption is defined as all songs ranked below the top 1000.

Subsequently, the relationship between the aggregate and individual tails is analyzed.
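The individual tail share described above can be sketched in base R on toy data. The data frame, the helper `personal_tail_share()` and the head size of 2 are all illustrative assumptions; the analysis itself uses a head of 1000 tracks and runs in DuckDB.

```r
# Toy listening data (made-up values): two users in a single year
listens <- data.frame(
  user_id = c(1, 1, 1, 1, 2, 2, 2),
  track_id = c(10, 11, 12, 13, 10, 20, 21),
  playcount_yearly = c(40, 30, 20, 10, 90, 5, 5)
)

# Hypothetical helper: share of a user's plays outside their top `head_size` tracks
personal_tail_share <- function(playcounts, head_size) {
  ranked <- sort(playcounts, decreasing = TRUE)
  tail_plays <- if (length(ranked) > head_size) {
    sum(ranked[-seq_len(head_size)])
  } else {
    0  # fewer tracks than the head size means no personal tail
  }
  tail_plays / sum(ranked)
}

# One tail share per user; head_size = 2 so the toy data has a tail
shares <- tapply(listens$playcount_yearly, listens$user_id,
                 personal_tail_share, head_size = 2)
print(shares)
```

User 1 spreads plays fairly evenly and has a larger personal tail, while user 2 concentrates on a single track; averaging such shares over users gives a per-year summary of the kind reported in the next section.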

Head as TOP 1000

The analysis presented below caps the head at the top 1000 tracks for each individual.

Individual Tail Share

# loading yearly data into duckdb 
path <- "/Users/gosia/Desktop/tails/lastfm_data/yearly.tsv"

dbExecute(con, sprintf("CREATE OR REPLACE VIEW data_yearly AS 
  SELECT * FROM read_csv_auto('%s');", path))

## [1] 0

# sampled user ids
path_ids <- "/Users/gosia/Desktop/sampled_user_ids.txt"
dbExecute(con, sprintf("CREATE OR REPLACE VIEW target_ids AS 
  SELECT column0 AS user_id FROM read_csv_auto('%s', header=False);", path_ids))

## [1] 0

# yearly statistics for the selected users
dbExecute(con, "
CREATE OR REPLACE TABLE user_personal_metrics AS
WITH user_track_ranks AS (
    SELECT 
        user_id, 
        track_id, 
        playcount_yearly,
        YEAR(CAST(timestamp AS DATE)) as year,
        ROW_NUMBER() OVER(PARTITION BY user_id, YEAR(CAST(timestamp AS DATE)) 
                          ORDER BY playcount_yearly DESC) as personal_rank
    FROM data_yearly
    WHERE user_id IN (SELECT user_id FROM target_ids)
)
SELECT 
    user_id, 
    year,
    -- Personal Long Tail share (everything except the top 1000)
    SUM(CASE WHEN personal_rank > 1000 THEN playcount_yearly ELSE 0 END) * 1.0 / 
    SUM(playcount_yearly) AS personal_tail_share,
    SUM(playcount_yearly) AS total_user_plays
FROM user_track_ranks
GROUP BY 1, 2;
")

## [1] 61642

# aggregating yearly average
indiv_tails_summary <- dbGetQuery(con, "
SELECT year, AVG(personal_tail_share) as mean_personal_tail_share
FROM user_personal_metrics
GROUP BY year ORDER BY year;")

cat("\n---  MEAN INDIVIDUAL TAIL SHARE BY YEAR ---\n")

## 
## ---  MEAN INDIVIDUAL TAIL SHARE BY YEAR ---

print(indiv_tails_summary, row.names = FALSE)

##  year mean_personal_tail_share
##  1970               0.00000000
##  2005               0.08693373
##  2006               0.11013774
##  2007               0.11782978
##  2008               0.12069150
##  2009               0.12470523
##  2010               0.11816379
##  2011               0.11612445
##  2012               0.12699289
##  2013               0.13790649
##  2014               0.14643900
##  2015               0.14018413
##  2016               0.14977465
##  2017               0.16011830
##  2018               0.16976310
##  2019               0.21603481
##  2020               0.08581187

The plot below shows the average individual tail share over the years. The upward tendency suggests growing personal consumption of tracks from outside the user’s top 1000, from approx. 9% of plays in 2005 to 22% in 2019. The 2020 value (approx. 9%) breaks this pattern, which may reflect incomplete coverage of that year rather than a change in behavior. This finding would support the second hypothesis: consumers engage with both popular and niche products, exploring the nearly unlimited supply of tracks.

plot_data <- indiv_tails_summary %>%
  filter(year >=2005)
ggplot(plot_data, aes(x = year, y = mean_personal_tail_share)) +
  geom_line(color = "darkblue", linewidth = 1) +
  geom_point(color = "darkred", size = 3) +
  scale_y_continuous(labels = percent_format(), 
                     limits = c(0, max(plot_data$mean_personal_tail_share) * 1.1)) +
  scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
  labs(x = "Year", y = "Avg Individual Tail Share (%)", title = "Individual Tails Over the Years") +
  theme_minimal()
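The upward tendency can also be summarized numerically. The sketch below fits a simple linear trend to the yearly means printed in the table above for 2005-2019. Excluding 1970 and 2020 and using a linear specification are assumptions made for illustration, not part of the original analysis.

```r
# Yearly mean individual tail shares copied from the table above (2005-2019)
indiv_tail <- data.frame(
  year  = 2005:2019,
  share = c(0.08693373, 0.11013774, 0.11782978, 0.12069150, 0.12470523,
            0.11816379, 0.11612445, 0.12699289, 0.13790649, 0.14643900,
            0.14018413, 0.14977465, 0.16011830, 0.16976310, 0.21603481)
)

# Linear trend: the slope is the average yearly change in the tail share
trend <- lm(share ~ year, data = indiv_tail)

# Express the slope in percentage points per year
slope_pp <- coef(trend)[["year"]] * 100
slope_pp
```

A positive slope is consistent with the rise visible in the plot, although with only 15 yearly averages it is a descriptive summary rather than a formal test.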

Global vs. Individual Tails

The relationship between the individual and aggregated tails is analyzed with the following metrics:

Global Tail Discovery Ratio – the share of a user’s global-tail plays that fall within the user’s own tail (tracks outside of their top 1000), calculated as the playcounts of tracks belonging simultaneously to the user’s tail and the global tail divided by the playcounts of all global-tail tracks the user played; this metric captures the user’s discovery or exploration,
Global Tail Share – the share of the global tail in all of the user’s consumption, calculated as the playcounts of globally niche tracks (those belonging to the global long tail) divided by all of the user’s playcounts in a given year.
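Written out, the two metrics take the following form. The notation is introduced here for illustration only and is an assumption, while the formulas follow the SQL used in this section: p_{u,t,y} denotes the plays of track t by user u in year y, G_y the global tail in year y, and T_{u,y} the user’s personal tail (tracks ranked below the head cut-off). All sums run over the tracks the user played in that year.

```latex
\[
\text{DiscoveryRatio}_{u,y}
  = \frac{\sum_{t \in G_y \cap T_{u,y}} p_{u,t,y}}{\sum_{t \in G_y} p_{u,t,y}},
\qquad
\text{GlobalTailShare}_{u,y}
  = \frac{\sum_{t \in G_y} p_{u,t,y}}{\sum_{t} p_{u,t,y}}
\]
```

The discovery ratio is undefined for a user-year with no global-tail plays, which the SQL handles by returning NULL through NULLIF.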

# merging individual rankings with global 
dbExecute(con, "CREATE OR REPLACE TABLE individual_global_overlap AS
WITH user_track_data AS (
    SELECT 
        m.user_id, m.track_id, m.playcount_yearly,
        YEAR(CAST(m.timestamp AS DATE)) as year,
        ROW_NUMBER() OVER(PARTITION BY m.user_id, YEAR(CAST(m.timestamp AS DATE)) 
                          ORDER BY m.playcount_yearly DESC) as personal_rank
    FROM data_yearly m
    WHERE m.user_id IN (SELECT user_id FROM target_ids)
)
SELECT 
    u.*,
    r.global_rank,
    r.catalog_size,
    -- TRUE when the track ranks below the top 10% of the yearly catalogue (global tail)
    (r.global_rank > (0.1 * r.catalog_size)) as is_global_tail
FROM user_track_data u
LEFT JOIN global_ranks r ON u.track_id = r.track_id AND u.year = r.year;")

## [1] 104324124

# calculating discovery as the rate of personal tail within the global tail  
dbExecute(con, "CREATE OR REPLACE TABLE discovery_metrics AS
SELECT 
    user_id, 
    year,
    SUM(CASE WHEN is_global_tail AND personal_rank > 1000 THEN playcount_yearly ELSE 0 END) * 1.0 / 
    NULLIF(SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END), 0) AS global_tail_discovery_ratio,
    -- share of the global tail in personal consumption
    SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END) * 1.0 / SUM(playcount_yearly) AS global_tail_share
FROM individual_global_overlap
GROUP BY 1, 2;")

## [1] 61642
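To make the two metrics concrete, the sketch below recomputes them in base R for a single made-up user-year. The toy table, the `is_global_tail` flags and the head size of 2 (standing in for the 1000-track cut-off) are assumptions for illustration only.

```r
# Toy user-year: five tracks, 100 plays in total (hypothetical data)
toy <- data.frame(
  track_id         = c("a", "b", "c", "d", "e"),
  playcount_yearly = c(40, 30, 15, 10, 5),
  is_global_tail   = c(FALSE, TRUE, FALSE, TRUE, TRUE)
)
head_size <- 2  # stands in for the 1000-track cut-off

# Personal rank by descending playcount, as in the SQL window function
toy <- toy[order(-toy$playcount_yearly), ]
toy$personal_rank <- seq_len(nrow(toy))
in_personal_tail <- toy$personal_rank > head_size

# Plays of globally niche tracks, overall and within the personal tail
global_tail_plays <- sum(toy$playcount_yearly[toy$is_global_tail])
both_tails_plays  <- sum(toy$playcount_yearly[toy$is_global_tail & in_personal_tail])

global_tail_share <- global_tail_plays / sum(toy$playcount_yearly)  # 45 / 100
discovery_ratio   <- if (global_tail_plays > 0) {
  both_tails_plays / global_tail_plays                              # 15 / 45
} else {
  NA_real_  # mirrors NULLIF(..., 0) in the SQL
}
```

In this toy case the user devotes 45% of plays to globally niche tracks, and a third of those plays fall outside the user’s own head.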

The average global tail share over the years is shown in the plot below. It suggests that the share of the global tail in individual consumption declined until 2012 and then rose gradually in each subsequent year. The values never fell below 15%, implying a consistently significant level of niche consumption on average.

global_tail_share_summary <- dbGetQuery(con, "SELECT
                                        year, 
                                        AVG(global_tail_share) as mean_global_tail_share, 
                                        AVG(global_tail_discovery_ratio) as mean_discovery_ratio 
                                        FROM discovery_metrics
                                        WHERE year >= 2005 AND year <=2020
                                        GROUP BY year
                                        ORDER BY year;")
ggplot(global_tail_share_summary, aes(x = year, y = mean_global_tail_share)) +
  geom_line(color = "darkgreen", linewidth = 1) +
  geom_point(color = "darkorange", size = 3) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
  labs(x = "Year", y = "Global Tail Share (%)", title = "Global Tail Share Over the Years", subtitle = "Average share of global tail in individual consumption") +
  theme_minimal()

The plot below illustrates the annual changes in the discovery ratio (the share of a user’s global-tail plays that fall within the user’s own tail). The ratio follows an upward tendency and reaches relatively high values, which would indicate that a substantial share of global long tail items is consumed in the users’ own long tail, being explored and discovered rather than consumed intensively.

ggplot(global_tail_share_summary, aes(x = year, y = mean_discovery_ratio)) +
  geom_line(color = "purple", linewidth = 1.2) +
  geom_point(color = "black", size = 3) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
  labs(
    title = "Discovery Ratio",
    subtitle = "Share of individual tail within the global tail consumption",
    x = "Year",
    y = "Discovery Ratio (%)"
  ) +
  theme_minimal()

Head as TOP 100

The analysis performed in this section employs a different definition of the head and the tail. The top 100 approach is rooted in the composition of global charts, which most often encompass the top 100 songs, as well as in newer practices of summarizing personal listening behavior, such as Spotify Wrapped, which each year delivers users a list of their top 100 tracks.

# yearly statistics for the selected users
dbExecute(con, "
CREATE OR REPLACE TABLE user_personal_metrics AS
WITH user_track_ranks AS (
    SELECT 
        user_id, 
        track_id, 
        playcount_yearly,
        YEAR(CAST(timestamp AS DATE)) as year,
        ROW_NUMBER() OVER(PARTITION BY user_id, YEAR(CAST(timestamp AS DATE)) 
                          ORDER BY playcount_yearly DESC) as personal_rank
    FROM data_yearly
    WHERE user_id IN (SELECT user_id FROM target_ids)
)
SELECT 
    user_id, 
    year,
    -- Personal Long Tail share (everything except the top 100)
    SUM(CASE WHEN personal_rank > 100 THEN playcount_yearly ELSE 0 END) * 1.0 / 
    SUM(playcount_yearly) AS personal_tail_share,
    SUM(playcount_yearly) AS total_user_plays
FROM user_track_ranks
GROUP BY 1, 2;
")

## [1] 61642

# aggregating yearly average
indiv_tails_summary_100 <- dbGetQuery(con, "
SELECT year, AVG(personal_tail_share) as mean_personal_tail_share
FROM user_personal_metrics
GROUP BY year ORDER BY year;")

cat("\n---  MEAN INDIVIDUAL TAIL SHARE BY YEAR ---\n")

## 
## ---  MEAN INDIVIDUAL TAIL SHARE BY YEAR ---

print(indiv_tails_summary_100, row.names = FALSE)

##  year mean_personal_tail_share
##  1970                0.0000000
##  2005                0.5175638
##  2006                0.5397971
##  2007                0.5498866
##  2008                0.5677136
##  2009                0.5541268
##  2010                0.5582624
##  2011                0.5652173
##  2012                0.5973070
##  2013                0.5559723
##  2014                0.5482123
##  2015                0.5378088
##  2016                0.5482287
##  2017                0.5551845
##  2018                0.5657919
##  2019                0.6110878
##  2020                0.5311157

Individual Tails

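The plot below illustrates the share of individual tails when the head is defined as each user’s top 100 tracks. With this narrower head, the average tail share is much higher and fluctuates between approx. 52% and 61% over 2005-2020, without the steady rise seen under the top 1000 definition.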

plot_data_100 <- indiv_tails_summary_100 %>%
  filter(year >=2005)
ggplot(plot_data_100, aes(x = year, y = mean_personal_tail_share)) +
  geom_line(color = "darkblue", linewidth = 1) +
  geom_point(color = "darkred", size = 3) +
  scale_y_continuous(labels = percent_format(), 
                     limits = c(0, max(plot_data_100$mean_personal_tail_share) * 1.1)) +
  scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
  labs(x = "Year", y = "Avg Individual Tail Share (%)", title = "Individual Tails Over the Years") +
  theme_minimal()

# merging individual rankings with global 
dbExecute(con, "CREATE OR REPLACE TABLE individual_global_overlap AS
WITH user_track_data AS (
    SELECT 
        m.user_id, m.track_id, m.playcount_yearly,
        YEAR(CAST(m.timestamp AS DATE)) as year,
        ROW_NUMBER() OVER(PARTITION BY m.user_id, YEAR(CAST(m.timestamp AS DATE)) 
                          ORDER BY m.playcount_yearly DESC) as personal_rank
    FROM data_yearly m
    WHERE m.user_id IN (SELECT user_id FROM target_ids)
)
SELECT 
    u.*,
    r.global_rank,
    r.catalog_size,
    -- TRUE when the track ranks below the top 10% of the yearly catalogue (global tail)
    (r.global_rank > (0.1 * r.catalog_size)) as is_global_tail
FROM user_track_data u
LEFT JOIN global_ranks r ON u.track_id = r.track_id AND u.year = r.year;")

## [1] 104324124

# calculating discovery as the rate of personal tail within the global tail  
dbExecute(con, "CREATE OR REPLACE TABLE discovery_metrics AS
SELECT 
    user_id, 
    year,
    SUM(CASE WHEN is_global_tail AND personal_rank > 100 THEN playcount_yearly ELSE 0 END) * 1.0 / 
    NULLIF(SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END), 0) AS global_tail_discovery_ratio,
    -- share of the global tail in personal consumption
    SUM(CASE WHEN is_global_tail THEN playcount_yearly ELSE 0 END) * 1.0 / SUM(playcount_yearly) AS global_tail_share
FROM individual_global_overlap
GROUP BY 1, 2;")

## [1] 61642

The plot below illustrates the annual changes in the discovery ratio. It implies that most global long tail items are consumed in the users’ own long tail, being explored and discovered rather than played intensively. The discovery ratio mostly follows an upward trend; however, a marked stagnation can be seen in 2012-2018. Combined with the fact that the tail share was rising during those years, this would imply that the users who stayed in the dataset following the mass exodus in 2012 are the more focused niche users, whose long tail consumption is mostly encompassed by their top tracks. The substantial values of the discovery ratio support the second hypothesis: most users eclectically combine the consumption of both popular and niche items. The discovery ratio also differs substantially depending on whether the user’s tail is defined as the tracks outside the top 100 or outside the top 1000, which would indicate that aggregate long tail consumption is mostly located in the middle of the user’s personal ranking (between the 100th and the 1000th track).

global_tail_share_summary_100 <- dbGetQuery(con, "SELECT
                                        year, 
                                        AVG(global_tail_share) as mean_global_tail_share, 
                                        AVG(global_tail_discovery_ratio) as mean_discovery_ratio 
                                        FROM discovery_metrics
                                        WHERE year >= 2005 AND year <=2020
                                        GROUP BY year
                                        ORDER BY year;")
ggplot(global_tail_share_summary_100, aes(x = year, y = mean_discovery_ratio)) +
  geom_line(color = "purple", linewidth = 1.2) +
  geom_point(color = "black", size = 3) +
  scale_y_continuous(labels = scales::percent_format()) +
  scale_x_continuous(breaks = seq(2005, 2020, by = 2)) +
  labs(
    title = "Discovery Ratio",
    subtitle = "Share of individual tail within the global tail consumption",
    x = "Year",
    y = "Discovery Ratio (%)"
  ) +
  theme_minimal()
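The middle-tail argument rests on a simple identity: for a given user-year, the discovery ratio at the narrower cut-off minus the discovery ratio at the wider cut-off equals the share of global-tail plays that sit between the two cut-offs. The sketch below checks this on made-up data. The helper name, the toy playcounts and the cut-offs of 1 and 4 (standing in for 100 and 1000) are assumptions.

```r
# Hypothetical helper: share of global-tail plays ranked below `head_size`
discovery_ratio <- function(playcounts, is_global_tail, head_size) {
  ord   <- order(playcounts, decreasing = TRUE)
  plays <- playcounts[ord]
  niche <- is_global_tail[ord]
  niche_total <- sum(plays[niche])
  if (niche_total == 0) return(NA_real_)  # no global-tail plays at all
  in_tail <- seq_along(plays) > head_size
  sum(plays[niche & in_tail]) / niche_total
}

# Toy user-year: six tracks, four of them globally niche
plays <- c(40, 25, 15, 10, 6, 4)
niche <- c(FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

ratio_narrow <- discovery_ratio(plays, niche, head_size = 1)  # stands in for top 100
ratio_wide   <- discovery_ratio(plays, niche, head_size = 4)  # stands in for top 1000

# Share of global-tail plays located between the two cut-offs (ranks 2 to 4)
middle_share <- ratio_narrow - ratio_wide
middle_share  # (25 + 15) / 50 = 0.8
```

A large gap between the two ratios therefore means that most of a user’s globally niche listening is concentrated in tracks that are neither personal favourites nor rarely played.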

Listener Profiles

To identify the drivers of long tail consumption at the individual level, an Ordinary Least Squares (OLS) regression was estimated with the Global Tail Share (the proportion of listening dedicated to the bottom 90% of the global catalogue) as the dependent variable. The model accounts for demographic characteristics (age and gender), consumption intensity (the logarithm of total annual playcounts) and temporal period effects (year dummies).

The regression results indicate that long tail consumption is significantly associated with a user’s demographic profile. The analysis yields the following results:

age is a positive and highly significant predictor of niche consumption: for every additional year of age, a user’s share of global long tail tracks increases by 0.47 percentage points, which supports the hypothesis that older listeners possess more stable and specialized preferences that deviate from current mainstream trends (Davies et al. 2022, Glevarec et al. 2020).
male and non-binary users exhibit a significantly higher preference for niche content than the female baseline, with a global tail share approximately 3.8 percentage points higher in both groups, holding other variables constant. This may be a result of a bias in algorithmic recommendation tools, as analyzed by Lesota et al. (2021).
the significantly negative coefficient on the logarithm of a user’s total annual playcounts (-0.0135) suggests that as users increase their activity, a larger proportion of it is dedicated to mainstream content rather than the tail. This might reflect a double jeopardy effect, as described by McPhee (1963), in which heavy users frequently return to a core set of popular hits.
the year-specific dummy variables capture the trends that affected all users equally. All of them are significantly negative, but they are measured relative to the omitted reference year, which in this output is 1970 (the only year in the data without a dummy), so their sign alone should be interpreted with caution. Their pattern mirrors the descriptive results: the coefficients become more negative until 2012 and less negative afterwards, in line with a global tail share that was declining in 2005-2012 and rising in 2012-2020. To the extent that the negative values reflect a market-wide pressure towards consumption concentration, the later rise would be attributable to personal characteristics rather than global trends.
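The effect sizes listed above can be reproduced with a few lines of arithmetic. The sketch below uses the coefficients reported in the regression output of this section. The choice of illustrative contrasts (ten years of age, a doubling of annual plays) is an assumption made here for interpretation.

```r
# Coefficients copied from the regression output in this section
coef_age      <- 0.00470197
coef_male     <- 0.03848193
coef_log_play <- -0.01353101

# The dependent variable is a share in [0, 1]; multiply by 100 for percentage points
age_10y_pp  <- coef_age * 10 * 100           # ten additional years of age
male_pp     <- coef_male * 100               # male vs. female baseline
doubling_pp <- coef_log_play * log(2) * 100  # doubling total annual plays

round(c(age_10y = age_10y_pp, male = male_pp, doubling_plays = doubling_pp), 2)
```

Under these assumptions, ten additional years of age correspond to roughly 4.7 percentage points more global tail listening, while doubling annual plays corresponds to a decrease of just under 1 percentage point, so the intensity effect is statistically significant but small in magnitude.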

# importing demographics and connecting them to the data
path_demo <- "/Users/gosia/Desktop/sampled_users.tsv"
dbExecute(con, sprintf("CREATE OR REPLACE TABLE demographics AS SELECT * FROM read_csv_auto('%s');", path_demo))

## [1] 10000

df_final <- dbGetQuery(con, "
SELECT s.*, d.age, d.gender, d.country 
FROM discovery_metrics s
JOIN demographics d ON s.user_id = d.user_id;")

# linear regression with demographic information
model_demo <- lm(global_tail_share ~ age + gender + log(total_annual_plays) + as.factor(year), 
                 data = dbGetQuery(con, "SELECT s.*, d.age, d.gender, u.total_user_plays as total_annual_plays 
                                         FROM discovery_metrics s 
                                         JOIN demographics d ON s.user_id = d.user_id
                                         JOIN user_personal_metrics u ON s.user_id = u.user_id AND s.year = u.year"))

cat("\n--- DEMOGRAPHIC MODEL (GLOBAL TAIL SHARE) ---\n")

## 
## --- DEMOGRAPHIC MODEL (GLOBAL TAIL SHARE) ---

print(summary(model_demo))

## 
## Call:
## lm(formula = global_tail_share ~ age + gender + log(total_annual_plays) + 
##     as.factor(year), data = dbGetQuery(con, "SELECT s.*, d.age, d.gender, u.total_user_plays as total_annual_plays \n                                         FROM discovery_metrics s \n                                         JOIN demographics d ON s.user_id = d.user_id\n                                         JOIN user_personal_metrics u ON s.user_id = u.user_id AND s.year = u.year"))
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.64031 -0.11517 -0.04046  0.07858  0.82189 
## 
## Coefficients:
##                            Estimate  Std. Error t value             Pr(>|t|)
## (Intercept)              0.84363423  0.17729397   4.758           0.00000196
## age                      0.00470197  0.00009449  49.759 < 0.0000000000000002
## genderm                  0.03848193  0.00172407  22.320 < 0.0000000000000002
## gendern                  0.03881655  0.00363793  10.670 < 0.0000000000000002
## log(total_annual_plays) -0.01353101  0.00036869 -36.700 < 0.0000000000000002
## as.factor(year)2005     -0.40166966  0.17802392  -2.256             0.024057
## as.factor(year)2006     -0.50903072  0.17753987  -2.867             0.004143
## as.factor(year)2007     -0.56764036  0.17740367  -3.200             0.001376
## as.factor(year)2008     -0.61749390  0.17735438  -3.482             0.000499
## as.factor(year)2009     -0.64965674  0.17733074  -3.664             0.000249
## as.factor(year)2010     -0.67434958  0.17731704  -3.803             0.000143
## as.factor(year)2011     -0.69482576  0.17730880  -3.919           0.00008911
## as.factor(year)2012     -0.71616291  0.17730552  -4.039           0.00005371
## as.factor(year)2013     -0.70590311  0.17730475  -3.981           0.00006862
## as.factor(year)2014     -0.69725612  0.17730706  -3.932           0.00008417
## as.factor(year)2015     -0.67673528  0.17730977  -3.817             0.000135
## as.factor(year)2016     -0.66268674  0.17731347  -3.737             0.000186
## as.factor(year)2017     -0.65499628  0.17731684  -3.694             0.000221
## as.factor(year)2018     -0.64627542  0.17732066  -3.645             0.000268
## as.factor(year)2019     -0.64192560  0.17732834  -3.620             0.000295
## as.factor(year)2020     -0.56990586  0.17732948  -3.214             0.001310
##                            
## (Intercept)             ***
## age                     ***
## genderm                 ***
## gendern                 ***
## log(total_annual_plays) ***
## as.factor(year)2005     *  
## as.factor(year)2006     ** 
## as.factor(year)2007     ** 
## as.factor(year)2008     ***
## as.factor(year)2009     ***
## as.factor(year)2010     ***
## as.factor(year)2011     ***
## as.factor(year)2012     ***
## as.factor(year)2013     ***
## as.factor(year)2014     ***
## as.factor(year)2015     ***
## as.factor(year)2016     ***
## as.factor(year)2017     ***
## as.factor(year)2018     ***
## as.factor(year)2019     ***
## as.factor(year)2020     ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.1772 on 61621 degrees of freedom
## Multiple R-squared:  0.1257, Adjusted R-squared:  0.1254 
## F-statistic: 442.9 on 20 and 61621 DF,  p-value: < 0.00000000000000022
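In the output above, every year from 2005 to 2020 has its own dummy, so the reference level of `as.factor(year)` is the remaining year in the data, 1970. The sketch below shows, on simulated data, how the baseline changes once such a row is dropped and the reference year is set explicitly with `relevel()`. The toy data, the variable names and the choice of 2005 as reference are assumptions, not the author’s specification.

```r
# Simulated data: one stray 1970 observation plus two regular years
set.seed(1)
toy <- data.frame(
  year  = c(1970, rep(2005, 50), rep(2006, 50)),
  share = c(0.9, rnorm(50, mean = 0.30, sd = 0.05), rnorm(50, mean = 0.25, sd = 0.05))
)

# Default: the first factor level (1970) becomes the baseline,
# so the intercept is driven by a single observation
fit_default <- lm(share ~ as.factor(year), data = toy)

# Alternative: drop the stray year and make 2005 the explicit reference
toy_clean <- subset(toy, year >= 2005)
toy_clean$year_f <- relevel(factor(toy_clean$year), ref = "2005")
fit_clean <- lm(share ~ year_f, data = toy_clean)

names(coef(fit_default))  # intercept plus dummies for 2005 and 2006
names(coef(fit_clean))    # intercept plus a dummy for 2006 only
```

With an explicit reference year, each year coefficient reads directly as the difference from that year, which makes the sign of the period effects easier to interpret.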

The model indicates that, while the digital era has provided the tools for individuals to tailor their consumption to their eclectic tastes, this process is heavily moderated by age and gender. Furthermore, the most active consumers are not necessarily the most diverse, as the pull of the superstar effect remains a dominant force across the digital landscape.

Bibliography

Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.

Bourreau, M., Moreau, F., & Wikström, P. (2022). Does digitization lead to the homogenization of cultural content?. Economic Inquiry, 60(1), 427-453.

Brynjolfsson, E., Hu, Y., & Smith, M. D. (2003). Consumer surplus in the digital economy: Estimating the value of increased product variety at online booksellers. Management Science, 49(11), 1580–1596. http://www.jstor.org/stable/4134002

Dalla Torre, P., Fantozzi, P., & Naldi, M. (2025). Unraveling the Long Tail Phenomenon. THE MATTER OF INTELLECTUAL PROPERTY, 62.

Davies, C., Page, B., Driesener, C., Anesbury, Z., Yang, S., & Bruwer, J. (2022). The power of nostalgia: Age and preference for popular music. Marketing Letters, 33(4), 681–692.

Elberse, A. (2008). Should you invest in the long tail?. Harvard Business Review, 86(7/8), 88.

Glevarec, H., Nowak, R., & Mahut, D. (2020). Tastes of our time: Analysing age cohort effects in the contemporary distribution of music tastes. Cultural Trends, 29(3), 182–198.

Goel, S., Broder, A., Gabrilovich, E., & Pang, B. (2010, February). Anatomy of the long tail: ordinary people with extraordinary tastes. In Proceedings of the third ACM international conference on Web search and data mining (pp. 201-210).

Lesota, O., Melchiorre, A., Rekabsaz, N., Brandl, S., Kowald, D., Lex, E., & Schedl, M. (2021, September). Analyzing item popularity bias of music recommender systems: are different genders equally affected?. In Proceedings of the 15th ACM conference on recommender systems (pp. 601-606).

Liebowitz, S., Ward, M., & Zentner, A. (2025). Only a “longish” tail. Production and Operations Management, 34(8), 2331-2347.

McPhee, W. N. (1963). Formal theories of mass behavior.

Neiman, B., & Vavra, J. (2023). The rise of niche consumption. American Economic Journal: Macroeconomics, 15(3), 224-264.

Schedl, M., Brandl, S., Lesota, O., Parada-Cabaleiro, E., Penz, D., & Rekabsaz, N. (2022, March). LFM-2b: A dataset of enriched music listening events for recommender systems research and fairness analysis. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval (pp. 337-341).

Waldfogel, J. (2017). How digitization has created a golden age of music, movies, books, and television. Journal of Economic Perspectives, 31(3), 195-214.

Music TAILoring: How Individual Habits Shape the Aggregate Long Tail in the Digital Era

Małgorzata Hodurek
