Introduction

The current landscape of digital content, in the era of streaming platforms, can, at times, seem hard to navigate for a consumer. The democratization of access, both to the consumption, as well as the production and publishing of music that followed digitization, can simultaneously feel like a blessing and a curse. The challenge that current music consumers face lies not within the limitations imposed by rigid supplies of the old brick-and-mortar shops, but in finding the desired product in the thicket of each streaming platform’s near infinite assortment. Machine learning has not been indifferent to this issue – ushering the users towards the items they are most likely to prefer with the help of recommendation algorithms. While to the ordinary eye these may look like black boxes, generating recommendations out of the thin air, riddled with various advertisements of whatever record label owes the highest number of shares in a given streaming platform, they often rely heavily on clustering. The implementation of unsupervised learning methods provides a crucial insight into the nature of platform’s supply, as well as demand – users, whose listening histories have been analyzed to obtain their consumption habits receive more personalized recommendation of products, who have previously been analyzed and grouped so as to match user’s preferences.

The aim of this project is to perform a similar analysis of user’s behavioral music consumption patterns – to see whether any specific patterns of user’s navigation across the long tail of music products can be detected. For this purpose, clustering methods will be utilized to discover segmentation of music listeners based on their registered listening habits. The main purpose of cluster analysis is to discover the natural groupings within the data, characterized by high intra-class similarity and inter-class dissimilarity.

This study is structured as follows: the first section provides an overview of the dataset used in this project and prepares the data for further analysis. Subsequently, additional behavioral metrics, derived from literature concerning music recommender systems, are computed for each user in this project’s data sample. The derived variables are then analyzed and explorative clustering analysis is performed – the clusterability of data is investigated and the optimal number of clusters is analyzed. Afterwards, clustering algorithms, such as K-Means and Hierarchical Clustering are performed and subsequently evaluated in terms of statistical purity and practical utility. The clusters are then analyzed through the lens of demographic variables registered in the original dataset.


Data Description and Preprocessing

The analysis in this project will be based on LFM-2b – a dataset made available by Schedl et al. (2022) – currently accessible solely through the Wayback Machine, as the source platform revoked the license for publishing large datasets derived from their site. LFM-2b contains listening records of over 120,000 users of a music platform - last.fm, amounting to more than 2 billion listening events (interactions between a distinct user and a specific song enriched by its metadata) of more than 50 million distinct tracks by over 5 million artists. Additionally, LFM-2b contains demographic information concerning users (age, gender, country) and additional track-level metadata, such as tags, microgenres and vector embeddings of lyrics. LFM-2b includes 10 data elements contained in separate files, focused on varying levels of metadata. For this analysis I will use: - listening-events – the main data file containing tuples of <user id, track id, album id, timestamp>, - users – file linking distinct users with their demographic data, <user id, country, age, gender, creation time>, - tags-micro-genres – file linking distinct tracks to the user-generated tags filtered so as to only represent genres provided by Every Noise at Once.

Due to the exceptionally large size of the dataset and limited computational powers, in this study I will draw a semi-random sample of 10,000 users whose listening events the further analysis will be focused on. In order to ensure proper user-level metadata representation, the sample is picked out of those users, who provide information of their gender and country, and whose age ranges betweeen 10 to 95 years. Thus filtered dataset consists of 42,683 users, meaning that only 35% of the users in the dataset provide representative demographic data.

## Classes 'data.table' and 'data.frame':   120322 obs. of  5 variables:
##  $ user_id      : int  0 1 2 3 4 5 6 7 8 9 ...
##  $ country      : chr  "UK" "US" "UK" "BR" ...
##  $ age          : int  31 43 35 31 51 38 28 -1 36 45 ...
##  $ gender       : chr  "m" "m" "m" "m" ...
##  $ creation_time: POSIXct, format: "2002-12-28 01:00:00" "2003-04-15 02:00:00" ...
##  - attr(*, ".internal.selfref")=<externalptr>
## [1]   0 112

Further analysis will be focused on the previously selected users, whose listening histories will be imported from the main data file (listening-events.tsv). The original dataset records the activity of 120,322 users spanning over the period between February 14th 2005 to March 20th 2020. This adds up to a total of over 2 billions listening events, hence the need for sampling and for importing target users from the file with the use of DuckDB library. DuckDB is an in-process SQL OLAP database management system, which allows for a faster and more computationally efficient data processing (DuckDB Docs).

# creating a connection (DuckDB in memory)
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")

# reading sampled user id list into the connection
dbWriteTable(con, "target_users", sampled_users %>% select(user_id))

input_file <- "/Users/gosia/Desktop/tails/lastfm_data/listening-events.tsv"
output_file <- "/Users/gosia/Desktop/sampled_events.csv"
# defining the query - extracting the listening events of target users
query <- paste0("
SELECT 
    le.user_id, 
    le.track_id, 
    le.timestamp
FROM read_csv('", input_file, "', delim='\t', header=TRUE, auto_detect=TRUE) AS le
INNER JOIN target_users tu ON CAST(le.user_id AS VARCHAR) = tu.user_id
")
# execute the query and save 
sql_cmd <- paste0("COPY (", query, ") TO '", output_file, "' (HEADER, DELIMITER ',')")

dbExecute(con, sql_cmd)
## [1] 296947364
dbDisconnect(con)

Calculating Behavioral Metrics

Diversity

User’s diversity is defined in three ways, one based on the entirety of user’s listening habits, indicating what portion of user’s total playcount (the sum of all of user’s listening events) is constituted by unique tracks, the other two representing user’s micro-genre diversity.

Track Level

Track level diversity (\(TD_i\)) is computed for each user in the sample as:

\[TD_{i} = {UT}_i/{TP}_i\] Where \(UT_i\) are the distinct, unique tracks in the entirety of user’s \(i\) listening history and \(TP_i\) is the total amount of listening events constituting user’s \(i\) listening history.

# connecting to the file 
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet")
# calculating track diversity 
diversity_results <- ds %>%
  group_by(user_id) %>%
  summarise(total_playcount = n(),
    unique_tracks = n_distinct(track_id)) %>%
  mutate(diversity_score = unique_tracks / total_playcount) %>%

  collect()

# taking a look
head(diversity_results)
## # A tibble: 6 × 4
##   user_id total_playcount unique_tracks diversity_score
##     <int>           <int>         <int>           <dbl>
## 1   34590           30832          9698          0.315 
## 2   83295           13059          5510          0.422 
## 3   61327           21333          1732          0.0812
## 4    2409           25032          7896          0.315 
## 5   21574           22595          3366          0.149 
## 6   16648           12215          7513          0.615

Micro-genre Level

Micro-genre file contains user-generated tags describing tracks with more than 10 listens globally, filtered so as to only present the tags defining music genres and micro-genres provided by Every Noise at Once. Therefore not all of the songs represented in users’ listening histories will be assigned a micro-genre, however those songs are either quite obscure (less than 10 listens globally), or not represented by uniformly informative tags (the downside to user-generated tags can be their lack of informative value, due to humorous tags providing deceitful information or a lot of noise).

To compute micro-genre level diversity, I will limit the analysis to such extent so as to include only the dominant tag for each track. A lot of records in the dataset were described with tags such as: “pop”, “indie pop”, “alternative pop”, most often than not, in the same order. The latter tags do not deepen the understanding of the track’s genre in terms of diversity. Therefore, following the rule of sparsity - only the first tag will be considered in further analysis.

# loading file
lines <- read_lines("/Users/gosia/Desktop/tails/lastfm_data/tags-micro-genres.json")
track_genres <- data.table(track_id = as.integer(str_extract(lines, "(?<=\"i\":)\\d+")), # extracting track_id (in the file a number following " "i":")
                        top_genre=str_extract(lines,"(?<=\"tags\":\\{\")[^\"]+")) # extracting top_genre - always the first appearing on the list, following ""tags":"

# cleaning RAM
rm(lines)

# taking a look
head(track_genres)
##    track_id  top_genre
##       <int>     <char>
## 1: 36346257        pop
## 2:  7247506        pop
## 3: 10540381 indie rock
## 4: 36367985       soul
## 5: 36039983     grunge
## 6: 24361147       folk
# saving file 
write_parquet(track_genres, "/Users/gosia/Desktop/top_genres.parquet")

setDT(track_genres)
total_n_genres <- uniqueN(track_genres$top_genre)

# aggregating users listening history 
# simplifying the listening events dataset so that it is in a <user_id, track_id, playcount> format
# where playcount is the number of times that a given track was listened to by a given user
con <- dbConnect(duckdb::duckdb()) 
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet") %>% 
  to_duckdb(con = con, table_name = "events")

sampled_playcounts <- ds %>%
  group_by(user_id, track_id) %>%
  summarise(
    playcount = n(),
    .groups = "drop"
  ) %>%
  collect()

dbDisconnect(con)
setDT(sampled_playcounts)

# connecting dataset with genres 

full_data <- merge(sampled_playcounts, track_genres, by = "track_id", all.x = TRUE)
# tags are provided only for the songs with more than 10 scrobbles on the platform
# hence not all songs have genres
# percent of na values
sum(is.na(full_data$top_genre))/length(full_data$track_id) # 41% NAs 
## [1] 0.4146919
full_data[is.na(top_genre), top_genre := "Unknown"]
Genre Ratio

One metric of user’s genre diversity denotes the ratio of genres listened by the user \(i\) (\(G_i\)) and the total amount of genres represented in the sample of listening histories (\(TG\)).

\[ GD_i = G_i/TG \]

Genre Entropy

The former representation of genre diversity may not be the one that provides the most information on the user’s actual listening habits. Both, user who listens to mainly rock music, but has tried pop, metal, hip-hop or jazz once or twice and the user who listens to all of those genres equally will have the same score.

To alleviate this issue an alternative metric of user genre diversity – based on Shannon entropy – is proposed. It is computed using the following formula: \[ GE_i = - \sum_{g \in G_i} p(g) * log_2p(g) \] Where, \(GE_i\) representes genre entropy for a user \(i\), \(G_i\) represents all of the unique micro-genres replayed by a user \(i\) and \(p(g)\) is the probability (relative frequency) of listening to a micro-genre \(g\), calculated for each user as follows: \[ p(g) = \text {playcount of genre g}/\text {user's total playcount}\]

### CALCULATING DIVERSITY METRICS ###

# Shannon entropy
calculate_weighted_entropy <- function(counts) {
  total <- sum(counts)
  if(total == 0) return(0)
  p <- counts/total # probability of a given genre being played
  -sum(p*log2(p)) # Shannon entropy formula
}

# user - genre table 
user_genre <- full_data[, .(genre_playcount = sum(playcount)),
                        by = .(user_id, top_genre)]

# calculating metrics for each user
user_genre_metrics <- user_genre[, .(unique_genres = .N,
                            genre_ratio = .N/total_n_genres,
                            genre_entropy = calculate_weighted_entropy(genre_playcount)),
                            by = user_id]
head(user_genre_metrics)
##    user_id unique_genres genre_ratio genre_entropy
##      <int>         <int>       <num>         <num>
## 1:    7546           163  0.07882012      3.388006
## 2:   53343            74  0.03578337      3.336050
## 3:   29864           138  0.06673114      3.801346
## 4:   98958           289  0.13974855      3.412806
## 5:    6307           321  0.15522244      4.176551
## 6:    1342           274  0.13249516      3.255282

Novelty

The methodology for calcualting novelty is adapted from Schedl and Hauger (2015), it models user’s tendency for listening previously unknown music. It is calculated through dividing user’s listening history into 6 month periods and computing a ratio of novel items – those appearing for the first time in the entirety of user’s listening history. Subsequently, overall novelty score of a user \(i\) is computed by averaging over all of the 6 month time windows.

It is computed using the following formula: \[ N_{i_t} = NT_{i_t}/ TP_{i_t} \] Where \(NT_{i_t}\) denotes the amount of novel tracks in a time window \(t\) for user \(i\) and \(TP_{i_t}\) is the total playcount of a given user in time slot \(t\).

Subsequently, total novelty for a user \(i\) is calculated as: \[ N_i = 1/n_t \sum_{\text t \in n_t} N_{i_t} \] Where \(n_t\) is the number of all of the time windows that the novelty score – \(N_{i_t}\) was computed for.

# connecting to DuckDB
con <- dbConnect(duckdb::duckdb())

# installing icu extension within duckdb for date handling 
dbExecute(con, "INSTALL icu;")
## [1] 0
dbExecute(con, "LOAD icu;")
## [1] 0
# importing file to the etsbalished connection 
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet") %>% 
  to_duckdb(con = con, table_name = "events")

# calculating discovery_moments - the first time that a user listened to a given track
discovery_moment <- ds %>%
  group_by(user_id, track_id) %>%
  summarise(first_timestamp = min(timestamp), .groups = "drop")

### calculating novelty metrics in 6 month long windows
novelty_metrics <- ds %>%
  left_join(discovery_moment, by = c("user_id", "track_id")) %>%
  mutate(curr_year = year(timestamp),
         curr_month = month(timestamp),
         disc_year = year(first_timestamp),
         disc_month = month(first_timestamp)) %>% # establishing time windows
  # mutating window id so that they are in a YYYYH format, where YYYY denotes the year and H which half of the year the event took place in
  mutate(current_window = curr_year*10 + if_else(curr_month<= 6, 1, 2), 
         discovery_window = disc_year*10 + if_else(disc_month <= 6, 1, 2)) %>%
  mutate(is_novel = if_else(current_window == discovery_window, 1, 0)) %>%
  # grouping by users and windows
  group_by(user_id, current_window) %>%
  summarise(window_novelty = mean(is_novel), 
    .groups = "drop") %>%
  # average for users
  group_by(user_id) %>%
  summarise(
    novelty_score = mean(window_novelty)) %>%
  collect()

# cleaning up and disconnecting
dbDisconnect(con)
head(novelty_metrics)
## # A tibble: 6 × 2
##   user_id novelty_score
##     <int>         <dbl>
## 1   94635         0.597
## 2   31374         0.756
## 3     634         0.625
## 4   26948         0.450
## 5   15989         0.588
## 6   19643         0.653

Mainstreaminess

Another behavioral metric describing user’s listening habits is represented by mainstreaminess (Schedl and Hauger, 2015), which shows user’s inclination to listen to songs that are popular, or mainstream.

In this analysis, mainstream tracks are defined by the frequency with which they were listened to in the entire timespan of the data cover on LFM-2b. To construct a ranking of all of the songs in the dataset, the file containing <user id, track id, playcount> – listening-counts.tsv – is used. For each track, all of the users’ playcounts are aggregated (global playcount).

# big dataset - duckdb
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
path <- '/Users/gosia/Desktop/tails/lastfm_data/listening-counts.tsv'
dbExecute(con, paste0("
                      CREATE OR REPLACE VIEW all_counts AS
                      SELECT * FROM read_csv('", path, "', delim = '\t', header = TRUE, auto_detect = TRUE)"))
## [1] 0
# query for aggregating all of the times that a track was played
query <- "SELECT 
          track_id,
          SUM(count) as global_playcount
          FROM all_counts
          GROUP BY track_id"

# executing and saving
global_output <- "/Users/gosia/Desktop/global_track_ranks.csv"
dbExecute(con, paste0("COPY (", query, ") TO '", global_output, "' (HEADER, DELIMITER ',')"))
## [1] 50813373
dbDisconnect(con)

global_tracks <- fread('/Users/gosia/Desktop/global_track_ranks.csv')
options(scipen = 999)

plot_data <- global_tracks[global_playcount >= 5]

plot_data$log <- log10(plot_data$global_playcount)

# plot of raw distribution, zoomed 
p_raw <- ggplot(plot_data, aes(x = global_playcount)) +
  geom_histogram(fill = "violet", color = "white", binwidth = 10) +
  coord_cartesian(xlim = c(0, 500)) + 
  labs(title = "Raw Distribution", 
       x = "Global Playcount",
       y = "Track Count") +
  theme_minimal()

# log plot
p_log <- ggplot(plot_data, aes(x = log)) +
  geom_histogram(bins = 60, fill = "purple", color = "white", alpha = 0.9) +
  labs(title = "Log-Transformed Distribution", 
       x = "Log10(Global Playcount)",
       y = "Track Count") +
  theme_minimal()

grid.arrange(p_raw, p_log, ncol = 2)

The distribution of track popularity clearly illustrates the long tail phenomenon (Anderson, 2006) – characteristic for cultural industries post-digitization. Digitization has drastically transformed the supply of music accessible to an average listener. From the old brick-and-mortar shops with strictly limited supply, promoting mostly the best-selling products and, hence, perpetuating their popularity, through the rise of music piracy and easily distributable digital copies in the early 2000s, all the way to the streaming platforms used nowadays – the available supply of music products has increased exponentially in the 21st century. This development led to a shift in how consumers explore and navigate across as good as an infinite supply of experience goods, whose true value is known to the consumer only after the good is consumed. As made visible by the plots above, most of the songs in the dataset have been replayed less than 100 times globally, however there still persist songs that have been replayed many times, constituting the mainstream. Due to this distribution of track popularity, logarithmic transformation is applied to the global playcount in further analysis.

Subsequently, the mainstreaminess score of each track is computed as: \[ M_s = log_{10}(\text{global playcount}_s + 1)/log_{10}(\text{max(global playcount)+1})\] Where \(s\) is a subscript denoting a track \(s\) and \(max(\text{global playcount})\) is the number of times that the top 1 track in the dataset was played.

Mainstreaminess score per user is then calculated as a mean of \(M_s\) in user’s history weighted by user’s playcount of that song.

# logarithmic normalization
max_playcount <- max(global_tracks$global_playcount)

# formula: log(playcount + 1) / log(max + 1)
global_tracks[, track_score := log10(global_playcount + 1) / log10(max_playcount + 1)]

# cleaning up - only the score needed
global_tracks[, global_playcount := NULL]

# merging with sampled users' playcount history 
merged <- merge(sampled_playcounts, global_tracks, by = "track_id", all.x = TRUE)
summary(merged)
##     track_id           user_id         playcount          track_score    
##  Min.   :       1   Min.   :     1   Min.   :     1.00   Min.   :0.0552  
##  1st Qu.:13811903   1st Qu.:  9375   1st Qu.:     1.00   1st Qu.:0.3449  
##  Median :25210466   Median : 22198   Median :     1.00   Median :0.5172  
##  Mean   :25296113   Mean   : 29663   Mean   :     4.23   Mean   :0.4933  
##  3rd Qu.:37120819   3rd Qu.: 44877   3rd Qu.:     3.00   3rd Qu.:0.6577  
##  Max.   :50813367   Max.   :120319   Max.   :108903.00   Max.   :1.0000
# mainstreaminess per user 
mainstreaminess <- merged[, .(mainstreaminess_score = weighted.mean(track_score, w = playcount)),
                          by = user_id]

Listening Density

Listening density is a metric reflecting each user’s average yearly playcount. First, the number of years that each user was active is computed as the duration between user’s first activity on the platform (as in the date of user’s first listening event, not registration date) and the last activity registered in the dataset divided by 365 for yearly approximation.

This process led to the identification of 4 outlying observations – users, whose activity has been registered only one time in the duration of a one day – eliminated from further analysis. Another user is eliminated due to faulty data, his first registered activity was on 1970-01-01, while the last.fm platform has been running since 2002.

Afterwards, listening density is calculated as: \[ LD_i = TP_i/AY_i\] Where \(TP_i\) is the total playcount of a user and \(AY_i\) represents the number of years that a user was active.

con <- dbConnect(duckdb::duckdb())
parquet_input <- "/Users/gosia/Desktop/sampled_events.parquet"
# query for finding min and max date of listening events
query <- paste0(" SELECT
          user_id, 
          MIN(timestamp) as first_seen,
          MAX(timestamp) as last_seen
          FROM '", parquet_input, "'
          GROUP BY user_id")

user_lifespan <-dbGetQuery(con, query)
setDT(user_lifespan)
dbDisconnect(con)

# calculating the difference 
# in days
user_lifespan[, duration_days := as.numeric(difftime(last_seen, first_seen, units = "days"))]

user_lifespan[, one_timer := ifelse(duration_days<1, 1, 0)]
sum(user_lifespan$one_timer) # 4 one-timers - eliminate as outliers 
## [1] 4
# checking whether there are any data errors
summary(user_lifespan) # yes, min first_seen is 1970-01-01  
##     user_id         first_seen                    
##  Min.   :     1   Min.   :1970-01-01 01:00:36.00  
##  1st Qu.: 15476   1st Qu.:2009-12-29 18:42:44.50  
##  Median : 35662   Median :2011-06-30 16:23:32.00  
##  Mean   : 42538   Mean   :2010-12-25 23:10:08.85  
##  3rd Qu.: 64079   3rd Qu.:2012-04-08 21:26:48.00  
##  Max.   :120319   Max.   :2014-08-29 01:37:59.00  
##    last_seen                      duration_days       one_timer     
##  Min.   :2012-06-01 02:48:22.00   Min.   :    0.0   Min.   :0.0000  
##  1st Qu.:2013-10-14 19:05:45.75   1st Qu.:  959.9   1st Qu.:0.0000  
##  Median :2015-11-14 20:36:13.00   Median : 1936.4   Median :0.0000  
##  Mean   :2016-05-18 13:46:35.40   Mean   : 1970.6   Mean   :0.0004  
##  3rd Qu.:2019-09-05 05:57:08.25   3rd Qu.: 2835.9   3rd Qu.:0.0000  
##  Max.   :2020-03-20 12:59:58.00   Max.   :17248.2   Max.   :1.0000
# Unix Epoch glitch 
# last.fm has been running since 2002 - eliminate all of the users whose first_seen dates are before that as unreliable
to_eliminate <- user_lifespan[year(first_seen) < 2002 | duration_days < 1, user_id] # losing 5 users - saving to eliminate from final data
user_lifespan <- user_lifespan[!user_id %in% to_eliminate]
summary(user_lifespan) # all good
##     user_id         first_seen                    
##  Min.   :     1   Min.   :2005-02-14 01:07:46.00  
##  1st Qu.: 15468   1st Qu.:2009-12-29 17:19:38.00  
##  Median : 35671   Median :2011-06-30 13:21:22.00  
##  Mean   : 42534   Mean   :2010-12-27 02:09:53.03  
##  3rd Qu.: 64061   3rd Qu.:2012-04-08 20:09:25.50  
##  Max.   :120319   Max.   :2014-08-29 01:37:59.00  
##    last_seen                      duration_days       one_timer
##  Min.   :2012-06-01 02:48:22.00   Min.   :   2.07   Min.   :0  
##  1st Qu.:2013-10-15 06:24:05.00   1st Qu.: 960.69   1st Qu.:0  
##  Median :2015-11-14 23:44:52.00   Median :1937.12   Median :0  
##  Mean   :2016-05-18 23:00:52.91   Mean   :1969.87   Mean   :0  
##  3rd Qu.:2019-09-06 23:10:52.00   3rd Qu.:2835.91   3rd Qu.:0  
##  Max.   :2020-03-20 12:59:58.00   Max.   :5512.99   Max.   :0
# converting into years
user_lifespan[, active_years := duration_days/365]

write.csv(user_lifespan, "user_lifespan_sample.csv", row.names = F)

datasets <- list(sampled_users, user_lifespan, user_genre_metrics, novelty_metrics, mainstreaminess, diversity_results)

final_data <- datasets %>%
  reduce(function(x,y) merge(x,y, by = "user_id", all.x = TRUE))

# eliminate the users that were previously identified as outliers
final_data<- final_data[!user_id %in% to_eliminate]
colSums(is.na(final_data)) # no NAs
##               user_id               country                   age 
##                     0                     0                     0 
##                gender         creation_time            first_seen 
##                     0                     0                     0 
##             last_seen         duration_days             one_timer 
##                     0                     0                     0 
##          active_years         unique_genres           genre_ratio 
##                     0                     0                     0 
##         genre_entropy         novelty_score mainstreaminess_score 
##                     0                     0                     0 
##       total_playcount         unique_tracks       diversity_score 
##                     0                     0                     0
final_data[, listening_density := total_playcount/active_years]

Selecting Features for Clustering

Clustering will be performed only on the previously calculated–numeric–behavioral metrics: * Diversity (track-level), * Genre Entropy, * Novelty, * Mainstreaminess, * Listening Density.

Demographic information provided by the users will later be used for the analysis of the final clusters.

str(final_data)
## Classes 'data.table' and 'data.frame':   9995 obs. of  19 variables:
##  $ user_id              : int  1 5 8 12 25 31 32 35 48 50 ...
##  $ country              : chr  "US" "ES" "US" "US" ...
##  $ age                  : int  43 38 36 33 31 36 36 28 35 37 ...
##  $ gender               : chr  "m" "m" "m" "m" ...
##  $ creation_time        : POSIXct, format: "2003-04-15 02:00:00" "2003-07-22 02:00:00" ...
##  $ first_seen           : POSIXct, format: "2006-04-30 23:36:15" "2006-06-09 16:58:38" ...
##  $ last_seen            : POSIXct, format: "2014-02-20 23:05:56" "2018-06-14 18:32:00" ...
##  $ duration_days        : num  2853 4388 1570 3316 3462 ...
##  $ one_timer            : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ active_years         : num  7.82 12.02 4.3 9.08 9.48 ...
##  $ unique_genres        : int  65 355 173 185 290 172 178 141 77 141 ...
##  $ genre_ratio          : num  0.0314 0.1717 0.0837 0.0895 0.1402 ...
##  $ genre_entropy        : num  3.45 4.75 3.48 5.11 4.98 ...
##  $ novelty_score        : num  0.987 0.726 0.703 0.758 0.768 ...
##  $ mainstreaminess_score: num  0.417 0.537 0.423 0.609 0.55 ...
##  $ total_playcount      : int  868 52823 20000 16189 22597 12502 21642 22128 9947 16251 ...
##  $ unique_tracks        : int  775 24442 10180 7876 12961 8220 6755 4255 4569 5634 ...
##  $ diversity_score      : num  0.893 0.463 0.509 0.487 0.574 ...
##  $ listening_density    : num  111 4394 4649 1782 2383 ...
##  - attr(*, ".internal.selfref")=<externalptr> 
##  - attr(*, "sorted")= chr "user_id"
summary(final_data)
##     user_id         country               age           gender         
##  Min.   :     1   Length:9995        Min.   :10.00   Length:9995       
##  1st Qu.: 15468   Class :character   1st Qu.:20.00   Class :character  
##  Median : 35671   Mode  :character   Median :23.00   Mode  :character  
##  Mean   : 42534                      Mean   :25.01                     
##  3rd Qu.: 64061                      3rd Qu.:27.00                     
##  Max.   :120319                      Max.   :95.00                     
##  creation_time                      first_seen                    
##  Min.   :2003-02-10 19:05:23.00   Min.   :2005-02-14 01:07:46.00  
##  1st Qu.:2007-09-30 17:36:06.50   1st Qu.:2009-12-29 17:19:38.00  
##  Median :2009-07-07 16:35:47.00   Median :2011-06-30 13:21:22.00  
##  Mean   :2009-05-30 07:15:25.22   Mean   :2010-12-27 02:09:53.03  
##  3rd Qu.:2011-04-15 14:13:51.50   3rd Qu.:2012-04-08 20:09:25.50  
##  Max.   :2012-07-19 22:07:27.00   Max.   :2014-08-29 01:37:59.00  
##    last_seen                      duration_days       one_timer
##  Min.   :2012-06-01 02:48:22.00   Min.   :   2.07   Min.   :0  
##  1st Qu.:2013-10-15 06:24:05.00   1st Qu.: 960.69   1st Qu.:0  
##  Median :2015-11-14 23:44:52.00   Median :1937.12   Median :0  
##  Mean   :2016-05-18 23:00:52.91   Mean   :1969.87   Mean   :0  
##  3rd Qu.:2019-09-06 23:10:52.00   3rd Qu.:2835.91   3rd Qu.:0  
##  Max.   :2020-03-20 12:59:58.00   Max.   :5512.99   Max.   :0  
##   active_years       unique_genres    genre_ratio        genre_entropy    
##  Min.   : 0.005672   Min.   :  2.0   Min.   :0.0009671   Min.   :0.03367  
##  1st Qu.: 2.632024   1st Qu.: 68.0   1st Qu.:0.0328820   1st Qu.:3.30627  
##  Median : 5.307192   Median :114.0   Median :0.0551257   Median :3.87241  
##  Mean   : 5.396901   Mean   :126.8   Mean   :0.0613068   Mean   :3.74852  
##  3rd Qu.: 7.769602   3rd Qu.:172.0   3rd Qu.:0.0831721   3rd Qu.:4.32403  
##  Max.   :15.104076   Max.   :793.0   Max.   :0.3834623   Max.   :5.63307  
##  novelty_score     mainstreaminess_score total_playcount  unique_tracks   
##  Min.   :0.08837   Min.   :0.1616        Min.   :     9   Min.   :     9  
##  1st Qu.:0.51300   1st Qu.:0.5109        1st Qu.:  5306   1st Qu.:  1594  
##  Median :0.63544   Median :0.5732        Median : 18923   Median :  4248  
##  Mean   :0.64205   Mean   :0.5634        Mean   : 29707   Mean   :  7020  
##  3rd Qu.:0.76709   3rd Qu.:0.6273        3rd Qu.: 34722   3rd Qu.:  8932  
##  Max.   :1.00000   Max.   :0.8554        Max.   :662625   Max.   :151467  
##  diversity_score    listening_density  
##  Min.   :0.005538   Min.   :     7.38  
##  1st Qu.:0.188262   1st Qu.:  1665.20  
##  Median :0.298967   Median :  3624.64  
##  Mean   :0.341958   Mean   :  5518.25  
##  3rd Qu.:0.454493   3rd Qu.:  6857.03  
##  Max.   :1.000000   Max.   :177363.83

The visual inspection of the scatter plots reveals a high density and uniform coverage of the behavioral features, indicating that users are effectively differentiated. In sharp contrast to these, listening density displays a sparse distribution with few, but drastic outlying values, exceeding the value of 100,000 playcounts per year. These structural differences render the need for data standardization, so that the overall cluster distance calculations are not disproportionately dominated by the consumption volume, masking the more nuanced behavioral patterns to the algorithm.

To further examine the relationships between the final clustering variables, a correlation matrix is calculated. Its analysis reveals logical dependencies between the variables. There is a strong positive correlation (r=0.68) between diversity and novelty, indicating that users, who are more inclined to discover new tracks also tend to have a more diversified consumption. Interestingly, mainstreaminess correlates positively with genre entropy, which suggests that mainstream consumption also involves eclectic listening across popular hits from various genres. Listening density displays a moderate negative correlation with, both, diversity and novelty, implying that the most active users (heavy listeners) tend to have more focused and repetitive listening habits – favouring familiarity over novelty and exploration. All of the variables are retained for further analysis, as there is no severe multicollinearity within the data – all of the correlation coefficients remain below the critical threshold and none of the features are deemed redundant.

features <- c("genre_entropy", "novelty_score",
              "mainstreaminess_score", "diversity_score", "listening_density")

music_data <- final_data[, ..features]

# scaling data 
music_scaled <- scale(music_data)

Pre-clustering Assessment

Clusterability

Prior to the launch of clustering algorithms, the data needs to be further analyzed in terms of its clusterability and the appropriate parameter selections, such as the optimal number of clusters (defined as \(k\)).

First, Hopkin’s statistic is calculated in order to verify whether data is suitable for clustering. The value of Hopkin’s statistic for this project’s data is equal to \(0.8255\), therefore the null hypothesis of the uniformity of data’s distribution is rejected.

The Visual Assessment of Tendency (VAT) plot, displayed below, illustrates the Euclidean distance between data points reordered according to their mutual similarity, where red indicates low distance (inter-class similarity) and blue indicates high distance (intra-class dissimilarity). The plot generated for data used in this analysis displays a block-diagonal structure with reddish squares along the main diagonal–representing similar groups of observations– separated by the blue, perpendicular regions of high dissimilarity. The presence of these structures serves as visual evidence that the dataset possesses a non-random clusterable structure, proving the application of clustering algorithms feasible.

### HOPKINS STATISTIC ###
set.seed(123)
random_rows <- sample(1:nrow(music_scaled), 1000)
data_sample <- music_scaled[random_rows,]
get_clust_tendency(data_sample, n = 500, graph = TRUE)
## $hopkins_stat
## [1] 0.8255465
## 
## $plot

### The Optimal Number of Clusters

Having ensured that the data used in this project does, indeed, possess a non-random structure with potential for meaningful clusters, further examination of data structure is performed. This section will focus on the determination of the optimal number of meaningful clusters.

Sousa et al. (2019) find four main types of user behaviors (\(k=4\)) in terms of how curiosity stimuli captured by novelty, uncertainty, conflict and complexity drive the user accesses to online music. Performing Mini-Batch K-Menas on 150 millions of listening events generated by 43,000 users, they discovered four distinct types of user behavior: * Underground - describing access towards an item with similar stimuli metrics and a tendency to stay in the relaxation zone for each metric (moderate values), * Eclectic - with access regularly spread across each metric, which achieve higher values compared with the underground access, * Niche Radical - type of access, where the user listens to repeated songs (low novelty) with the songs being more comples, * Explorer - pursuing more complex, novel music.

While the authors stress that they are analyzing access profiles – a characteristic of listening events– rather than user profiles, I suspect that a similar analysis can be applied to user behavior as a whole. However, Marques et al. (2013) perform a similar analysis based on familiarity, mainstreaminess and metrics measuring attention – revealing seven user clusters (\(k=7\)), computed with Ward hierarchical clustering method: fond of surprises, mainstream upholder, mainstream explorer, crowd follower, niche radical, highly eclectic and underground.

There is an extensive and ever-growing field of literature concerning the analysis of music recommendation systems– a crucial tool in the post-digitization era providing an increasingly more personalized way to navigate the near infinite supply of digital content. Schedl and Hauger (2015) propose an implementation of user-specific characteristics modelling their listening behavior, such as: diversity, mainstreaminess and novelty. Based on those behavioral features they mannually divide the users into 14 categories based on imposed statistical divisions. This project implements the same approach used in their study but, instead of artificially imposed user segmentation, clustering algorithms will be applied to analyze the natural groupings of the data. Chang and Zhang (2022) use unsupervised learning methods, such as Principal Component Analysis (PCA) and K-means clustering to categorize users based on their behaviors, preferences and demographic information ending up with seven (\(k=7\)) clusters characterized with age, genre, the intensity of user’s activity and interaction with streaming platform’s social functions. Kużelewska and Wichowski (2021), however, propose the use of modified DBSCAN clustering algorithms for more efficient Collaborative Filtering recommender systems (based on user’s interaction with the items).

To sum up, literature concerning the clustering of music consumers related data points neither to the most effective method of clustering, nor to a clear optimal number of clusters. In such case, to determine the appropriate number of clusters, the Elbow Method, Silhouette and Calinsky-Harabasz Index will be applied in the following section.

Elbow Method

The elbow plot illustrates explained variation as a function of the number of clusters. According to this method, the optimal number of clusters is determined by identifying th epoint where the marginal gain in reduction in variance begins to diminish significantly. The first substantial drop in variance occurs at \(k=2\), reducing it to 0.74, the slope still continues its significant drop at \(k=3\), reducing the explained variance by 0.13. Another distinct inflection point becomes visible aroun \(k=4\), \(k=5\) – the transition from \(k=4\) to \(k=5\) still offers a noticeable reduction in variance. However, beyond \(k=5\), the slope flattens suggesting that subsequent clusters provide diminishing returns in terms of explaining the data structure.Consequently, the elbow method produces ambiguous results, probing further exploration.

Optimal_Clusters_KMeans(music_scaled, max_clusters = 10, plot_clusters = T)

##  [1] 1.0000000 0.7357964 0.6084016 0.5195449 0.4633887 0.4416606 0.3984118
##  [8] 0.3738682 0.3581979 0.3313506

Silhouette

Exploratory analysis of silhouettte value to determine the optimal number of clusters works as though data had already been separated into a given amount of clusters using a specified clustering method (here k-means is used). The analysis of silhouette plot reveals \(k=4\) as an unequivocally optimal number of clustering. However, it is only higher than \(k=2\), which would offer significantly lower utility. Therefore, this analysis will implement \(k=4\) as the optimal parameter selection.

Optimal_Clusters_KMeans(music_scaled, max_clusters = 10, plot_clusters = T, criterion = "silhouette")

##  [1] 0.0000000 0.2740422 0.2639991 0.2831830 0.2045231 0.2243158 0.2000706
##  [8] 0.1906235 0.1866816 0.1862293

Calinsky-Harabasz Index

The Calinsky-Harabasz Index, a metric of internal validation, measures the ratio of between-cluster variance and within-cluster variance. This index was calculated using the Ward’s linkage method and suggests \(k=2\) as the optimal number of clusters. This result, however, is inconsistent with previously cited literature, as well as the market logic. While being statistically correct, it would not provide informative results.

NbClust(music_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'ward.D', index = 'ch')
## $All.index
##        2        3        4        5        6        7        8        9 
## 2585.229 2286.134 2176.760 2012.294 1927.729 1975.745 1836.711 1717.333 
##       10 
## 1674.566 
## 
## $Best.nc
## Number_clusters     Value_Index 
##           2.000        2585.229 
## 
## $Best.partition
##    [1] 1 1 1 2 2 1 1 2 1 1 1 1 2 1 1 1 1 1 2 1 1 2 1 2 1 2 1 2 2 1 1 2 2 1 2 2 1
##   [38] 2 1 1 1 2 2 2 1 2 1 2 2 1 1 2 1 2 2 1 1 1 1 2 1 2 2 2 2 2 1 2 1 2 1 1 2 2
##   [75] 2 1 2 1 1 1 2 2 2 1 1 1 2 1 1 1 1 2 2 2 2 2 1 2 2 1 1 1 1 2 1 2 1 2 2 1 2
##  [112] 2 1 2 2 2 1 1 1 2 1 2 2 2 2 1 2 2 2 1 1 1 1 1 2 1 1 1 1 2 1 2 2 2 1 1 2 2
##  [149] 2 2 2 2 1 1 2 2 1 1 2 1 2 1 1 2 1 1 1 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 2 1 2
##  [186] 2 2 2 2 1 2 1 1 2 2 2 1 1 2 2 1 1 2 1 1 2 2 1 1 2 2 1 1 2 2 1 2 2 1 1 1 1
##  [223] 2 2 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 1 1 2 2 2 1 2 2 2 1 2 2 1 1 1 2 1 1 2 1
##  [260] 1 2 2 1 2 1 1 2 1 2 2 2 2 1 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 1 1 1 2 1 1 2
##  [297] 2 2 2 2 1 1 1 2 1 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 1 2 2 1 2 1 2 2 1 2 2 2 2
##  [334] 2 1 2 1 2 2 1 2 2 1 1 1 2 1 2 1 1 2 2 2 2 1 2 2 2 1 2 1 1 2 2 2 1 2 1 2 1
##  [371] 2 2 1 1 1 2 2 1 2 2 2 2 2 1 1 1 2 2 1 2 2 2 2 1 1 2 1 2 2 2 2 1 2 1 2 2 1
##  [408] 2 1 2 2 1 2 1 1 1 2 1 2 2 1 2 2 2 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 2 2 2 1 1
##  [445] 1 1 2 1 2 1 2 1 1 2 2 2 2 1 1 2 2 1 2 1 2 1 1 1 1 2 2 1 2 1 2 1 1 2 1 1 1
##  [482] 2 2 1 1 1 2 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2
##  [519] 2 1 1 2 1 1 2 2 1 2 2 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1 1 2 2
##  [556] 1 2 1 1 2 1 2 2 2 2 1 2 1 2 1 2 1 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2
##  [593] 2 1 2 1 2 2 1 2 2 1 1 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 1 1 2 2 2 2 2 2 1 2 2
##  [630] 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 2 1 2 2 2
##  [667] 1 1 2 1 1 1 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 2 2 1 2 2 2 2 1 2 1
##  [704] 1 1 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 1 1 2 1 2 2 2 1 2 2 1 2 1 1 2 2 2 2 2
##  [741] 2 2 1 2 2 2 1 2 2 2 1 1 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 1 2 2 1
##  [778] 2 2 1 2 2 1 2 1 2 1 2 2 2 1 2 2 1 1 2 2 1 2 2 1 1 1 2 2 1 1 2 2 2 1 1 2 2
##  [815] 2 1 2 1 1 1 2 2 1 1 1 2 1 2 2 2 2 1 1 2 1 1 2 1 2 2 1 2 1 1 1 1 2 2 1 2 1
##  [852] 2 2 2 2 1 2 1 2 2 1 1 1 1 2 2 2 1 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 2 2 1 2 1
##  [889] 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 1 2 2 1 1 2 2 2 1 2 2 1 2 2
##  [926] 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1 2 1 1 2 2 2 2 1 1 2 2 2 1 2 1
##  [963] 1 2 2 2 2 2 2 1 2 1 2 2 1 2 2 1 2 1 2 1 2 1 1 2 1 2 2 2 2 2 1 1 2 1 2 1 1
## [1000] 1 2 1 1 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 2 2 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2
## [1037] 2 2 2 1 1 2 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 1 1 2 1 2 1 1 2 1 1 1
## [1074] 2 1 2 2 1 2 1 2 1 2 2 2 2 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 2 2 1 2 2 1 1 2 1
## [1111] 1 2 2 1 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2 1 2 1 1 2 1
## [1148] 1 2 2 1 2 1 1 1 1 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2 1 1 2
## [1185] 1 1 2 1 1 2 2 2 2 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 2 2 2 1 2 1 1 1 2
## [1222] 1 2 2 2 2 2 1 2 1 1 1 1 1 1 2 2 2 1 2 2 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2
## [1259] 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 1 2 2 1 1
## [1296] 1 2 1 2 1 1 2 2 2 2 1 1 2 1 2 1 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 1 2 2 1 2 1
## [1333] 2 2 2 1 2 2 1 2 1 2 1 1 2 1 1 1 2 2 2 2 2 2 1 1 2 1 1 1 2 1 2 2 1 1 1 1 2
## [1370] 1 1 1 1 1 1 1 2 2 1 2 2 2 2 1 1 2 2 2 2 2 1 2 1 1 1 2 2 2 2 2 1 2 2 2 1 2
## [1407] 1 2 2 1 2 2 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1 1 2 1 1 2 1 2 2 1 1 1 2 2 1 2 1
## [1444] 2 1 1 2 2 2 1 2 1 2 1 2 2 2 1 1 1 1 2 1 1 2 2 2 2 1 2 2 2 1 1 1 2 2 2 1 2
## [1481] 2 2 1 1 2 2 2 1 2 2 2 2 1 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 2 2 1 2 2 2 2 2 2
## [1518] 2 1 1 1 1 1 1 1 2 2 2 1 2 1 1 2 2 1 1 2 1 1 2 1 1 1 1 1 1 2 1 2 2 2 2 2 1
## [1555] 2 2 1 1 1 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2 1 2 2 2 2 1 1 2
## [1592] 2 1 2 2 2 2 1 2 1 2 2 2 1 1 2 1 2 1 2 2 2 1 2 1 2 1 1 2 2 1 2 2 2 2 1 1 2
## [1629] 2 2 2 2 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 2 1 1 2 1
## [1666] 1 1 2 2 2 1 1 1 2 1 2 2 2 1 1 1 1 2 2 2 2 2 1 1 1 2 2 2 1 2 1 2 1 2 2 2 2
## [1703] 1 2 2 2 1 1 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 2
## [1740] 2 2 1 2 2 1 1 2 1 1 1 1 2 1 2 2 2 2 2 2 1 2 1 1 2 2 2 2 1 2 2 1 1 2 2 2 2
## [1777] 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 2 2 2 1 2 1 1 1 1 1 2 2 2 2 2 1 1 2 2 2 2 2
## [1814] 2 1 2 1 1 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 2 2 2 1 1 1 1 1 2 1 2 1 1
## [1851] 1 2 1 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 1 2 2 2 2 2 1 2 1 2 1 1 2 2 2
## [1888] 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 2 2 1 1 1 2 1 1 2 1 2 2 2 1 2 1 2
## [1925] 2 1 2 2 1 2 1 1 2 2 2 2 1 2 2 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 2 2 2 1 1 1 2
## [1962] 2 1 1 1 1 2 2 1 2 1 2 2 1 1 2 2 2 1 1 1 2 2 1 2 2 1 1 2 1 2 2 2 1 1 2 2 1
## [1999] 2 2 1 1 1 2 2 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 1 2 2 2
## [2036] 2 1 1 1 2 2 1 2 2 1 1 2 2 1 2 1 1 2 1 1 1 1 2 1 2 1 1 1 2 2 2 1 1 2 2 2 1
## [2073] 2 1 2 1 2 2 1 2 2 2 2 1 2 2 1 2 2 1 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2 1
## [2110] 2 2 2 1 2 2 1 2 1 1 2 2 1 1 1 2 2 2 2 1 2 1 1 2 2 1 1 2 1 2 2 1 2 2 2 2 2
## [2147] 2 2 1 2 1 2 2 2 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 1 2 2 1 2 1 2 2 2 1 1 1 1 2
## [2184] 2 2 1 2 1 2 1 2 2 1 2 1 2 2 1 2 2 1 2 2 1 1 2 2 2 1 2 2 2 2 1 2 1 1 2 2 1
## [2221] 1 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 2 2 2 2 2 2 1 2 2 2 2
## [2258] 2 2 2 2 1 1 2 1 2 2 1 1 1 2 1 2 2 2 2 2 1 1 2 2 2 2 2 1 1 2 2 2 2 2 2 1 2
## [2295] 2 2 1 1 1 2 1 1 1 2 1 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 1 2
## [2332] 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2 1 1 2 2 2 1 2 2 2 2 2 2 1 1 1
## [2369] 1 2 2 2 2 1 1 1 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 1 1 2 2 1 1 1 2 2 2 2
## [2406] 2 1 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 1 1 2 2 1 2
## [2443] 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 1 2 1 2 2 2 1
## [2480] 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 1 1 2
## [2517] 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 2 2 2 2 1 1 2 1 2 2 1 1 1 2 2 2 2 2 1 1 2
## [2554] 2 1 2 2 2 1 1 1 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 1 1 2 2 1 2 1 2 1 2 1 2 1 2
## [2591] 2 2 2 2 2 1 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 1 2 2 1 2 1 2 2 2 2 1 2 2 2 1 1
## [2628] 1 1 2 2 1 1 1 1 2 1 2 1 1 2 1 2 1 1 2 2 1 2 2 1 1 2 2 2 1 2 2 2 2 2 2 2 2
## [2665] 2 2 1 2 2 2 2 1 1 2 1 2 2 2 2 1 2 1 1 1 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1
## [2702] 2 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 1 2 2 2 2 2 2 1 1 1 2 1 2 2 1 2
## [2739] 1 2 2 1 2 1 1 2 1 2 2 1 2 1 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 1 2 2 2 2 1 1 2
## [2776] 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 1 2 2 1 2 2 2 2 2 2 1 2 2 1 1 2 2 2 1 1 2
## [2813] 2 2 2 2 1 2 2 2 1 2 1 2 1 2 2 2 1 1 2 2 2 1 1 2 2 2 2 2 1 1 2 2 2 2 1 2 2
## [2850] 1 2 1 1 2 1 1 2 2 1 2 1 2 2 1 1 2 1 2 1 1 2 1 2 1 2 1 1 1 1 1 2 1 2 1 1 2
## [2887] 1 2 1 1 2 2 1 2 1 1 2 1 1 1 1 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 2 2 2 1 1 2 1
## [2924] 1 2 2 2 1 2 1 2 2 1 1 2 1 2 1 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1
## [2961] 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2 1 1 1 2 1 2 2 2 1 1 2 2 2 1 1 1
## [2998] 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 2 2 2 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1
## [3035] 2 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 2 2 1
## [3072] 2 1 2 1 1 1 2 2 1 2 2 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 2
## [3109] 2 2 2 2 1 2 1 1 1 2 2 1 1 1 1 2 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 2 1 2 1 2 1
## [3146] 1 2 2 2 2 2 1 2 2 2 2 1 1 1 2 2 1 2 1 1 1 1 1 1 2 2 1 2 2 1 1 2 2 1 2 1 2
## [3183] 1 2 2 2 1 2 2 1 2 2 1 2 1 2 1 1 1 2 1 2 1 2 1 2 1 2 2 1 2 1 1 2 2 1 2 2 1
## [3220] 1 2 1 2 1 1 2 2 1 2 1 1 2 1 2 2 1 2 2 2 1 2 1 1 1 2 1 2 2 1 1 2 1 2 2 2 1
## [3257] 2 2 1 2 2 2 1 2 2 2 1 2 1 2 1 1 1 2 2 1 1 1 2 1 2 2 1 2 1 2 2 2 1 1 1 2 1
## [3294] 1 2 1 1 1 2 2 1 2 2 1 2 2 2 2 2 2 1 2 2 2 2 1 1 1 2 1 2 2 2 1 2 2 2 2 2 2
## [3331] 2 1 2 1 2 2 2 1 2 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 2 2 2 1 1 2 2 1 2 2 1 2 1
## [3368] 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2 2 1 2 1 2 2 2 2
## [3405] 1 2 2 1 2 1 2 2 2 1 1 2 2 1 2 1 2 1 1 1 1 1 1 2 1 2 2 1 2 2 2 2 2 2 2 1 2
## [3442] 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2 1 2 1 2 1 2 1 2 2 1 2 1 2 2 2 2 1
## [3479] 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 2 1 1 1 1 2 1 2 1 1 2 1 2 1 2 2 2 2 2
## [3516] 2 2 1 2 2 1 2 1 1 1 1 2 1 2 1 2 1 2 2 1 2 1 1 1 1 2 1 1 2 2 2 2 1 2 1 1 2
## [3553] 1 2 1 2 2 2 2 2 2 1 2 1 1 1 1 1 2 1 1 1 2 1 2 1 2 2 2 2 1 1 1 1 1 2 2 2 2
## [3590] 1 1 1 1 2 2 2 2 1 2 1 2 2 2 2 1 1 2 1 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1
## [3627] 2 2 1 1 1 1 1 2 2 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2
## [3664] 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2 1
## [3701] 1 2 2 1 2 2 2 2 1 2 2 1 2 2 2 2 1 1 1 2 2 1 2 1 2 2 1 2 1 1 1 2 2 2 2 1 1
## [3738] 2 2 2 2 1 2 1 2 2 2 1 2 2 1 2 2 2 2 2 2 1 1 1 2 2 1 2 2 2 1 2 1 1 2 1 1 1
## [3775] 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 1 1 2 2 2 2 2 2 2 2 1
## [3812] 1 2 2 1 2 2 2 1 2 2 2 2 1 1 1 2 1 2 2 1 1 2 2 1 2 2 2 2 1 2 2 2 2 1 1 1 1
## [3849] 2 1 1 1 1 2 1 2 2 2 1 2 2 1 1 2 2 1 1 2 1 2 2 2 1 1 2 2 2 2 1 2 1 2 1 2 2
## [3886] 2 2 2 2 1 2 2 2 1 2 1 2 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 2 1 1 2 1 2 1 2 2 1
## [3923] 2 1 2 2 1 2 1 1 2 2 1 1 1 2 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 2
## [3960] 1 1 2 1 2 2 2 2 2 1 1 1 1 2 1 2 2 1 2 1 2 2 2 1 1 2 1 1 1 1 2 1 1 1 2 1 2
## [3997] 1 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2 2 1 2 2 2 1 2
## [4034] 2 1 2 1 1 2 2 1 1 2 1 2 1 2 1 1 2 1 1 1 1 2 2 1 2 1 1 2 1 1 2 1 2 2 1 2 2
## [4071] 2 2 2 2 2 1 2 2 2 1 1 1 2 2 2 1 2 1 1 1 2 2 2 1 1 1 1 1 1 1 2 2 1 2 2 1 2
## [4108] 1 1 1 1 1 2 2 2 2 1 2 2 1 2 2 1 1 1 1 1 1 1 2 1 2 2 2 2 2 2 2 1 1 2 1 2 1
## [4145] 1 2 2 2 1 2 1 1 1 1 1 1 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 1 2 2 1 2 2 2 2 1 1
## [4182] 1 1 1 1 1 2 2 1 1 1 1 2 2 2 1 2 2 2 2 1 2 2 2 2 1 2 1 1 2 2 2 2 2 2 2 2 2
## [4219] 2 1 2 2 1 1 2 2 1 1 1 1 2 2 1 2 2 2 1 1 1 2 2 1 2 2 1 2 2 2 1 1 1 2 1 2 2
## [4256] 2 2 2 1 1 1 1 2 1 2 1 2 2 2 2 2 2 2 1 1 1 2 2 1 2 1 2 2 2 1 1 1 1 2 1 2 1
## [4293] 1 2 2 2 1 1 2 2 1 2 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 1 2 2 1 1 2 1 2 1 1
## [4330] 2 2 1 1 2 2 2 2 2 2 2 1 1 2 1 1 2 2 2 2 1 2 2 1 2 2 2 1 2 2 1 2 2 2 2 2 1
## [4367] 2 2 2 1 1 2 1 2 1 2 1 1 2 1 2 1 2 1 2 2 2 1 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2
## [4404] 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 2 1 1 1 1 2 1 1 2 1 1 2 2 2 1 1 2 1 2 1 2 2
## [4441] 2 2 2 2 1 1 1 2 1 1 1 2 2 2 1 2 1 1 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 1 1 1 1
## [4478] 2 1 2 1 1 1 2 2 2 2 1 1 2 1 2 1 1 1 2 1 2 2 1 2 2 2 2 2 1 2 2 2 2 2 2 1 2
## [4515] 2 1 2 1 1 2 1 2 1 1 2 2 2 1 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 1 2 1 2 1 1 2 1
## [4552] 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 2 2 1
## [4589] 1 2 2 2 1 2 1 2 1 2 2 2 1 1 2 1 1 1 2 1 2 1 1 1 1 2 2 2 2 2 1 2 2 1 2 1 2
## [4626] 1 1 1 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 1 1 1 2 2 2 2 2 2 1 1 2 2 1 1 1 1 2 1
## [4663] 1 1 2 2 2 2 2 1 1 2 2 1 1 2 2 2 2 1 2 2 2 1 2 2 1 2 2 1 2 2 2 2 1 2 2 2 2
## [4700] 2 2 2 1 2 2 2 2 1 2 1 2 2 1 1 2 1 1 2 1 2 1 2 2 2 2 2 1 1 2 1 2 1 2 1 1 1
## [4737] 2 2 2 1 2 1 2 1 1 1 1 1 2 2 2 1 1 1 2 2 1 2 2 2 2 1 2 1 2 2 2 2 2 1 2 1 2
## [4774] 2 1 1 2 2 2 1 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 2 1 2 1 2 1 1 2 2 1 1 1 2 2
## [4811] 1 2 1 2 1 1 2 1 1 2 2 2 1 2 2 1 1 2 1 2 1 2 2 1 2 2 2 1 2 1 1 2 2 2 2 1 2
## [4848] 2 2 2 2 2 1 1 2 1 2 1 2 1 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 1 1 2 2 1 1 2 2 2
## [4885] 1 1 1 1 2 2 1 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2 1 2 1 2 1 2 2 1 1 1 1 2 1 1 1
## [4922] 2 2 2 2 1 2 2 1 2 1 1 2 1 1 2 2 2 2 1 1 1 2 1 2 1 1 2 1 1 1 1 1 2 2 1 2 2
## [4959] 2 2 1 1 1 1 2 1 1 2 2 1 2 1 1 1 2 1 2 1 2 2 2 2 2 2 2 1 1 2 2 2 2 2 2 2 2
## [4996] 1 2 2 2 1 2 2 2 2 2 2 1 2 2 2 2 2 1 1 2 2 2 1 1 2 1 2 1 2 1 1 2 2 2 1 1 2
## [5033] 1 2 2 1 1 2 1 2 2 2 2 2 1 2 2 2 2 2 1 2 1 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [5070] 2 2 2 2 2 1 2 2 2 2 2 2 1 1 1 1 2 2 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1 2 1 2 1
## [5107] 2 1 2 1 1 2 1 2 2 2 1 2 1 1 1 1 2 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 1 2 2 2 1
## [5144] 1 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 1 1 2 1 1 2 1 1 1 2 1 1 1 2 1 2
## [5181] 1 2 2 2 2 2 1 1 1 1 2 2 2 2 2 1 2 2 2 1 1 2 1 2 1 2 1 1 1 1 2 2 2 1 2 2 2
## [5218] 1 2 2 1 2 2 2 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 2 1 1 1 2 1 2 1 1 2 1 2 2 1 2
## [5255] 2 1 1 2 2 1 2 1 1 2 1 2 1 1 2 2 1 1 2 2 1 2 1 2 1 1 1 1 2 1 2 1 2 2 1 1 1
## [5292] 2 2 2 2 2 2 1 2 1 1 2 2 1 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 2 2 2 2 2 1
## [5329] 1 2 2 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 1 2 1 2 2 2 2 2 2 2 2 2 2
## [5366] 2 2 1 2 2 2 2 2 2 2 2 1 2 1 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 2 2 1 2 2 2 2 1
## [5403] 2 2 1 2 1 2 2 1 2 2 1 1 1 2 1 2 1 2 2 1 2 1 2 1 2 2 1 1 2 2 1 2 2 2 2 2 1
## [5440] 2 1 2 2 1 1 2 2 1 1 2 2 2 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2
## [5477] 1 1 2 2 2 2 2 1 2 1 2 2 2 2 1 2 1 2 1 1 1 1 1 1 1 1 1 1 1 2 2 2 1 1 2 1 1
## [5514] 2 1 2 2 1 2 1 2 1 2 1 2 2 1 2 2 1 2 1 1 2 2 1 2 1 2 1 2 1 2 2 2 1 2 2 2 1
## [5551] 2 1 1 2 1 2 2 1 2 2 2 1 1 1 2 2 2 2 1 2 2 1 2 2 1 2 1 1 1 1 1 2 2 1 1 2 1
## [5588] 2 1 1 2 1 2 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 2 1 2 2 2
## [5625] 2 1 2 2 2 1 1 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 1 1 2 2 2 2 2
## [5662] 2 2 2 2 2 2 2 2 1 1 2 2 1 2 1 1 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2
## [5699] 1 1 1 2 2 1 2 2 2 2 1 2 2 2 2 1 1 2 2 2 2 1 2 2 2 1 1 2 1 2 2 2 1 1 2 1 2
## [5736] 1 2 1 1 1 2 2 1 2 2 2 1 2 1 1 2 1 2 2 2 1 2 2 2 1 2 1 2 1 2 2 2 1 2 2 1 2
## [5773] 2 2 2 2 2 1 2 2 1 1 1 1 1 2 2 2 2 2 2 2 2 1 1 1 2 2 1 1 2 1 2 1 2 1 1 1 2
## [5810] 2 1 1 2 2 2 1 1 2 1 1 1 1 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 1 2 2 2 2 2 2 1 1
## [5847] 2 2 1 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 2 1 1 2 2 1 2 2 1 1 2 1 1 1 2 2 1 1
## [5884] 2 2 2 2 2 1 2 2 1 2 2 1 2 2 1 2 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 1 2 1 1 2 1
## [5921] 1 2 2 1 2 2 1 2 2 1 2 2 2 2 2 1 2 2 2 1 2 1 2 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [5958] 2 2 2 2 2 2 2 2 2 2 1 2 1 2 1 2 1 2 1 2 2 2 1 1 2 1 2 2 2 1 1 2 1 2 1 2 1
## [5995] 2 2 1 1 1 2 2 1 2 1 1 2 1 2 1 2 1 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1 2 2 2 2 2
## [6032] 2 2 2 2 2 2 1 1 1 2 2 1 1 2 2 1 2 2 1 1 1 1 1 1 2 2 1 1 2 2 1 2 1 2 1 2 2
## [6069] 2 1 2 1 1 2 1 1 2 2 2 1 2 2 1 2 2 2 2 2 2 1 2 1 2 2 2 1 1 2 2 1 1 2 2 2 1
## [6106] 2 2 1 2 2 2 2 2 1 1 2 1 1 1 2 2 1 2 1 2 2 1 2 2 2 1 1 1 1 2 1 2 2 1 1 2 1
## [6143] 2 2 1 2 2 2 1 2 2 2 1 2 2 2 2 2 1 2 1 2 2 1 2 2 2 2 1 1 1 1 1 1 1 1 2 2 1
## [6180] 1 2 1 1 2 2 1 1 2 1 1 1 1 1 2 1 2 2 1 1 2 2 1 2 2 2 2 2 2 1 1 1 2 2 1 2 2
## [6217] 2 2 1 2 2 2 1 2 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 1
## [6254] 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 2 2 2 2 2 1 2 2 1 1 2 1 1 2 2 2 2 2 2
## [6291] 1 1 1 1 1 2 2 1 2 1 2 2 1 2 2 1 1 2 1 2 2 2 2 2 2 2 1 2 2 2 2 1 2 1 2 2 1
## [6328] 2 1 1 2 2 2 2 2 2 2 2 1 2 2 1 2 1 1 2 1 2 2 1 2 2 2 1 2 2 2 2 2 1 2 2 1 2
## [6365] 2 2 1 2 2 1 2 1 1 2 2 2 2 2 1 1 1 2 2 2 1 1 1 1 2 2 2 1 2 2 1 2 1 1 1 1 1
## [6402] 1 2 2 2 1 2 1 2 2 2 1 1 1 2 2 2 2 2 2 1 2 1 1 1 2 2 2 2 2 2 1 2 2 1 1 2 2
## [6439] 1 2 1 2 2 1 2 2 2 1 1 1 2 2 2 2 1 2 1 2 2 2 1 2 2 2 2 2 2 2 1 2 2 1 2 2 2
## [6476] 2 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 2 2 2 2 2 1 1 2 1 2 1 2 2 2 2
## [6513] 2 2 2 1 2 1 1 1 2 1 2 1 2 1 2 2 1 2 2 2 2 2 2 1 1 2 2 2 1 2 2 2 1 2 2 2 1
## [6550] 1 2 1 2 2 2 2 2 1 1 2 2 1 2 2 1 1 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 1
## [6587] 2 1 2 1 1 1 2 2 2 1 2 2 2 2 1 1 1 2 2 1 2 1 2 2 1 1 1 2 1 1 1 2 2 2 2 1 2
## [6624] 1 2 2 2 1 2 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 1 2 2 1 1 2 1 1 2 2 2 2 2 1 2 2
## [6661] 1 2 2 2 2 2 1 2 2 1 1 2 1 2 2 2 2 2 1 1 1 1 1 1 2 2 2 2 2 2 1 1 1 1 1 1 1
## [6698] 2 1 1 1 2 2 2 2 2 2 1 2 2 2 2 2 1 2 2 2 1 2 2 1 2 2 1 2 2 1 2 2 1 1 1 1 2
## [6735] 2 2 2 1 2 2 2 1 2 1 2 2 1 2 2 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 2 2 2 2 2 2 1
## [6772] 2 1 1 1 2 2 2 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 2 1 2 2 2 2 2 2 1 2 2 2 1 1 2
## [6809] 2 2 1 1 1 2 1 1 1 2 2 2 2 2 1 2 1 2 1 2 1 2 2 2 2 2 1 2 2 2 2 1 2 2 1 2 2
## [6846] 2 1 2 2 2 2 2 2 2 2 2 2 1 1 1 1 2 1 1 2 2 2 2 2 2 2 1 1 2 2 1 2 2 2 1 1 2
## [6883] 1 2 1 2 1 2 2 2 1 1 1 2 1 1 1 1 2 2 2 2 2 1 1 2 1 1 2 2 1 1 2 2 2 2 1 2 1
## [6920] 2 2 2 2 1 1 2 1 1 1 2 1 2 1 2 2 2 2 2 2 2 1 2 1 2 1 2 1 1 2 2 2 2 1 2 2 1
## [6957] 2 2 1 2 2 1 1 1 2 1 1 2 2 1 1 2 1 2 2 2 2 1 2 1 2 2 1 1 1 2 2 2 1 2 1 2 2
## [6994] 1 2 2 2 2 1 2 2 1 2 2 2 2 2 1 2 2 2 1 1 1 2 1 2 1 1 2 1 1 2 1 1 2 1 2 2 2
## [7031] 1 2 1 1 2 1 1 1 2 2 2 2 2 1 2 1 1 2 1 1 1 1 1 2 2 2 1 1 1 1 2 1 2 1 2 1 1
## [7068] 2 2 1 1 2 2 2 1 1 1 2 2 2 1 2 2 1 2 1 2 1 1 2 1 2 1 2 1 2 2 2 2 2 2 2 2 2
## [7105] 2 1 2 1 2 1 2 1 1 2 2 1 1 2 2 1 1 2 1 1 2 2 2 1 2 1 1 2 2 2 2 1 1 2 1 2 2
## [7142] 2 1 2 2 1 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 2 2 2 2 2 1 1 2 1 2 1 2 2 1 2 2
## [7179] 2 1 2 1 2 1 2 1 2 2 1 2 2 2 2 2 2 1 1 1 1 2 1 2 2 2 2 1 2 2 2 2 2 2 2 2 2
## [7216] 1 2 1 1 1 2 1 2 2 2 2 1 1 2 1 1 2 2 1 1 2 1 2 1 2 2 2 2 2 2 2 2 2 1 2 2 1
## [7253] 1 1 1 2 1 2 2 2 1 2 1 2 2 1 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 2 1 2 2 2 2 2 2
## [7290] 1 2 1 2 1 1 1 2 2 1 2 2 2 1 1 2 1 2 1 2 2 2 1 2 2 2 2 1 2 1 2 1 2 2 1 2 2
## [7327] 2 2 1 2 1 1 2 2 1 2 1 1 2 2 2 1 1 2 1 2 1 2 2 1 2 1 2 1 2 2 2 2 2 1 2 2 1
## [7364] 2 2 1 2 2 2 1 2 2 1 2 2 2 1 1 2 2 2 1 1 2 2 1 2 1 2 2 2 2 2 2 2 2 2 2 2 2
## [7401] 1 2 1 2 2 1 1 1 2 2 1 2 1 1 2 2 2 1 2 2 2 2 2 1 1 1 1 2 1 1 1 2 1 1 1 2 1
## [7438] 2 1 2 2 1 2 2 1 1 1 1 2 2 2 2 1 2 1 2 2 2 2 2 2 1 1 2 1 1 2 2 2 2 2 1 2 2
## [7475] 1 1 1 1 2 1 1 1 1 2 2 2 2 1 2 2 1 1 1 2 2 2 2 1 1 2 2 2 2 2 2 1 2 2 1 1 1
## [7512] 1 2 2 2 1 2 1 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 1 1 2 1 1 2 2 1 2 1 2 2 2 2 2
## [7549] 2 1 2 2 2 2 2 1 1 2 2 1 2 1 2 2 1 2 2 2 1 2 2 1 2 1 1 1 1 2 2 2 2 1 1 2 1
## [7586] 2 1 1 2 1 1 2 2 1 2 2 2 1 1 2 1 1 1 1 2 1 1 2 2 2 2 2 2 1 2 1 1 1 2 1 2 2
## [7623] 2 1 2 2 2 2 1 2 1 1 2 1 1 1 2 1 2 2 1 2 1 1 1 2 2 2 2 1 1 2 1 1 1 1 2 2 1
## [7660] 1 1 2 1 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 1 2
## [7697] 1 2 2 2 1 2 1 2 2 1 2 2 1 2 2 2 2 1 1 2 2 1 1 2 2 1 1 2 2 2 2 2 1 2 1 1 2
## [7734] 2 1 2 1 1 2 2 2 1 1 1 2 2 2 1 2 1 1 2 2 1 1 2 1 2 2 2 2 1 1 1 1 1 2 1 2 1
## [7771] 2 2 2 2 2 1 1 1 1 1 1 1 2 1 1 2 1 2 2 2 2 2 2 1 2 2 1 2 2 2 2 1 1 1 2 1 1
## [7808] 2 2 2 1 2 1 2 2 1 2 2 2 2 1 2 1 1 2 1 2 1 1 2 2 1 1 2 2 1 2 2 1 2 2 2 1 1
## [7845] 1 2 2 2 2 1 2 1 1 2 2 1 2 1 1 2 1 1 1 1 2 1 1 2 2 1 1 1 1 2 2 2 1 1 1 2 1
## [7882] 1 1 1 1 2 1 2 2 2 1 1 2 2 2 1 1 1 2 1 1 2 1 2 1 2 1 2 2 1 1 2 1 1 2 1 1 2
## [7919] 2 1 1 2 2 2 1 2 1 2 1 2 2 2 1 1 1 1 2 2 1 2 2 1 2 2 1 1 1 1 2 2 1 2 1 2 2
## [7956] 1 2 2 1 2 1 2 2 2 2 2 2 2 2 1 1 2 2 2 1 1 1 1 1 1 1 2 2 2 1 1 2 1 2 2 2 2
## [7993] 2 1 2 2 1 2 1 1 1 1 1 1 1 2 2 1 1 2 2 2 2 2 1 2 2 2 2 2 1 1 2 1 2 1 2 2 1
## [8030] 1 2 1 2 1 2 2 2 1 1 2 2 1 2 2 1 2 1 1 2 1 1 2 2 1 1 1 2 1 2 1 2 1 2 2 1 1
## [8067] 2 2 1 2 2 1 2 1 2 1 1 2 2 1 2 2 2 1 1 1 1 2 2 2 2 2 2 1 1 1 2 1 1 1 2 2 1
## [8104] 2 1 1 2 1 1 1 2 1 2 2 2 1 2 2 1 2 2 1 1 2 1 1 2 1 2 2 2 1 2 2 2 2 1 2 1 1
## [8141] 2 1 2 2 2 2 2 1 2 1 1 1 2 1 2 1 1 1 2 2 2 2 1 1 2 2 2 1 1 1 1 1 1 1 1 2 2
## [8178] 2 1 2 2 2 1 1 1 2 2 2 2 2 1 2 1 2 1 1 2 1 1 2 2 2 1 2 2 1 2 1 2 1 1 2 2 1
## [8215] 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 1 2 2 2 2 2 1 1 2 1 1 2 1 2 1 1 1 1 1 2 2 1
## [8252] 1 2 1 1 1 2 2 2 1 2 2 2 1 1 2 2 1 1 2 2 2 1 1 1 2 2 1 2 2 1 2 1 1 1 2 1 2
## [8289] 2 2 2 2 1 2 2 2 2 1 2 2 1 2 1 2 2 1 2 2 2 2 2 1 1 2 1 2 2 2 2 1 1 2 2 1 1
## [8326] 1 1 1 1 2 2 2 2 2 1 1 1 1 2 2 2 1 1 2 2 2 1 2 2 1 1 2 1 1 2 1 1 1 2 2 1 2
## [8363] 2 1 1 1 1 2 2 2 1 1 2 2 1 2 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 1 1 1 2 2 1 2 2
## [8400] 2 1 2 2 1 2 1 2 1 1 2 2 1 2 1 2 2 1 1 1 2 2 2 2 1 1 2 2 2 2 1 2 1 1 1 2 2
## [8437] 1 1 1 1 1 2 1 1 1 2 2 1 1 2 1 1 1 1 2 2 1 1 1 2 1 1 1 2 1 1 1 2 1 2 2 2 2
## [8474] 1 2 2 2 2 1 2 1 2 1 1 2 1 2 1 1 1 1 1 2 1 2 2 2 1 1 1 2 2 1 2 1 2 1 2 2 2
## [8511] 2 1 1 2 2 2 2 1 2 1 1 2 1 1 1 1 2 1 2 2 2 1 2 1 1 1 1 1 1 1 2 2 1 1 1 1 2
## [8548] 2 2 2 1 1 2 1 1 2 1 1 1 1 1 2 2 2 2 1 1 2 1 2 1 1 2 2 1 2 1 1 2 1 1 2 2 2
## [8585] 2 1 2 1 2 2 2 2 1 2 2 2 1 2 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 2 2 2 1 2 1 1 2
## [8622] 1 1 1 1 1 2 1 2 2 1 1 2 2 2 1 1 2 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 1
## [8659] 2 1 1 1 2 1 2 2 1 2 2 1 2 1 1 1 1 2 2 2 2 2 2 1 2 1 2 2 1 1 1 1 2 2 1 2 2
## [8696] 1 1 2 1 1 2 1 2 2 2 2 2 2 2 2 2 2 1 2 2 1 1 2 2 2 2 2 1 1 2 1 2 2 2 1 2 1
## [8733] 1 2 2 1 1 1 2 1 1 2 2 1 1 1 2 2 2 1 2 1 1 1 2 2 1 2 1 1 2 1 1 2 1 2 1 2 1
## [8770] 2 1 1 1 2 2 1 2 1 2 1 1 1 1 2 1 2 1 2 1 1 1 1 2 2 2 1 1 2 2 1 1 1 2 2 1 1
## [8807] 1 1 1 1 2 1 2 1 1 1 2 2 2 2 2 2 2 2 2 2 1 2 2 2 2 1 1 1 1 2 1 1 1 1 1 2 2
## [8844] 1 1 1 2 1 1 1 2 2 1 1 1 1 2 2 2 1 2 2 1 2 2 2 2 1 1 2 1 1 1 1 1 1 2 2 1 2
## [8881] 2 1 2 1 2 2 1 1 1 1 2 1 1 2 2 1 2 1 2 2 2 1 2 2 2 1 2 1 2 2 1 1 1 1 1 2 2
## [8918] 1 2 1 1 1 2 1 1 2 1 2 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 2 2 2 1 2 1 2 1 1 2
## [8955] 1 1 1 1 2 1 1 2 2 1 1 2 1 2 2 2 1 2 1 2 2 1 1 1 1 1 2 1 2 1 1 2 1 2 2 2 2
## [8992] 2 1 2 1 1 1 2 2 2 2 1 2 1 1 2 2 1 1 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 1 2 2 1
## [9029] 2 1 2 2 1 1 1 1 2 1 1 1 1 2 1 1 1 1 1 2 2 1 2 1 1 2 2 2 1 2 2 1 1 1 2 1 1
## [9066] 2 2 1 2 1 1 2 2 1 2 1 1 2 1 2 1 1 2 2 1 2 1 2 2 2 2 2 2 2 1 1 1 1 1 2 1 2
## [9103] 1 1 2 2 1 1 1 1 2 2 1 2 2 1 1 2 1 1 1 1 1 1 1 2 2 1 1 1 2 2 2 2 1 1 1 1 2
## [9140] 2 1 1 2 1 1 2 1 2 2 1 2 1 1 2 1 2 2 1 1 2 2 1 2 1 1 2 2 2 2 1 2 2 2 1 1 2
## [9177] 2 2 1 2 2 2 2 1 2 2 2 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 1 1 2 2 2 2 1 1 1 2 2
## [9214] 2 1 1 2 1 1 2 1 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1 1 2 1 1 1 1 1
## [9251] 1 1 1 1 2 1 1 1 2 1 2 1 2 1 1 1 1 1 1 2 2 1 2 1 1 1 1 2 2 1 1 1 2 2 2 1 2
## [9288] 1 1 2 1 2 2 1 2 1 2 1 1 1 1 1 1 1 1 2 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 2 2 2
## [9325] 2 1 2 1 2 1 1 2 1 2 1 2 1 2 1 2 1 1 1 2 1 1 2 1 1 1 1 1 2 1 1 2 1 2 1 1 1
## [9362] 1 1 1 1 1 2 1 1 1 1 1 2 1 1 1 1 2 2 1 1 2 1 2 1 1 1 1 1 2 1 2 1 1 2 1 1 2
## [9399] 1 2 1 1 1 2 1 2 1 2 1 2 2 1 1 2 2 1 2 1 2 2 2 2 1 1 2 1 1 2 2 1 2 1 2 2 2
## [9436] 2 2 2 2 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 2 1 1 1 1 1 2 2 1 2 1 1 1 1 2 1 1 1
## [9473] 1 2 1 1 1 1 1 2 1 1 2 1 1 2 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 2 2 2 2 1 1 2 1
## [9510] 2 2 1 1 1 1 1 2 2 2 1 1 2 1 2 2 1 2 1 1 2 1 1 1 1 2 1 1 2 1 2 2 2 2 1 2 1
## [9547] 2 2 2 1 2 1 2 1 2 2 2 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 2 1 2 1 2 1 1 2 1
## [9584] 1 1 1 2 2 1 1 1 1 2 1 1 2 1 1 1 1 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 1 1 1 2
## [9621] 1 1 2 1 2 1 2 1 1 1 1 1 1 1 2 2 1 1 2 2 1 2 1 1 1 1 2 1 2 1 2 1 2 1 1 1 1
## [9658] 1 1 2 1 1 1 2 2 1 2 2 2 2 1 2 1 1 2 1 2 1 1 1 2 2 1 1 1 1 1 2 1 1 1 2 2 1
## [9695] 1 2 1 1 1 1 1 1 2 1 1 2 1 1 1 1 1 2 2 2 1 2 2 1 2 1 2 1 1 1 2 2 1 2 1 1 1
## [9732] 1 2 2 1 1 2 1 1 1 1 1 1 1 1 1 1 2 1 1 2 2 1 1 1 2 1 2 2 1 1 1 2 2 1 2 1 2
## [9769] 2 1 2 1 1 2 1 1 2 1 2 2 1 1 1 2 1 1 2 1 1 1 1 1 1 1 1 2 1 2 1 1 2 1 1 1 1
## [9806] 1 1 1 1 1 2 2 1 2 1 2 1 1 2 2 1 2 1 1 1 1 1 1 1 1 1 2 2 2 2 1 1 1 1 2 1 2
## [9843] 2 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 2 1 1 1 2 1 1 1 2 1
## [9880] 1 1 1 1 1 1 1 1 1 1 1 1 1 2 1 1 2 1 2 1 2 1 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1
## [9917] 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1 1 2 1 1 1 1 1 1 1 1 1 2 1 2 1 1 1 1 1 1 1
## [9954] 1 1 1 1 2 1 1 1 1 1 2 1 2 1 2 1 2 2 1 1 1 1 1 1 2 1 1 2 1 1 1 2 1 1 1 1 1
## [9991] 2 1 1 1 2

Internal Validation

For the same sample of data that the Hopkin’s statistic was previously computed for, internal validation metrics, such as connectivity, Dunn index and silhouette width are computed. Internal validation evaluates the quality of clusters using only the dataset itself. These are computed for k-means, hierarchical and clara clustering methods (PAM deemed inefficient due to the size of the dataset). The results of the test suggest that hierarchical clustering optimal method for the examined dataset, significantly outperforming partitioning methods such K-Means and CLARA. The optimal number of clusters is identified, again, as \(k=2\), achieving the highest Silhouette Width and Dunn Index, suggesting, respectively, a reasonable structure and distinct cluster separation. Furthermore, the Connectivity score for hierarchical clustering was drastically lower (by over 160 points for both instances), than K-Means and CLARA, confirming superior local compactness.

methods <- c('kmeans', 'hierarchical','clara')

validation <- clValid(data_sample, nClust = 2:10, clMethods = methods, validation = 'internal', maxitems = 100000)
## Warning in clValid(data_sample, nClust = 2:10, clMethods = methods, validation
## = "internal", : rownames for data not specified, using 1:nrow(data)
summary(validation)
## 
## Clustering Methods:
##  kmeans hierarchical clara 
## 
## Cluster sizes:
##  2 3 4 5 6 7 8 9 10 
## 
## Validation Measures:
##                                   2        3        4        5        6        7        8        9       10
##                                                                                                            
## kmeans       Connectivity  176.8560 258.2147 290.2690 384.2889 423.6881 499.3171 468.0488 494.7861 511.9230
##              Dunn            0.0332   0.0295   0.0343   0.0334   0.0369   0.0386   0.0273   0.0258   0.0323
##              Silhouette      0.2699   0.2650   0.2685   0.2109   0.2062   0.1992   0.2007   0.1958   0.2013
## hierarchical Connectivity    7.4659   7.4659  33.5329  34.4119  77.9595  84.7270  88.6738 131.6278 132.6540
##              Dunn            0.2526   0.2526   0.0946   0.0946   0.0967   0.0967   0.0967   0.0935   0.0935
##              Silhouette      0.5870   0.4941   0.3681   0.3050   0.2056   0.1763   0.1642   0.1300   0.1014
## clara        Connectivity  196.3353 293.0639 362.8123 478.9690 490.7905 563.6849 653.9552 640.4742 697.4786
##              Dunn            0.0268   0.0239   0.0251   0.0297   0.0257   0.0277   0.0194   0.0333   0.0333
##              Silhouette      0.2285   0.2240   0.2149   0.1580   0.1701   0.1640   0.1299   0.1512   0.1348
## 
## Optimal Scores:
## 
##              Score  Method       Clusters
## Connectivity 7.4659 hierarchical 2       
## Dunn         0.2526 hierarchical 2       
## Silhouette   0.5870 hierarchical 2

Clustering

In this section various clustering algorithms will be implemented and analyzed in terms of, both, their statistical purity and utility in terms of efficient consumer segmentation providing helpful insight for either recommendation systems or market research. The optimal number of cluster, despite conflicting results of various metrics, is set at \(k=4\) – the result determined by the highest value of Silhouette Width coinciding closely with the optimal numbers of clusters from the field’s literature. However, it is not only the optimal number of clusters, whose results were inconclusive – various methods of clustering were used for similar purposes in the previously cited literature. This study will focus on two distinct approaches, having already eliminated PAM clustering methods as inefficient in such a large dataset, CLARA as deemed redundant by the interna validation metrics and Density-Based clusteing methods (DBSCAN) as unfit for the highly dense data structure of this project:

K-Means Clustering

K-Means clustering algorithm assigns each and every data point to one of the pre-defined \(k\) clusters – it works as an optimization problem whose main objective is to find the \(k\) cluster centers and assign the observations to the nearest cluster center so that the differences within clusters measure by the squared distances from the cluster are minimized and the difference between clusters is maximized. The assignments are subsequently updated through adjusting the boundaries of clusters according to the current content of the clusters, which occurs several times until further changes are no longer feasible. Clustering is performed for \(k=4\).

k4 <- kmeans(music_scaled, 4)
k4$size
## [1] 2550 1811 5286  348

The data is split into 4 clusters, according to the Silhouette Width, all of which are unequal with the number of observations in each cluster, respectively: 2550, 1811, 5286, 348.

Based on the number of observations in each cluster and the visualization of cluster map, cluster 4 is mostly comprised of outlying observations – probably the heavy users, whose large listening density values were previously illustrated on the scatter plot. Although there is only few of them, their impact on the feature space is enormous, effectively pulling the cluster boundary outwards. This capture of the extreme variance introduced listening density stabilized the centroids of the remaining clusters, resulting in interpretable segments within this highly dense dataset. The dominant group, represented by the third cluster, constitutes the majority of the sample and most likely represents casual or mainstream listeners. Both cluster 3 and 1 from compact, dense regions in the center of the plot, overlapping slightly – suggesting lower Silhouette scores. Cluster 2 occupies a distinct region, suggesting a better separation than the central groups.

fviz_cluster(k4, data = music_scaled, 
             geom = 'point',
             ellipse.type = 'convex',
             ggtheme = theme_minimal(),
             main = "Cluster Map",
             palette = 'jco')

k4$centers
##   genre_entropy novelty_score mainstreaminess_score diversity_score
## 1     0.2222123     1.0369098            0.06245614      1.18744557
## 2    -1.3604383     0.2540524           -1.36958033      0.03674786
## 3     0.3704681    -0.5397671            0.44468660     -0.52517128
## 4    -0.1758105    -0.7212638           -0.08496139     -0.91517587
##   listening_density
## 1       -0.46882357
## 2       -0.12264418
## 3        0.02310175
## 4        3.72268072

Centers of clusters provide clear interpretability – cluster 1 represents the explorers of the dataset, with high novelty and diversity scores. Cluster 2 exhibits interestingly low values of genre entropy and mainstreaminess score – representing users with focused and niche music preferences. Cluster 3 does not portray any specific deviations from the average, suggesting a representation of the average user – fairly mainstream and inclined towards a moderate number of genres, without a tendency for diversity and novelty – a cluster for casual listeners, representing the majority of users in the sample. Cluster 4, however, displays an extreme value of listening density – encompassing the outlying heavy users.

The differences between clusters is better illustrated by a snake plot portraying the deviation from the average score for variables of all of the clusters’ centers.

The average silhouette width equals 0.28, which indicates a relatively weak structure with noticeable overlap between clusters. While a higher score could definitely be preferred, such partition maximized the width of silhouette. It is important to note that behavioral segmentation involving continuous psychometric variables, such as mainstreaminess or novelty the boundaries may naturally be fuzzy rather than rigid. Cluster 3 shows the most stable structure with poitive silhouette values – confirming casual users as the core and cohesive group of the dataset. Cluster 4 exhibits high variability with negative silhouette values, which is to be mathematically expected for an “outlier cluster” – the heavy users with extreme listening habits are distant form the center and highly dispersed from each other. The lower silhouette score is a trade-off accepted to gain detailed behavioral insights rather than a statistically pure but practically trivial split.

s4 <- silhouette(k4$cluster, dist(music_scaled))
fviz_silhouette(s4)
##   cluster size ave.sil.width
## 1       1 2550          0.27
## 2       2 1811          0.18
## 3       3 5286          0.33
## 4       4  348          0.16

Hierarchical Clustering

Hierarchical clustering (HCA) is a method of cluster analysis that does not require a pre-defined number of clusters, which allows for a further exploration of natural segmentation within the data. This project will utilize the agglomerative, or the bottom-up approach to hierarchical clustering, which begins with each singular data point in an individual cluster. With each iteration, the algorithms merges the two most similar clusters based on a chosen distance metric (here, euclidean distance is implemented) and linkage criterion, until all fo teh data points form a single cluster.

# drawing a sample 
set.seed(123)
sample_index <- sample(1:nrow(music_scaled), 4000)
music_sample <- music_scaled[sample_index, ]

# defining methods 
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")


ac_coeff <- function(x) {
  agnes(music_sample, method = x)$ac
}

results <- map_dbl(m, ac_coeff)
sort(results, decreasing = TRUE)
##      ward  complete   average    single 
## 0.9954920 0.9832191 0.9824863 0.9610489

In order to determine the most efficient linkage method – the Agglomerative Coefficients are computed for average, single, complete and Ward linkage methods (due to the issues with computation for the dataset used in this project, a random sample of 4000 records is drawn). The Ward linkage method exhibits the highest value of Agglomerative Coefficient and, hence, is chosen as the proper linkage method for this dataset. Ward’s method works by minimizing the total within-cluster sums of squares.

d <- dist(music_scaled, method = "euclidean")
hc1 <- hclust(d, method = "ward.D2")
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram - Ward Method")

Despite the suggestion of internal validation metrics to perform hierarchical clustering with \(k=2\), I firmly believe that such division would be too sparse to provide practical interpretation. Furthermore, visual analysis of the dendrogram reveals a clear dichotomy in the user base – most likely separating heavy users from casual listeners – following the tree down, however, both of those branches are split at a height between 80 and 110. The vertical length of these four branches before they subdivide further, especially the ones signified by red, dark blue and cyan, is substantial. This indicates a high level of separation stability and provides empirical evidence to retain \(k=4\) as the optimal segmentation level.

cut_hc1 <- hcut(music_scaled, k = 4, hc_method = 'ward.D2')
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram - Ward Method; k=4")
rect.hclust(hc1, k = 4, border = 2:5)

hc_clusters <- cutree(hc1, k=4)

Similarly to the results obtained from K-Means clustering, cluster 4 contains the fewest observations, as can be seen on the cluster map plotted below – once again it encompasses the outlying values. This group was sightly larger in K-Means, which liekly forced this group to be larger by pulling in moderate users. Cluster 3 illustrates the dominant group, which compared to the results of K-Means, has grown even more – containing approximately 63% of all users in the dataset. While such finding seems to be realistic for a streaming platform, where the vast majority of users are, indeed, the average consumers, this may provide little to no insight in practical interpretation and might lower the silhouette score even further. Clusters 1 and 2 with, respectively, 1503 and 1915 users represent substantial, specific sub-cultures within the audience.

table(hc_clusters)
## hc_clusters
##    1    2    3    4 
## 1503 1915 6288  289
fviz_cluster(list(data = music_scaled, cluster = hc_clusters),
             geom = "point",      
             ellipse.type = "convex", 
             palette = "jco",      
             ggtheme = theme_minimal(),
             main = "Cluster Map (Hierarchical Clustering)")

The four user archetypes based on the centroid data are, as follows: - Cluster 1: the niche user – genre entropy and mainstreaminess reach the lowest values in the dataset – these users listen to a narrow set of genres and actively avoid popular music. - Cluster 2: the explorer – achieving the highest diveristy and novelty scores, these are the users, who listen to a vast variety of tracks with an inclination towards novel sounds, unlike the first group, they switch the genres frequently. - Cluster 3: the casual users – in this group all of the variables oscillate near the average. These group represents the baseline users, accepting the mainstream, listening to a standard variety of music, who consume moderately. - Cluster 4: the heavy users – a group defined by a drastic deviation in listening density, these are the focused (low diversity) and heavy listeners.

temp_data <- as.data.frame(music_scaled)
temp_data$h_cluster <- as.factor(cut_hc1$cluster)
hc_centers <- aggregate(.~ h_cluster, data = temp_data, mean)
print(hc_centers)
##   h_cluster genre_entropy novelty_score mainstreaminess_score diversity_score
## 1         1   -1.51245157     0.4754816            -1.4348975       0.2784618
## 2         2    0.26743775     1.0341621             0.2096416       1.3358587
## 3         3    0.27655104    -0.3968652             0.2724522      -0.4294425
## 4         4    0.07653445    -0.6905918             0.1453566      -0.9562743
##   listening_density
## 1       -0.22945128
## 2       -0.54802322
## 3        0.04103353
## 4        3.93187164

The average silhouette width equals 0.25 – a value lower than the one rendered in K-Means analysis. This time it is the second cluster that displays the most stable structure, while not managing to reach a positive score across the entire group, the negative values it achieves are the highest among the clusters. Once again, cluster 4 exhibits high variability. The lower silhouette score yielded by hierarchical clustering analysis with no additional practical interpretations of the dataset leads to the selections of K-Means algorithm as the one that suits the data analyzed in this project the most.

sil_hc <- silhouette(hc_clusters, d)
fviz_silhouette(sil_hc)
##   cluster size ave.sil.width
## 1       1 1503          0.14
## 2       2 1915          0.32
## 3       3 6288          0.26
## 4       4  289          0.16


Interpretation

Further interpretation of the obtained segmentation will proceed with the results of K-Means clustering. In this section, prior unstadardized data will be utilized and clusters will be analyzed in terms of their demographic distinctions.

Cluster Profiles

The final user profiles rendered by each cluster are plotted below. The algorithm differentiated four distinct user archetypes, which I classified as follows: - Explorer – characterized by high novelty and discovery scores with relatively low listening density. These are the users, who are inclined towards novel sounds, engaging with a wide variety of genres and not shying away from the mainstream. In terms of music recommendation algorithms – these are the users, who may benefit the most out of a blend of a personalized approach and mainstream promotion.A group for users, who not only consume content but rather discover what the platform has to offer. - Niche users – defined mostly by their low mainstreaminess scores and genre entropy – these users avoid popular music and listen to a compact set of genres diversifying their consumption mostly among different tracks of the same genre (high diversity score). These are the users who have already found their niche, they know exactly what they want and they stick to it – a beneficial recommendation system would need to be highly personalized. - Casual listeners – these groups encompasses the majority of this sample – highly mainstream with relatively low novelty and diversity – these are the users who listen to the various genres and songs that are popular at the moment, - Heavy users – the outliers of the dataset – a group characterized by their extremely high listening density and low diversity scores. The focused, heavy users – a group that encompasses either the obsessive listeners or people or institution using music as background noise (retail stores and office backgrounds) – super-streamers.

Age in Clusters

Explorers and niche users are almost identical in terms of age structure – the average age of users in these clusters is 26 and the median age is 24. Casual listeners tend to be younger – the average age in this cluster is 23, while the median is 22. Heavy listeners occupy the middle ground between clusters 1, 2 and 3 – the avergae age is 25 and the median is 23. The sample seems to be comprised mostly of young adults with subtle differences in the distributions of age.

final_data$cluster <- as.factor(k4$cluster)

age_profile <- final_data %>%
  group_by(cluster) %>%
  summarise(
    Avg_Age = mean(age, na.rm = TRUE),
    Median_Age = median(age, na.rm = TRUE),
    Count = n()
  )
print(age_profile)
## # A tibble: 4 × 4
##   cluster Avg_Age Median_Age Count
##   <fct>     <dbl>      <dbl> <int>
## 1 1          26.6         24  2550
## 2 2          26.4         24  1811
## 3 3          23.8         22  5286
## 4 4          25.4         23   348

Visually, all four clusters exhibit an almost identical right-skewed distribution. The vast majority of users are concentrated around the early 20s. However, there is a long tail reaching up to 75+ years suggesting that, while th platform is dominated by young adults, there is a small but present segment of older users in every category. In all clusters, the age of the median user oscillates around 22-24 years, while the mean age is consistently positioned to the right of the median, due to the skewed data. The consistency of the gap between the mean and the median across all of the groups suggests that age may not be the driving force behind listening habits.

my_cols <- c("1" = "skyblue", 
               "2" = "darkseagreen", 
               "3" = "purple", 
               "4" = "pink")

cluster_names <- c("1" = "C1: Explorer", 
                   "2" = "C2: Niche", 
                   "3" = "C3: Casual", 
                   "4" = "C4: Heavy")


ggplot(final_data, aes(x = age, fill = cluster)) +
  geom_histogram(binwidth = 1, color = "white", alpha = 0.9) +
  geom_vline(data = age_profile, aes(xintercept = Avg_Age), 
             linetype = "dashed", color = "red", size = 0.5) +
  geom_vline(data = age_profile, aes(xintercept = Median_Age), 
             linetype = "solid", color = "navy", size = 0.3) +
  facet_wrap(~cluster, scales = "free_y", labeller = as_labeller(cluster_names)) +
  scale_fill_manual(values = my_cols) +
  labs(title = "Age Profile by Cluster",
       x = "Age",
       y = "Count of Users") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, face = "bold"))

The analysis of age across user archetypes reveals a remarkable age homogeneity across all four clusters, which may suggest that within this dataset – users habits are not defined by listener’s age, but rather by psychological or lifestyle-based factors. Consequently, marketing strategies and recommendation systems would need to rely on behavioral targeting rather than demographic segmentation by itself.

Gender Analysis

The dataset used in this study is hevaily male-dominated, as illustrated on the plot below. The behaviors associated with intense or obsessive listening appear to be more strongly male-correlated – niche and heavy clusters are both comprised of men in more than 75%. The cluster with the highest proportion of women is the casual listener one, followed by the explorer. The gender gap narrows slightly as the listening habits become more diversified or casual. However, it is hard to determine whether this disproportions reveal a natural, behavioral, feminine tendency towards mainstream music or whether it is just a result of a phenomenon described in literature as the popularity bias of the female users imposed and intesified by the unfairly biased recommender systems (Lesota et al. 2021).

gender_profile <- final_data %>%
  group_by(cluster, gender) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
# gender plot for clusters
ggplot(gender_profile, aes(x = cluster, y = Percentage, fill = gender)) +
  geom_col(position = "fill") +
  geom_text(aes(label = Count), 
            position = position_fill(vjust = 0.5), 
            color = "white",                     
            size = 3,                           
            fontface = "bold") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Gender Structure in Clusters", y = "Percent") +
  theme_minimal()

### Global Distribution

The map illustrating the dominant clusters in each country reveals that music consumption habits are highly globalized, the average users in the vast majority of Global North behave similarly. While niech and explorer behaviors exist they are not the dominant trait of big nations, but rather specific segments scattered across the world. The users in explorer, niceh and heavy user clustered scattered on the map may illustarte instances of small sample size artifacts.

country_domination <- final_data %>%
  group_by(country, cluster) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(country) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()

country_domination$region <- countrycode(sourcevar = country_domination$country,
                                         origin = "iso2c",
                                         destination = "country.name")

country_domination$region[country_domination$region == "United States"] <- "USA"
country_domination$region[country_domination$region == "United Kingdom"] <- "UK"
country_domination$region[country_domination$country == "UK"] <- "UK"

world_map <- map_data("world")
map_data_joined <- left_join(world_map, country_domination, by = "region")

map_data_joined$cluster <- as.factor(map_data_joined$cluster)

my_colors <- c("1" = "skyblue", 
               "2" = "darkseagreen", 
               "3" = "purple", 
               "4" = "pink")

ggplot(map_data_joined, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = cluster), color = "white", size = 0.1) +
  scale_fill_manual(values = my_colors, na.value = "gray", 
                    name = "Dominant Cluster",
                    labels = c("C1: Explorer", "C2: Niche", "C3: Casual", "C4: Heavy")) +
  theme_void() + 
  labs(title = "Dominant User Profile by Country") +
  theme(legend.position = "bottom")

The country that dominates across almost all of the clusters is the USA (likely due to the dataset’s skew towards North American users), with only one exception – niche user’s cluster is dominated by Russian users. This phenomenon can be interpreted through the lens of algorithmic cultural bias – since mainstreaminess is typically defined by global, hence Western-centric, charts, users consuming regionally popular content are mathematically classified as niche by the model. What appears to be niche behavior might as well represent a distinct cultural, most liekly non-Western-centric ecosystem rather than obscure taste.

## # A tibble: 4 × 4
##   cluster_name country_name count percent
##   <fct>        <chr>        <int>   <dbl>
## 1 C1: Explorer USA            580    22.7
## 2 C2: Niche    Russia         264    14.6
## 3 C3: Casual   USA            873    16.5
## 4 C4: Heavy    USA             57    16.4

Concluding Remarks

This analysis aimed to demystify the algorythmic black box by identifying the natural behavioral groupings that emerge from users’ consumption habits. The resultus confirm that users have, indeed, developed distinct and quantifiable strategies for navigating the abundance of music products – differentiating four distinct behavioral archetypes.

First, the analysis reveals that a vast majority of users (cluster 3: casual listeners) mitigate the the curse of choice by passivity – they reside comfortably in the mainstream, likely relying on standard recommendation algorithms or popularity charts. For these users, the main advantage of streaming platforms lies within the ease of access to popular content.

In contrast, the explorer or niche clusters demonstrate that the long tail* is not merely a repository of obscure content, but the preferred habitat for specific user types. The explorers traverse through the digital landscape to actively seek out novel sounds – treating the enormity of supply as a resource for discovery, rather than a burden. Meanwhile, the niche users reject the mainstream bias of algorithms, curating highly specific consumption patterns.

The last user archetype– the heavy user – illustrates a highly focused and obsessive consumer

This study challenges the assumption that such navigation strategies are demographic in nature. The remarkable age homogeneity across all clusters indicates that nuances of user behavior are better illustrated by psychometrics rather than demographic information. Furthermore, the dominance of Russian users in the niche cluster highlights Western-centric nature of the mainstream.

In conclusion, the consumption of music is driven by behavioral intent and the natural groupings of music consumers are better illustrated by how they cope with digital abundance, rather than by their demographic information. Recommendation systems must acknowledge the subtle nuances of user’s behavior to turn the curse of infinite choice into a tailored experience for every user archetype.


Bibliography

Anderson, C., Nissley, C., & Anderson, C. (2006, April). The long tail.

Kużelewska, U., & Wichowski, K. (2015, June). A modified clustering algorithm DBSCAN used in a collaborative filtering recommender system for music recommendation. In International conference on dependability and complex systems (pp. 245-254). Cham: Springer International Publishing.

Lesota, O., Melchiorre, A., Rekabsaz, N., Brandl, S., Kowald, D., Lex, E., & Schedl, M. (2021, September). Analyzing item popularity bias of music recommender systems: are different genders equally affected?. In Proceedings of the 15th ACM conference on recommender systems (pp. 601-606).

Marques, A., Andrade, N., & Balby, L. (2013). Exploring the relation between novelty aspects and preferences in music listening. In Proc. ISMIR.

Schedl, M., & Hauger, D. (2015, August). Tailoring music recommendations to users by considering diversity, mainstreaminess, and novelty. In Proceedings of the 38th international acm sigir conference on research and development in information retrieval (pp. 947-950).

Schedl, M., Brandl, S., Lesota, O., Parada-Cabaleiro, E., Penz, D., & Rekabsaz, N. (2022, March). LFM-2b: A dataset of enriched music listening events for recommender systems research and fairness analysis. In Proceedings of the 2022 Conference on Human Information Interaction and Retrieval (pp. 337-341).

Sousa, A. M., Almeida, J. M., & Figueiredo, F. (2019, August). Analyzing and modeling user curiosity in online content consumption: a LastFM case study. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining (pp. 426-431).

Zhang, Z., & Chang, J. (2022). Clustering-based Categorization of Music Users through Unsupervised Learning. Economics & Management Information, 1-8.