Introduction
The current landscape of digital content, in the era of streaming
platforms, can at times seem hard for a consumer to navigate. The
democratization of access to the consumption, as well as the production
and publishing, of music that followed digitization can simultaneously
feel like a blessing and a curse. The challenge that current music
consumers face lies not in the limitations imposed by the rigid supply
of the old brick-and-mortar shops, but in finding the desired product in
the thicket of each streaming platform’s near-infinite assortment.
Machine learning has not been indifferent to this issue – recommendation
algorithms usher users towards the items they are most likely to prefer.
While to the ordinary eye these may look like black boxes, generating
recommendations out of thin air and riddled with advertisements from
whichever record label owns the largest stake in a given streaming
platform, they often rely heavily on clustering. The implementation of
unsupervised learning methods provides crucial insight into the nature
of a platform’s supply as well as its demand: users, whose listening
histories are analyzed to characterize their consumption habits, receive
more personalized recommendations of products that have themselves been
analyzed and grouped so as to match users’ preferences.
The aim of this project is to perform a similar analysis of users’
behavioral music consumption patterns – to see whether any specific
patterns of users’ navigation across the long tail of music products can
be detected. For this purpose, clustering methods will be utilized to
discover a segmentation of music listeners based on their registered
listening habits. The main purpose of cluster analysis is to discover
the natural groupings within the data, characterized by high
intra-class similarity and high inter-class dissimilarity.
This study is structured as follows: the first section provides an
overview of the dataset used in this project and prepares the data for
further analysis. Subsequently, additional behavioral metrics, derived
from the literature on music recommender systems, are computed for
each user in this project’s data sample. The derived variables are then
analyzed and exploratory clustering analysis is performed – the
clusterability of the data is investigated and the optimal number of
clusters is determined. Afterwards, clustering algorithms such as K-Means
and Hierarchical Clustering are applied and subsequently evaluated in
terms of statistical purity and practical utility. The clusters are then
analyzed through the lens of the demographic variables registered in the
original dataset.
Data Description and Preprocessing
The analysis in this project will be based on LFM-2b – a dataset made
available by Schedl et al. (2022) – currently accessible solely through
the Wayback Machine, as the source platform revoked the license for
publishing large datasets derived from its site. LFM-2b contains
listening records of over 120,000 users of the music platform last.fm,
amounting to more than 2 billion listening events (interactions between
a distinct user and a specific song, enriched by its metadata) covering
more than 50 million distinct tracks by over 5 million artists.
Additionally, LFM-2b contains demographic information about the users
(age, gender, country) and additional track-level metadata, such as
tags, micro-genres and vector embeddings of lyrics. LFM-2b includes 10
data elements contained in separate files, focused on varying levels of
metadata. For this analysis I will use:
- listening-events – the main data file containing tuples of <user id, track id, album id, timestamp>,
- users – a file linking distinct users with their demographic data, <user id, country, age, gender, creation time>,
- tags-micro-genres – a file linking distinct tracks to user-generated tags, filtered so as to only represent genres provided by Every Noise at Once.
Due to the exceptionally large size of the dataset and limited
computational power, in this study I draw a semi-random sample of
10,000 users on whose listening events the further analysis will focus.
In order to ensure proper user-level metadata representation, the
sample is drawn from those users who provide information on their
gender and country, and whose age ranges between 10 and 95 years. The
dataset thus filtered consists of 42,683 users, meaning that only 35% of
the users in the dataset provide representative demographic data.
## Classes 'data.table' and 'data.frame': 120322 obs. of 5 variables:
## $ user_id : int 0 1 2 3 4 5 6 7 8 9 ...
## $ country : chr "UK" "US" "UK" "BR" ...
## $ age : int 31 43 35 31 51 38 28 -1 36 45 ...
## $ gender : chr "m" "m" "m" "m" ...
## $ creation_time: POSIXct, format: "2002-12-28 01:00:00" "2003-04-15 02:00:00" ...
## - attr(*, ".internal.selfref")=<externalptr>
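The filtering and sampling chunk itself is not echoed above. A minimal
sketch of it, assuming the users file has been read into a data.table
named users and that declared genders are coded "m" and "f" (assumptions
on my part – only "m" is visible in the output above), could look as
follows:
# hypothetical sketch of the user filtering and sampling step
set.seed(123) # assumed seed - the original value is not shown
# keep users who declare gender and country and whose age is between 10 and 95
valid_users <- users[gender %in% c("m", "f") & !is.na(country) & age >= 10 & age <= 95]
# draw the semi-random sample of 10,000 users
sampled_users <- valid_users[sample(.N, 10000)]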
Further analysis will be focused on the previously selected users,
whose listening histories are imported from the main data file
(listening-events.tsv). The original dataset records the activity of
120,322 users spanning the period between February 14th, 2005 and
March 20th, 2020. This adds up to a total of over 2 billion listening
events, hence the need for sampling and for importing the target users’
records from the file with the use of the DuckDB library. DuckDB is an
in-process SQL OLAP database management system, which allows for faster
and more computationally efficient data processing (DuckDB Docs).
# creating a connection (DuckDB in memory)
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
# reading sampled user id list into the connection
dbWriteTable(con, "target_users", sampled_users %>% select(user_id))
input_file <- "/Users/gosia/Desktop/tails/lastfm_data/listening-events.tsv"
output_file <- "/Users/gosia/Desktop/sampled_events.csv"
# defining the query - extracting the listening events of target users
query <- paste0("
SELECT
le.user_id,
le.track_id,
le.timestamp
FROM read_csv('", input_file, "', delim='\t', header=TRUE, auto_detect=TRUE) AS le
INNER JOIN target_users tu ON CAST(le.user_id AS VARCHAR) = tu.user_id
")
# execute the query and save
sql_cmd <- paste0("COPY (", query, ") TO '", output_file, "' (HEADER, DELIMITER ',')")
dbExecute(con, sql_cmd)
## [1] 296947364
dbDisconnect(con)
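The exported CSV is later read as a Parquet file
(sampled_events.parquet). The conversion step is not echoed in the
document; a minimal sketch of it, assuming the arrow package (already
used elsewhere in this project), could be:
# hypothetical conversion of the sampled events from CSV to Parquet
sampled_events <- arrow::read_csv_arrow("/Users/gosia/Desktop/sampled_events.csv")
arrow::write_parquet(sampled_events, "/Users/gosia/Desktop/sampled_events.parquet")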
Calculating Behavioral Metrics
Diversity
A user’s diversity is defined in three ways: one based on the entirety
of the user’s listening habits, indicating what portion of the user’s
total playcount (the sum of all of the user’s listening events)
is constituted by unique tracks, and the other two representing the
user’s micro-genre diversity.
Track Level
Track-level diversity (\(TD_i\)) is computed for each user in the
sample as:
\[TD_{i} = UT_i / TP_i\]
where \(UT_i\) is the number of distinct, unique tracks in the entirety
of user \(i\)’s listening history and \(TP_i\) is the total number of
listening events constituting user \(i\)’s listening history.
# connecting to the file
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet")
# calculating track diversity
diversity_results <- ds %>%
group_by(user_id) %>%
summarise(total_playcount = n(),
unique_tracks = n_distinct(track_id)) %>%
mutate(diversity_score = unique_tracks / total_playcount) %>%
collect()
# taking a look
head(diversity_results)
## # A tibble: 6 × 4
## user_id total_playcount unique_tracks diversity_score
## <int> <int> <int> <dbl>
## 1 34590 30832 9698 0.315
## 2 83295 13059 5510 0.422
## 3 61327 21333 1732 0.0812
## 4 2409 25032 7896 0.315
## 5 21574 22595 3366 0.149
## 6 16648 12215 7513 0.615
Micro-genre Level
The micro-genre file contains user-generated tags describing tracks
with more than 10 listens globally, filtered so as to only retain tags
corresponding to the genres and micro-genres provided by Every Noise at
Once. Therefore, not all of the songs represented in users’ listening
histories will be assigned a micro-genre; however, those songs are
either quite obscure (fewer than 10 listens globally) or not represented
by uniformly informative tags (a downside of user-generated tags is
their potential lack of informative value, as humorous tags can provide
deceitful information or a lot of noise).
To compute micro-genre level diversity, I limit the analysis to the
dominant tag for each track. Many records in the dataset are described
with tags such as “pop”, “indie pop” and “alternative pop”, more often
than not in the same order. The latter tags do not deepen the
understanding of the track’s genre in terms of diversity. Therefore,
following the rule of sparsity, only the first tag is considered in
further analysis.
# loading file
lines <- read_lines("/Users/gosia/Desktop/tails/lastfm_data/tags-micro-genres.json")
track_genres <- data.table(track_id = as.integer(str_extract(lines, "(?<=\"i\":)\\d+")), # extracting track_id - the number following "i": in each line
                           top_genre = str_extract(lines, "(?<=\"tags\":\\{\")[^\"]+")) # extracting top_genre - the first tag on the list, following "tags":{
# cleaning RAM
rm(lines)
# taking a look
head(track_genres)
## track_id top_genre
## <int> <char>
## 1: 36346257 pop
## 2: 7247506 pop
## 3: 10540381 indie rock
## 4: 36367985 soul
## 5: 36039983 grunge
## 6: 24361147 folk
# saving file
write_parquet(track_genres, "/Users/gosia/Desktop/top_genres.parquet")
setDT(track_genres)
total_n_genres <- uniqueN(track_genres$top_genre)
# aggregating users listening history
# simplifying the listening events dataset so that it is in a <user_id, track_id, playcount> format
# where playcount is the number of times that a given track was listened to by a given user
con <- dbConnect(duckdb::duckdb())
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet") %>%
to_duckdb(con = con, table_name = "events")
sampled_playcounts <- ds %>%
group_by(user_id, track_id) %>%
summarise(
playcount = n(),
.groups = "drop"
) %>%
collect()
dbDisconnect(con)
setDT(sampled_playcounts)
# connecting dataset with genres
full_data <- merge(sampled_playcounts, track_genres, by = "track_id", all.x = TRUE)
# tags are provided only for the songs with more than 10 scrobbles on the platform
# hence not all songs have genres
# percent of na values
sum(is.na(full_data$top_genre))/length(full_data$track_id) # 41% NAs
## [1] 0.4146919
full_data[is.na(top_genre), top_genre := "Unknown"]
Genre Ratio
One metric of a user’s genre diversity is the ratio of the number of
genres listened to by user \(i\) (\(G_i\)) to the total number of genres
represented in the sample of listening histories (\(TG\)):
\[ GD_i = G_i/TG \]
Genre Entropy
The former representation of genre diversity may not be the one that
provides the most information on a user’s actual listening habits. A
user who listens mainly to rock music but has tried pop, metal, hip-hop
or jazz once or twice and a user who listens to all of those genres
equally will receive the same score.
To alleviate this issue, an alternative metric of user genre diversity
– based on Shannon entropy – is proposed. It is computed using the
following formula:
\[ GE_i = - \sum_{g \in G_i} p(g) \log_2 p(g) \]
where \(GE_i\) represents the genre entropy of user \(i\), \(G_i\)
represents all of the unique micro-genres played by user \(i\) and
\(p(g)\) is the probability (relative frequency) of listening to
micro-genre \(g\), calculated for each user as follows:
\[ p(g) = \text{playcount of genre } g / \text{user's total playcount}\]
### CALCULATING DIVERSITY METRICS ###
# Shannon entropy
calculate_weighted_entropy <- function(counts) {
total <- sum(counts)
if(total == 0) return(0)
p <- counts/total # probability of a given genre being played
-sum(p*log2(p)) # Shannon entropy formula
}
# user - genre table
user_genre <- full_data[, .(genre_playcount = sum(playcount)),
by = .(user_id, top_genre)]
# calculating metrics for each user
user_genre_metrics <- user_genre[, .(unique_genres = .N,
genre_ratio = .N/total_n_genres,
genre_entropy = calculate_weighted_entropy(genre_playcount)),
by = user_id]
head(user_genre_metrics)
## user_id unique_genres genre_ratio genre_entropy
## <int> <int> <num> <num>
## 1: 7546 163 0.07882012 3.388006
## 2: 53343 74 0.03578337 3.336050
## 3: 29864 138 0.06673114 3.801346
## 4: 98958 289 0.13974855 3.412806
## 5: 6307 321 0.15522244 4.176551
## 6: 1342 274 0.13249516 3.255282
Novelty
The methodology for calculating novelty is adapted from
Schedl and Hauger (2015); it models a user’s tendency to listen to
previously unknown music. It is calculated by dividing the user’s
listening history into 6-month periods and computing the ratio of novel
items – those appearing for the first time in the entirety of the user’s
listening history. Subsequently, the overall novelty score of user
\(i\) is computed by averaging over all of the 6-month time windows.
It is computed using the following formula:
\[ N_{i_t} = NT_{i_t} / TP_{i_t} \]
where \(NT_{i_t}\) denotes the number of novel tracks in time window
\(t\) for user \(i\) and \(TP_{i_t}\) is the total playcount of the
given user in time slot \(t\). Subsequently, the total novelty for user
\(i\) is calculated as:
\[ N_i = \frac{1}{n_t} \sum_{t=1}^{n_t} N_{i_t} \]
where \(n_t\) is the number of all of the time windows for which the
novelty score \(N_{i_t}\) was computed.
# connecting to DuckDB
con <- dbConnect(duckdb::duckdb())
# installing icu extension within duckdb for date handling
dbExecute(con, "INSTALL icu;")
## [1] 0
dbExecute(con, "LOAD icu;")
## [1] 0
# importing file into the established connection
ds <- open_dataset("/Users/gosia/Desktop/sampled_events.parquet") %>%
to_duckdb(con = con, table_name = "events")
# calculating discovery_moments - the first time that a user listened to a given track
discovery_moment <- ds %>%
group_by(user_id, track_id) %>%
summarise(first_timestamp = min(timestamp), .groups = "drop")
### calculating novelty metrics in 6 month long windows
novelty_metrics <- ds %>%
left_join(discovery_moment, by = c("user_id", "track_id")) %>%
mutate(curr_year = year(timestamp),
curr_month = month(timestamp),
disc_year = year(first_timestamp),
disc_month = month(first_timestamp)) %>% # establishing time windows
# mutating window id so that they are in a YYYYH format, where YYYY denotes the year and H which half of the year the event took place in
mutate(current_window = curr_year*10 + if_else(curr_month<= 6, 1, 2),
discovery_window = disc_year*10 + if_else(disc_month <= 6, 1, 2)) %>%
mutate(is_novel = if_else(current_window == discovery_window, 1, 0)) %>%
# grouping by users and windows
group_by(user_id, current_window) %>%
summarise(window_novelty = mean(is_novel),
.groups = "drop") %>%
# average for users
group_by(user_id) %>%
summarise(
novelty_score = mean(window_novelty)) %>%
collect()
# cleaning up and disconnecting
dbDisconnect(con)
head(novelty_metrics)
## # A tibble: 6 × 2
## user_id novelty_score
## <int> <dbl>
## 1 94635 0.597
## 2 31374 0.756
## 3 634 0.625
## 4 26948 0.450
## 5 15989 0.588
## 6 19643 0.653
Mainstreaminess
Another behavioral metric describing a user’s listening habits is
mainstreaminess (Schedl and Hauger, 2015), which captures the
user’s inclination to listen to songs that are popular, or
mainstream.
In this analysis, mainstream tracks are defined by the frequency with
which they were listened to over the entire timespan covered by LFM-2b.
To construct a ranking of all of the songs in the dataset, the
file containing <user id, track id, playcount> –
listening-counts.tsv – is used. For each track, all of the users’
playcounts are aggregated (global playcount).
# big dataset - duckdb
con <- dbConnect(duckdb::duckdb(), dbdir = ":memory:")
path <- '/Users/gosia/Desktop/tails/lastfm_data/listening-counts.tsv'
dbExecute(con, paste0("
CREATE OR REPLACE VIEW all_counts AS
SELECT * FROM read_csv('", path, "', delim = '\t', header = TRUE, auto_detect = TRUE)"))
## [1] 0
# query for aggregating all of the times that a track was played
query <- "SELECT
track_id,
SUM(count) as global_playcount
FROM all_counts
GROUP BY track_id"
# executing and saving
global_output <- "/Users/gosia/Desktop/global_track_ranks.csv"
dbExecute(con, paste0("COPY (", query, ") TO '", global_output, "' (HEADER, DELIMITER ',')"))
## [1] 50813373
dbDisconnect(con)
global_tracks <- fread('/Users/gosia/Desktop/global_track_ranks.csv')
options(scipen = 999)
plot_data <- global_tracks[global_playcount >= 5]
plot_data$log <- log10(plot_data$global_playcount)
# plot of raw distribution, zoomed
p_raw <- ggplot(plot_data, aes(x = global_playcount)) +
geom_histogram(fill = "violet", color = "white", binwidth = 10) +
coord_cartesian(xlim = c(0, 500)) +
labs(title = "Raw Distribution",
x = "Global Playcount",
y = "Track Count") +
theme_minimal()
# log plot
p_log <- ggplot(plot_data, aes(x = log)) +
geom_histogram(bins = 60, fill = "purple", color = "white", alpha = 0.9) +
labs(title = "Log-Transformed Distribution",
x = "Log10(Global Playcount)",
y = "Track Count") +
theme_minimal()
grid.arrange(p_raw, p_log, ncol = 2)

The distribution of track popularity clearly illustrates the long
tail phenomenon (Anderson, 2006) – characteristic of cultural
industries post-digitization. Digitization has drastically transformed
the supply of music accessible to an average listener. From the old
brick-and-mortar shops with strictly limited supply, promoting mostly
the best-selling products and hence perpetuating their popularity,
through the rise of music piracy and easily distributable digital copies
in the early 2000s, all the way to the streaming platforms used nowadays
– the available supply of music products has increased exponentially in
the 21st century. This development led to a shift in how consumers
explore and navigate an effectively infinite supply of
experience goods, whose true value is known to the consumer
only after the good is consumed. As the plots above make visible,
most of the songs in the dataset have been replayed fewer than 100 times
globally; however, there remain songs that have been replayed many
times, constituting the mainstream. Due to this distribution of track
popularity, a logarithmic transformation is applied to the global
playcount in further analysis.
Subsequently, the mainstreaminess score of each track \(s\) is
computed as:
\[ M_s = \log_{10}(\text{global playcount}_s + 1) / \log_{10}(\max(\text{global playcount}) + 1)\]
where \(\max(\text{global playcount})\) is the number of times that the
most-played track in the dataset was played. The mainstreaminess score
per user is then calculated as the mean of \(M_s\) over the tracks in
the user’s history, weighted by the user’s playcount of each song.
# logarithmic normalization
max_playcount <- max(global_tracks$global_playcount)
# formula: log(playcount + 1) / log(max + 1)
global_tracks[, track_score := log10(global_playcount + 1) / log10(max_playcount + 1)]
# cleaning up - only the score needed
global_tracks[, global_playcount := NULL]
# merging with sampled users' playcount history
merged <- merge(sampled_playcounts, global_tracks, by = "track_id", all.x = TRUE)
summary(merged)
## track_id user_id playcount track_score
## Min. : 1 Min. : 1 Min. : 1.00 Min. :0.0552
## 1st Qu.:13811903 1st Qu.: 9375 1st Qu.: 1.00 1st Qu.:0.3449
## Median :25210466 Median : 22198 Median : 1.00 Median :0.5172
## Mean :25296113 Mean : 29663 Mean : 4.23 Mean :0.4933
## 3rd Qu.:37120819 3rd Qu.: 44877 3rd Qu.: 3.00 3rd Qu.:0.6577
## Max. :50813367 Max. :120319 Max. :108903.00 Max. :1.0000
# mainstreaminess per user
mainstreaminess <- merged[, .(mainstreaminess_score = weighted.mean(track_score, w = playcount)),
by = user_id]
Listening Density
Listening density is a metric reflecting each user’s average yearly
playcount. First, the number of years that each user was active is
computed as the duration between the user’s first activity on the
platform (the date of the user’s first listening event, not the
registration date) and the last activity registered in the dataset,
divided by 365 for a yearly approximation.
This process led to the identification of 4 outlying observations –
users whose activity was registered only once within a single day –
which are eliminated from further analysis. Another user is eliminated
due to faulty data: their first registered activity was on 1970-01-01,
while the last.fm platform has been running only since 2002.
Afterwards, listening density is calculated as:
\[ LD_i = TP_i/AY_i\]
where \(TP_i\) is the total playcount of a user and \(AY_i\) represents
the number of years that the user was active.
con <- dbConnect(duckdb::duckdb())
parquet_input <- "/Users/gosia/Desktop/sampled_events.parquet"
# query for finding min and max date of listening events
query <- paste0(" SELECT
user_id,
MIN(timestamp) as first_seen,
MAX(timestamp) as last_seen
FROM '", parquet_input, "'
GROUP BY user_id")
user_lifespan <-dbGetQuery(con, query)
setDT(user_lifespan)
dbDisconnect(con)
# calculating the difference
# in days
user_lifespan[, duration_days := as.numeric(difftime(last_seen, first_seen, units = "days"))]
user_lifespan[, one_timer := ifelse(duration_days<1, 1, 0)]
sum(user_lifespan$one_timer) # 4 one-timers - eliminate as outliers
## [1] 4
# checking whether there are any data errors
summary(user_lifespan) # yes, min first_seen is 1970-01-01
## user_id first_seen
## Min. : 1 Min. :1970-01-01 01:00:36.00
## 1st Qu.: 15476 1st Qu.:2009-12-29 18:42:44.50
## Median : 35662 Median :2011-06-30 16:23:32.00
## Mean : 42538 Mean :2010-12-25 23:10:08.85
## 3rd Qu.: 64079 3rd Qu.:2012-04-08 21:26:48.00
## Max. :120319 Max. :2014-08-29 01:37:59.00
## last_seen duration_days one_timer
## Min. :2012-06-01 02:48:22.00 Min. : 0.0 Min. :0.0000
## 1st Qu.:2013-10-14 19:05:45.75 1st Qu.: 959.9 1st Qu.:0.0000
## Median :2015-11-14 20:36:13.00 Median : 1936.4 Median :0.0000
## Mean :2016-05-18 13:46:35.40 Mean : 1970.6 Mean :0.0004
## 3rd Qu.:2019-09-05 05:57:08.25 3rd Qu.: 2835.9 3rd Qu.:0.0000
## Max. :2020-03-20 12:59:58.00 Max. :17248.2 Max. :1.0000
# Unix Epoch glitch
# last.fm has been running since 2002 - eliminate all of the users whose first_seen dates are before that as unreliable
to_eliminate <- user_lifespan[year(first_seen) < 2002 | duration_days < 1, user_id] # losing 5 users - saving to eliminate from final data
user_lifespan <- user_lifespan[!user_id %in% to_eliminate]
summary(user_lifespan) # all good
## user_id first_seen
## Min. : 1 Min. :2005-02-14 01:07:46.00
## 1st Qu.: 15468 1st Qu.:2009-12-29 17:19:38.00
## Median : 35671 Median :2011-06-30 13:21:22.00
## Mean : 42534 Mean :2010-12-27 02:09:53.03
## 3rd Qu.: 64061 3rd Qu.:2012-04-08 20:09:25.50
## Max. :120319 Max. :2014-08-29 01:37:59.00
## last_seen duration_days one_timer
## Min. :2012-06-01 02:48:22.00 Min. : 2.07 Min. :0
## 1st Qu.:2013-10-15 06:24:05.00 1st Qu.: 960.69 1st Qu.:0
## Median :2015-11-14 23:44:52.00 Median :1937.12 Median :0
## Mean :2016-05-18 23:00:52.91 Mean :1969.87 Mean :0
## 3rd Qu.:2019-09-06 23:10:52.00 3rd Qu.:2835.91 3rd Qu.:0
## Max. :2020-03-20 12:59:58.00 Max. :5512.99 Max. :0
# converting into years
user_lifespan[, active_years := duration_days/365]
write.csv(user_lifespan, "user_lifespan_sample.csv", row.names = F)
datasets <- list(sampled_users, user_lifespan, user_genre_metrics, novelty_metrics, mainstreaminess, diversity_results)
final_data <- datasets %>%
reduce(function(x,y) merge(x,y, by = "user_id", all.x = TRUE))
# eliminate the users that were previously identified as outliers
final_data<- final_data[!user_id %in% to_eliminate]
colSums(is.na(final_data)) # no NAs
## user_id country age
## 0 0 0
## gender creation_time first_seen
## 0 0 0
## last_seen duration_days one_timer
## 0 0 0
## active_years unique_genres genre_ratio
## 0 0 0
## genre_entropy novelty_score mainstreaminess_score
## 0 0 0
## total_playcount unique_tracks diversity_score
## 0 0 0
final_data[, listening_density := total_playcount/active_years]
Selecting Features for Clustering
Clustering will be performed only on the previously calculated numeric
behavioral metrics:
* Diversity (track-level),
* Genre Entropy,
* Novelty,
* Mainstreaminess,
* Listening Density.
Demographic information provided by the users will later be used for
the analysis of the final clusters.
str(final_data)
## Classes 'data.table' and 'data.frame': 9995 obs. of 19 variables:
## $ user_id : int 1 5 8 12 25 31 32 35 48 50 ...
## $ country : chr "US" "ES" "US" "US" ...
## $ age : int 43 38 36 33 31 36 36 28 35 37 ...
## $ gender : chr "m" "m" "m" "m" ...
## $ creation_time : POSIXct, format: "2003-04-15 02:00:00" "2003-07-22 02:00:00" ...
## $ first_seen : POSIXct, format: "2006-04-30 23:36:15" "2006-06-09 16:58:38" ...
## $ last_seen : POSIXct, format: "2014-02-20 23:05:56" "2018-06-14 18:32:00" ...
## $ duration_days : num 2853 4388 1570 3316 3462 ...
## $ one_timer : num 0 0 0 0 0 0 0 0 0 0 ...
## $ active_years : num 7.82 12.02 4.3 9.08 9.48 ...
## $ unique_genres : int 65 355 173 185 290 172 178 141 77 141 ...
## $ genre_ratio : num 0.0314 0.1717 0.0837 0.0895 0.1402 ...
## $ genre_entropy : num 3.45 4.75 3.48 5.11 4.98 ...
## $ novelty_score : num 0.987 0.726 0.703 0.758 0.768 ...
## $ mainstreaminess_score: num 0.417 0.537 0.423 0.609 0.55 ...
## $ total_playcount : int 868 52823 20000 16189 22597 12502 21642 22128 9947 16251 ...
## $ unique_tracks : int 775 24442 10180 7876 12961 8220 6755 4255 4569 5634 ...
## $ diversity_score : num 0.893 0.463 0.509 0.487 0.574 ...
## $ listening_density : num 111 4394 4649 1782 2383 ...
## - attr(*, ".internal.selfref")=<externalptr>
## - attr(*, "sorted")= chr "user_id"
summary(final_data)
## user_id country age gender
## Min. : 1 Length:9995 Min. :10.00 Length:9995
## 1st Qu.: 15468 Class :character 1st Qu.:20.00 Class :character
## Median : 35671 Mode :character Median :23.00 Mode :character
## Mean : 42534 Mean :25.01
## 3rd Qu.: 64061 3rd Qu.:27.00
## Max. :120319 Max. :95.00
## creation_time first_seen
## Min. :2003-02-10 19:05:23.00 Min. :2005-02-14 01:07:46.00
## 1st Qu.:2007-09-30 17:36:06.50 1st Qu.:2009-12-29 17:19:38.00
## Median :2009-07-07 16:35:47.00 Median :2011-06-30 13:21:22.00
## Mean :2009-05-30 07:15:25.22 Mean :2010-12-27 02:09:53.03
## 3rd Qu.:2011-04-15 14:13:51.50 3rd Qu.:2012-04-08 20:09:25.50
## Max. :2012-07-19 22:07:27.00 Max. :2014-08-29 01:37:59.00
## last_seen duration_days one_timer
## Min. :2012-06-01 02:48:22.00 Min. : 2.07 Min. :0
## 1st Qu.:2013-10-15 06:24:05.00 1st Qu.: 960.69 1st Qu.:0
## Median :2015-11-14 23:44:52.00 Median :1937.12 Median :0
## Mean :2016-05-18 23:00:52.91 Mean :1969.87 Mean :0
## 3rd Qu.:2019-09-06 23:10:52.00 3rd Qu.:2835.91 3rd Qu.:0
## Max. :2020-03-20 12:59:58.00 Max. :5512.99 Max. :0
## active_years unique_genres genre_ratio genre_entropy
## Min. : 0.005672 Min. : 2.0 Min. :0.0009671 Min. :0.03367
## 1st Qu.: 2.632024 1st Qu.: 68.0 1st Qu.:0.0328820 1st Qu.:3.30627
## Median : 5.307192 Median :114.0 Median :0.0551257 Median :3.87241
## Mean : 5.396901 Mean :126.8 Mean :0.0613068 Mean :3.74852
## 3rd Qu.: 7.769602 3rd Qu.:172.0 3rd Qu.:0.0831721 3rd Qu.:4.32403
## Max. :15.104076 Max. :793.0 Max. :0.3834623 Max. :5.63307
## novelty_score mainstreaminess_score total_playcount unique_tracks
## Min. :0.08837 Min. :0.1616 Min. : 9 Min. : 9
## 1st Qu.:0.51300 1st Qu.:0.5109 1st Qu.: 5306 1st Qu.: 1594
## Median :0.63544 Median :0.5732 Median : 18923 Median : 4248
## Mean :0.64205 Mean :0.5634 Mean : 29707 Mean : 7020
## 3rd Qu.:0.76709 3rd Qu.:0.6273 3rd Qu.: 34722 3rd Qu.: 8932
## Max. :1.00000 Max. :0.8554 Max. :662625 Max. :151467
## diversity_score listening_density
## Min. :0.005538 Min. : 7.38
## 1st Qu.:0.188262 1st Qu.: 1665.20
## Median :0.298967 Median : 3624.64
## Mean :0.341958 Mean : 5518.25
## 3rd Qu.:0.454493 3rd Qu.: 6857.03
## Max. :1.000000 Max. :177363.83
The visual inspection of the scatter plots reveals a high density and
uniform coverage of the behavioral features, indicating that users are
effectively differentiated. In sharp contrast to these, listening
density displays a sparse distribution with few but drastic outlying
values, exceeding 100,000 playcounts per year. These structural
differences necessitate data standardization, so that the overall
cluster distance calculations are not disproportionately dominated by
consumption volume, masking the more nuanced behavioral patterns from
the algorithm.

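The plotting chunk for these scatter plots is not echoed; a sketch of a
pairwise scatter-plot matrix of the behavioral features, assuming the
GGally package (an assumption – the original figure may have been
produced differently), could be:
# hypothetical pairwise scatter plots of the behavioral features
library(GGally)
ggpairs(as.data.frame(final_data[, .(genre_entropy, novelty_score, mainstreaminess_score, diversity_score, listening_density)]),
        lower = list(continuous = wrap("points", alpha = 0.2, size = 0.3)))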
To further examine the relationships between the final clustering
variables, a correlation matrix is calculated. Its analysis reveals
logical dependencies between the variables. There is a strong positive
correlation (r = 0.68) between diversity and novelty, indicating that
users who are more inclined to discover new tracks also tend to have
more diversified consumption. Interestingly, mainstreaminess correlates
positively with genre entropy, which suggests that mainstream
consumption also involves eclectic listening across popular hits from
various genres. Listening density displays a moderate negative
correlation with both diversity and novelty, implying that the most
active users (heavy listeners) tend to have more focused and
repetitive listening habits – favouring familiarity over novelty and
exploration. All of the variables are retained for further analysis, as
there is no severe multicollinearity within the data – all of the
correlation coefficients remain below the critical threshold and none of
the features are deemed redundant.
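The correlation chunk itself is not shown; a minimal sketch of it,
assuming the corrplot package for visualization (the original may have
used a different tool), could be:
# hypothetical computation and plot of the correlation matrix
library(corrplot)
corr_matrix <- cor(final_data[, .(genre_entropy, novelty_score, mainstreaminess_score, diversity_score, listening_density)])
corrplot(corr_matrix, method = "color", addCoef.col = "black", tl.col = "black")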
features <- c("genre_entropy", "novelty_score",
"mainstreaminess_score", "diversity_score", "listening_density")
music_data <- final_data[, ..features]

# scaling data
music_scaled <- scale(music_data)
Pre-clustering Assessment
Clusterability
Prior to the launch of clustering algorithms, the data needs to be
further analyzed in terms of its clusterability and the appropriate
parameter selections, such as the optimal number of clusters (denoted
\(k\)).
First, the Hopkins statistic is calculated in order to verify whether
the data is suitable for clustering. The value of the Hopkins statistic
for this project’s data is equal to \(0.8255\); therefore, the null
hypothesis of the uniformity of the data’s distribution is rejected.
The Visual Assessment of cluster Tendency (VAT) plot, displayed below,
illustrates the Euclidean distance between data points reordered
according to their mutual similarity, where red indicates low distance
(within-cluster similarity) and blue indicates high distance
(between-cluster dissimilarity). The plot generated for the data used in
this analysis displays a block-diagonal structure with reddish squares
along the main diagonal – representing similar groups of observations –
separated by blue, perpendicular regions of high dissimilarity. The
presence of these structures serves as visual evidence that the dataset
possesses a non-random, clusterable structure, supporting the
application of clustering algorithms.
### HOPKINS STATISTIC ###
set.seed(123)
random_rows <- sample(1:nrow(music_scaled), 1000)
data_sample <- music_scaled[random_rows,]
get_clust_tendency(data_sample, n = 500, graph = TRUE)
## $hopkins_stat
## [1] 0.8255465
##
## $plot
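The ordered dissimilarity image returned in $plot above can also be
drawn explicitly; a sketch, assuming factoextra (the package that also
provides get_clust_tendency), could be:
# hypothetical explicit VAT-style plot of the sampled, scaled features
library(factoextra)
fviz_dist(dist(data_sample), show_labels = FALSE)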
The Optimal Number of Clusters
Having ensured that the data used in this project does, indeed,
possess a non-random structure with potential for meaningful clusters,
further examination of data structure is performed. This section will
focus on the determination of the optimal number of meaningful
clusters.
Sousa et al. (2019) find four main types of user behaviors (\(k=4\)) in
terms of how the curiosity stimuli captured by novelty, uncertainty,
conflict and complexity drive user accesses to online music. Performing
Mini-Batch K-Means on 150 million listening events generated by 43,000
users, they discovered four distinct types of user behavior:
* Underground - access towards items with similar stimuli metrics and a tendency to stay in the relaxation zone (moderate values) for each metric,
* Eclectic - access regularly spread across each metric, achieving higher values compared with underground access,
* Niche Radical - a type of access where the user listens to repeated songs (low novelty), with the songs being more complex,
* Explorer - pursuing more complex, novel music.
While the authors stress that they are analyzing access profiles – a
characteristic of listening events – rather than user profiles, I
suspect that a similar analysis can be applied to user behavior as a
whole.
However, Marques et al. (2013) perform a similar analysis based on
familiarity, mainstreaminess and metrics measuring attention – revealing
seven user clusters (\(k=7\)), computed with the Ward hierarchical
clustering method: fond of surprises, mainstream upholder, mainstream
explorer, crowd follower, niche radical, highly eclectic and
underground.
There is an extensive and ever-growing body of literature concerning
the analysis of music recommendation systems – a crucial tool in the
post-digitization era, providing an increasingly personalized way to
navigate the near-infinite supply of digital content. Schedl and Hauger
(2015) propose an implementation of user-specific characteristics
modelling listening behavior, such as diversity, mainstreaminess
and novelty. Based on those behavioral features, they manually divide
the users into 14 categories based on imposed statistical divisions.
This project implements the same approach used in their study but,
instead of artificially imposed user segmentation, clustering algorithms
will be applied to analyze the natural groupings of the data.
Chang and Zhang (2022) use unsupervised learning methods, such as
Principal Component Analysis (PCA) and K-means clustering, to categorize
users based on their behaviors, preferences and demographic information,
ending up with seven clusters (\(k=7\)) characterized by age, genre, the
intensity of the user’s activity and interaction with the streaming
platform’s social functions. Kużelewska and Wichowski (2021), however,
propose the use of modified DBSCAN clustering algorithms for more
efficient Collaborative Filtering recommender systems (based on users’
interactions with the items).
To sum up, the literature concerning the clustering of music consumers
points neither to a single most effective clustering method, nor to a
clear optimal number of clusters. In such a case, to determine the
appropriate number of clusters, the Elbow Method, Silhouette and
Calinski-Harabasz Index will be applied in the following section.
Elbow Method
The elbow plot illustrates the explained variation as a function of the
number of clusters. According to this method, the optimal number of
clusters is determined by identifying the point where the marginal gain
in the reduction of variance begins to diminish significantly. The first
substantial drop in variance occurs at \(k=2\), reducing it to 0.74; the
slope continues its significant drop at \(k=3\), reducing the explained
variance by a further 0.13. Another distinct inflection point becomes
visible around \(k=4\) and \(k=5\) – the transition from \(k=4\) to
\(k=5\) still offers a noticeable reduction in variance. However, beyond
\(k=5\), the slope flattens, suggesting that subsequent clusters provide
diminishing returns in terms of explaining the data structure.
Consequently, the elbow method produces ambiguous results, prompting
further exploration.
Optimal_Clusters_KMeans(music_scaled, max_clusters = 10, plot_clusters = T)

## [1] 1.0000000 0.7357964 0.6084016 0.5195449 0.4633887 0.4416606 0.3984118
## [8] 0.3738682 0.3581979 0.3313506
Silhouette
Exploratory analysis of the silhouette value to determine the optimal
number of clusters works as though the data had already been separated
into a given number of clusters using a specified clustering method
(here k-means is used). The analysis of the silhouette plot reveals
\(k=4\) as the optimal number of clusters. Its silhouette value is,
however, only marginally higher than that of \(k=2\), which would offer
significantly lower practical utility. Therefore, this analysis will
implement \(k=4\) as the optimal parameter selection.
Optimal_Clusters_KMeans(music_scaled, max_clusters = 10, plot_clusters = T, criterion = "silhouette")

## [1] 0.0000000 0.2740422 0.2639991 0.2831830 0.2045231 0.2243158 0.2000706
## [8] 0.1906235 0.1866816 0.1862293
Calinski-Harabasz Index
The Calinski-Harabasz Index, an internal validation metric,
measures the ratio of between-cluster variance to within-cluster
variance. The index was calculated using Ward’s linkage method and
suggests \(k=2\) as the optimal number of clusters. This result,
however, is inconsistent with the previously cited literature, as well
as with market logic: while statistically sound, it would not provide
informative results.
NbClust(music_scaled, distance = 'euclidean', min.nc = 2, max.nc = 10, method = 'ward.D', index = 'ch')
## $All.index
## 2 3 4 5 6 7 8 9
## 2585.229 2286.134 2176.760 2012.294 1927.729 1975.745 1836.711 1717.333
## 10
## 1674.566
##
## $Best.nc
## Number_clusters Value_Index
## 2.000 2585.229
##
## $Best.partition
## [ vector of cluster assignments (1 or 2) for the 9,995 sampled users - lengthy output omitted ]
Internal Validation
For the same sample of data for which the Hopkins statistic was
previously computed, internal validation metrics – connectivity, the
Dunn index and silhouette width – are calculated. Internal validation
evaluates the quality of clusters using only the dataset itself. The
metrics are computed for the K-Means, hierarchical and CLARA clustering
methods (PAM was deemed too inefficient for a dataset of this size).
The results suggest that hierarchical clustering is the optimal method
for the examined dataset, significantly outperforming partitioning
methods such as K-Means and CLARA. The optimal number of clusters is
identified, again, as \(k=2\), achieving the highest Silhouette Width
and Dunn Index – suggesting, respectively, a reasonable structure and
distinct cluster separation. Furthermore, the Connectivity score for
hierarchical clustering was drastically lower (by over 160 points) than
for either K-Means or CLARA, confirming superior local compactness.
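For reference: the silhouette width of an observation \(i\) compares
its average distance to the members of its own cluster, \(a(i)\), with
its average distance to the nearest other cluster, \(b(i)\):
\[ s(i) = \frac{b(i) - a(i)}{\max\{a(i),\, b(i)\}}, \qquad -1 \le s(i) \le 1. \]
The Dunn index is the ratio of the smallest between-cluster distance to
the largest within-cluster diameter, so for both measures higher values
indicate compact, well-separated clusters. Connectivity, by contrast,
reflects how often observations and their nearest neighbors land in
different clusters, so lower values are better.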
# internal validation of K-Means, hierarchical and CLARA for k = 2..10
methods <- c('kmeans', 'hierarchical', 'clara')
validation <- clValid(data_sample, nClust = 2:10, clMethods = methods,
                      validation = 'internal', maxitems = 100000)
## Warning in clValid(data_sample, nClust = 2:10, clMethods = methods, validation
## = "internal", : rownames for data not specified, using 1:nrow(data)
summary(validation)
##
## Clustering Methods:
## kmeans hierarchical clara
##
## Cluster sizes:
## 2 3 4 5 6 7 8 9 10
##
## Validation Measures:
## 2 3 4 5 6 7 8 9 10
##
## kmeans Connectivity 176.8560 258.2147 290.2690 384.2889 423.6881 499.3171 468.0488 494.7861 511.9230
## Dunn 0.0332 0.0295 0.0343 0.0334 0.0369 0.0386 0.0273 0.0258 0.0323
## Silhouette 0.2699 0.2650 0.2685 0.2109 0.2062 0.1992 0.2007 0.1958 0.2013
## hierarchical Connectivity 7.4659 7.4659 33.5329 34.4119 77.9595 84.7270 88.6738 131.6278 132.6540
## Dunn 0.2526 0.2526 0.0946 0.0946 0.0967 0.0967 0.0967 0.0935 0.0935
## Silhouette 0.5870 0.4941 0.3681 0.3050 0.2056 0.1763 0.1642 0.1300 0.1014
## clara Connectivity 196.3353 293.0639 362.8123 478.9690 490.7905 563.6849 653.9552 640.4742 697.4786
## Dunn 0.0268 0.0239 0.0251 0.0297 0.0257 0.0277 0.0194 0.0333 0.0333
## Silhouette 0.2285 0.2240 0.2149 0.1580 0.1701 0.1640 0.1299 0.1512 0.1348
##
## Optimal Scores:
##
## Score Method Clusters
## Connectivity 7.4659 hierarchical 2
## Dunn 0.2526 hierarchical 2
## Silhouette 0.5870 hierarchical 2
Clustering
In this section, various clustering algorithms are implemented and
analyzed in terms of both their statistical purity and their practical
utility – whether the resulting consumer segmentation provides helpful
insight for recommendation systems or market research. Despite the
conflicting results of the various metrics, the optimal number of
clusters is set at \(k=4\) – the value with the highest Silhouette
Width among those coinciding closely with the optimal numbers of
clusters reported in the field’s literature. However, it is not only
the optimal number of clusters that remains inconclusive – various
clustering methods have been used for similar purposes in the
previously cited literature. Having already eliminated PAM as
inefficient for such a large dataset, CLARA as redundant according to
the internal validation metrics, and density-based methods (DBSCAN) as
unfit for the highly dense data structure of this project, this study
focuses on two distinct approaches:
- K-Means – the approach most widely used in the literature on music
recommendation systems,
- Hierarchical Clustering – the approach deemed superior across the
internal validation metrics.
K-Means Clustering
The K-Means algorithm assigns every data point to one of \(k\)
pre-defined clusters. It works as an optimization problem whose
objective is to find the \(k\) cluster centers and assign each
observation to its nearest center, so that the within-cluster
differences – measured by the squared distances from the cluster
centers – are minimized and the differences between clusters are
maximized. The assignments are then updated iteratively, the cluster
boundaries adjusting to the current contents of the clusters, until
the assignments no longer change. Clustering is performed for
\(k=4\).
k4 <- kmeans(music_scaled, 4)
k4$size
## [1] 2550 1811 5286 348
The data is split into 4 clusters – the number suggested by the
Silhouette Width – of clearly unequal sizes, with 2550, 1811, 5286 and
348 observations respectively.
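The quantity that the algorithm minimizes can be read directly off the
fitted object – a quick check using only the standard fields returned
by kmeans():
# total within-cluster sum of squares - the objective K-Means minimizes
k4$tot.withinss
# share of total variance captured between clusters - a rough
# "variance explained" figure for the k = 4 partition
k4$betweenss / k4$totss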
Based on the number of observations in each cluster and the
visualization of the cluster map, cluster 4 is mostly comprised of
outlying observations – probably the heavy users whose large
listening density values were previously illustrated on the
scatter plot. Although there are only a few of them, their impact on
the feature space is enormous, effectively pulling the cluster boundary
outwards. Capturing the extreme variance introduced by listening
density stabilized the centroids of the remaining clusters, resulting
in interpretable segments within this highly dense dataset. The
dominant group, represented by the third cluster, constitutes the
majority of the sample and most likely represents casual or mainstream
listeners. Clusters 3 and 1 both form compact, dense regions in the
center of the plot, overlapping slightly – which suggests lower
Silhouette scores. Cluster 2 occupies a distinct region, suggesting
better separation than the central groups.
fviz_cluster(k4, data = music_scaled,
             geom = 'point',
             ellipse.type = 'convex',
             ggtheme = theme_minimal(),
             main = "Cluster Map",
             palette = 'jco')

k4$centers
## genre_entropy novelty_score mainstreaminess_score diversity_score
## 1 0.2222123 1.0369098 0.06245614 1.18744557
## 2 -1.3604383 0.2540524 -1.36958033 0.03674786
## 3 0.3704681 -0.5397671 0.44468660 -0.52517128
## 4 -0.1758105 -0.7212638 -0.08496139 -0.91517587
## listening_density
## 1 -0.46882357
## 2 -0.12264418
## 3 0.02310175
## 4 3.72268072
The cluster centers provide clear interpretability – cluster 1
represents the explorers of the dataset, with high novelty and
diversity scores. Cluster 2 exhibits strikingly low values of genre
entropy and mainstreaminess score – representing users with focused
and niche music preferences. Cluster 3 does not display any specific
deviations from the average, suggesting it captures the average user –
fairly mainstream and inclined towards a moderate number of genres,
without a tendency for diversity or novelty – a cluster of casual
listeners, representing the majority of users in the sample. Cluster 4,
however, displays an extreme value of listening density – encompassing
the outlying heavy users.
The differences between the clusters are better illustrated by a snake
plot, portraying, for each cluster center, the deviation of every
variable from the sample average.
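A minimal sketch of how such a snake plot can be produced – assuming
dplyr, tidyr and ggplot2 are loaded, and reshaping the standardized
centers stored in k4$centers:
# reshape the centroid matrix to long format: one row per
# (cluster, variable) pair
centers_long <- as.data.frame(k4$centers) %>%
  mutate(cluster = factor(row_number())) %>%
  pivot_longer(-cluster, names_to = 'variable', values_to = 'z_score')
# trace each cluster's profile across the standardized variables
ggplot(centers_long, aes(x = variable, y = z_score,
                         group = cluster, color = cluster)) +
  geom_line() +
  geom_point(size = 2) +
  labs(title = 'Snake Plot of Cluster Centers',
       x = NULL, y = 'Deviation from average (z-score)') +
  theme_minimal()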

The average silhouette width equals 0.28, which indicates a
relatively weak structure with noticeable overlap between clusters.
While a higher score would certainly be preferable, this is the
partition that maximized the silhouette width. It is important to note
that in behavioral segmentation involving continuous psychometric
variables, such as mainstreaminess or novelty, the boundaries may
naturally be fuzzy rather than rigid. Cluster 3 shows the most stable
structure, with positive silhouette values – confirming casual users
as the core, cohesive group of the dataset. Cluster 4 exhibits high
variability with negative silhouette values, which is mathematically
expected for an “outlier cluster” – the heavy users with extreme
listening habits are distant from the center and highly dispersed from
each other. The lower silhouette score is a trade-off accepted to gain
detailed behavioral insight rather than a statistically pure but
practically trivial split.
s4 <- silhouette(k4$cluster, dist(music_scaled))
fviz_silhouette(s4)
## cluster size ave.sil.width
## 1 1 2550 0.27
## 2 2 1811 0.18
## 3 3 5286 0.33
## 4 4 348 0.16

Hierarchical Clustering
Hierarchical clustering (HCA) is a method of cluster analysis that
does not require a pre-defined number of clusters, which allows for
further exploration of the natural segmentation within the data. This
project utilizes the agglomerative, or bottom-up, approach to
hierarchical clustering, which begins with each data point in its own
cluster. With each iteration, the algorithm merges the two most
similar clusters based on a chosen distance metric (here, Euclidean
distance) and linkage criterion, until all of the data points form a
single cluster.
# drawing a sample
set.seed(123)
sample_index <- sample(1:nrow(music_scaled), 4000)
music_sample <- music_scaled[sample_index, ]
# defining methods
m <- c("average", "single", "complete", "ward")
names(m) <- c("average", "single", "complete", "ward")
ac_coeff <- function(x) {
  agnes(music_sample, method = x)$ac
}
results <- map_dbl(m, ac_coeff)
sort(results, decreasing = TRUE)
## ward complete average single
## 0.9954920 0.9832191 0.9824863 0.9610489
In order to determine the most efficient linkage method, the
Agglomerative Coefficient is computed for the average, single, complete
and Ward linkage methods (due to computational constraints for the full
dataset used in this project, a random sample of 4000 records is
drawn). The Ward linkage method exhibits the highest value of the
Agglomerative Coefficient and is hence chosen as the linkage method for
this dataset. Ward’s method works by minimizing the total
within-cluster sum of squares.
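Formally, at each step Ward’s criterion merges the pair of clusters
\(A\) and \(B\) whose fusion yields the smallest increase in that
total; for cluster sizes \(n_A, n_B\) and centroids \(\bar{x}_A,
\bar{x}_B\) this increase equals
\[ \Delta(A, B) = \frac{n_A n_B}{n_A + n_B} \, \lVert \bar{x}_A - \bar{x}_B \rVert^2 . \]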
d <- dist(music_scaled, method = "euclidean")
hc1 <- hclust(d, method = "ward.D2")
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram - Ward Method")

Despite the suggestion of the internal validation metrics to perform
hierarchical clustering with \(k=2\), I firmly believe that such a
division would be too coarse to provide practical interpretation.
Furthermore, visual analysis of the dendrogram reveals a clear
dichotomy in the user base – most likely separating heavy users from
casual listeners – following the tree down, however, both of those
branches split again at a height between 80 and 110. The vertical
length of these four branches before they subdivide further,
especially the ones marked in red, dark blue and cyan, is substantial.
This indicates a high level of separation stability and provides
empirical evidence to retain \(k=4\) as the optimal segmentation
level.
cut_hc1 <- hcut(music_scaled, k = 4, hc_method = 'ward.D2')
plot(hc1, cex = 0.6, hang = -1, main = "Dendrogram - Ward Method; k=4")
rect.hclust(hc1, k = 4, border = 2:5)

hc_clusters <- cutree(hc1, k=4)
Similarly to the results obtained from K-Means clustering, cluster 4
contains the fewest observations, as can be seen on the cluster map
plotted below – once again it encompasses the outlying values. This
group was slightly larger in K-Means, which likely inflated it by
pulling in moderate users. Cluster 3 remains the dominant group,
which, compared to the results of K-Means, has grown even larger –
containing approximately 63% of all users in the dataset. While such a
finding seems realistic for a streaming platform, where the vast
majority of users are, indeed, average consumers, it may provide
little to no practical insight and might lower the silhouette score
even further. Clusters 1 and 2, with 1503 and 1915 users respectively,
represent substantial, specific subcultures within the audience.
table(hc_clusters)
## hc_clusters
## 1 2 3 4
## 1503 1915 6288 289
fviz_cluster(list(data = music_scaled, cluster = hc_clusters),
             geom = "point",
             ellipse.type = "convex",
             palette = "jco",
             ggtheme = theme_minimal(),
             main = "Cluster Map (Hierarchical Clustering)")
The four user archetypes based on the centroid data are as follows:
- Cluster 1: the niche user – genre entropy and mainstreaminess reach
the lowest values in the dataset – these users listen to a narrow set
of genres and actively avoid popular music.
- Cluster 2: the explorer – achieving the highest diversity and
novelty scores, these are users who listen to a vast variety of tracks
with an inclination towards novel sounds; unlike the first group, they
switch genres frequently.
- Cluster 3: the casual user – in this group all of the variables
oscillate near the average. This group represents the baseline users,
accepting the mainstream, listening to a standard variety of music and
consuming moderately.
- Cluster 4: the heavy user – a group defined by a drastic deviation
in listening density; these are the focused (low diversity) and heavy
listeners.
temp_data <- as.data.frame(music_scaled)
temp_data$h_cluster <- as.factor(cut_hc1$cluster)
hc_centers <- aggregate(.~ h_cluster, data = temp_data, mean)
print(hc_centers)
## h_cluster genre_entropy novelty_score mainstreaminess_score diversity_score
## 1 1 -1.51245157 0.4754816 -1.4348975 0.2784618
## 2 2 0.26743775 1.0341621 0.2096416 1.3358587
## 3 3 0.27655104 -0.3968652 0.2724522 -0.4294425
## 4 4 0.07653445 -0.6905918 0.1453566 -0.9562743
## listening_density
## 1 -0.22945128
## 2 -0.54802322
## 3 0.04103353
## 4 3.93187164

The average silhouette width equals 0.25 – a value lower than the one
obtained in the K-Means analysis. This time it is the second cluster
that displays the most stable structure – although not every
observation in it reaches a positive score, its average silhouette
width is the highest among the clusters. Once again, cluster 4
exhibits high variability. The lower silhouette score yielded by
hierarchical clustering, with no additional practical interpretation
of the dataset to offer, leads to the selection of the K-Means
algorithm as the one that suits the data analyzed in this project
best.
sil_hc <- silhouette(hc_clusters, d)
fviz_silhouette(sil_hc)
## cluster size ave.sil.width
## 1 1 1503 0.14
## 2 2 1915 0.32
## 3 3 6288 0.26
## 4 4 289 0.16

Interpretation
Further interpretation of the obtained segmentation proceeds with the
results of K-Means clustering. In this section, the prior,
unstandardized data is utilized and the clusters are analyzed in terms
of their demographic distinctions.
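As a first step, the cluster labels can be attached to the
unstandardized data and the clusters profiled on the raw behavioral
metrics – a minimal sketch, assuming final_data holds the unscaled
counterparts of the variables that were clustered:
# attach K-Means labels and average the raw behavioral metrics
# (column names assumed to match those used for music_scaled)
final_data$cluster <- as.factor(k4$cluster)
final_data %>%
  group_by(cluster) %>%
  summarise(across(c(genre_entropy, novelty_score, mainstreaminess_score,
                     diversity_score, listening_density),
                   ~ mean(.x, na.rm = TRUE)))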
Cluster Profiles
The final user profiles rendered by each cluster are plotted below.
The algorithm differentiated four distinct user archetypes, which I
classified as follows:
- Explorer – characterized by high novelty and discovery scores with
relatively low listening density. These are users inclined towards
novel sounds, engaging with a wide variety of genres and not shying
away from the mainstream. In terms of music recommendation algorithms,
these are the users who may benefit the most from a blend of a
personalized approach and mainstream promotion – a group of users who
not only consume content but actively discover what the platform has
to offer.
- Niche users – defined mostly by their low mainstreaminess scores and
genre entropy – these users avoid popular music and listen to a
compact set of genres, diversifying their consumption mostly among
different tracks of the same genre (high diversity score). These are
users who have already found their niche; they know exactly what they
want and they stick to it – a beneficial recommendation system would
need to be highly personalized.
- Casual listeners – this group encompasses the majority of the sample
– highly mainstream with relatively low novelty and diversity – these
are users who listen to the various genres and songs that are popular
at the moment.
- Heavy users – the outliers of the dataset – a group characterized by
extremely high listening density and low diversity scores. The
focused, heavy users – a group encompassing either obsessive listeners
or people and institutions using music as background noise (retail
stores and office backgrounds) – super-streamers.

Age in Clusters
Explorers and niche users are almost identical in terms of age
structure – the average age of users in both clusters is around 26 and
the median age is 24. Casual listeners tend to be younger – the
average age in this cluster is just under 24, while the median is 22.
Heavy listeners occupy the middle ground between the other clusters –
the average age is 25 and the median is 23. The sample appears to be
comprised mostly of young adults, with only subtle differences in the
age distributions.
final_data$cluster <- as.factor(k4$cluster)
age_profile <- final_data %>%
  group_by(cluster) %>%
  summarise(
    Avg_Age = mean(age, na.rm = TRUE),
    Median_Age = median(age, na.rm = TRUE),
    Count = n()
  )
print(age_profile)
## # A tibble: 4 × 4
## cluster Avg_Age Median_Age Count
## <fct> <dbl> <dbl> <int>
## 1 1 26.6 24 2550
## 2 2 26.4 24 1811
## 3 3 23.8 22 5286
## 4 4 25.4 23 348
Visually, all four clusters exhibit an almost identical right-skewed
distribution. The vast majority of users are concentrated around their
early 20s. However, there is a long tail reaching up to 75+ years,
suggesting that, while the platform is dominated by young adults,
there is a small but present segment of older users in every category.
In all clusters, the median user’s age oscillates around 22-24 years,
while the mean age is consistently positioned to the right of the
median, due to the skew. The consistency of the gap between the mean
and the median across all of the groups suggests that age may not be
the driving force behind listening habits.
my_cols <- c("1" = "skyblue",
             "2" = "darkseagreen",
             "3" = "purple",
             "4" = "pink")
cluster_names <- c("1" = "C1: Explorer",
                   "2" = "C2: Niche",
                   "3" = "C3: Casual",
                   "4" = "C4: Heavy")
ggplot(final_data, aes(x = age, fill = cluster)) +
  geom_histogram(binwidth = 1, color = "white", alpha = 0.9) +
  geom_vline(data = age_profile, aes(xintercept = Avg_Age),
             linetype = "dashed", color = "red", size = 0.5) +
  geom_vline(data = age_profile, aes(xintercept = Median_Age),
             linetype = "solid", color = "navy", size = 0.3) +
  facet_wrap(~cluster, scales = "free_y", labeller = as_labeller(cluster_names)) +
  scale_fill_manual(values = my_cols) +
  labs(title = "Age Profile by Cluster",
       x = "Age",
       y = "Count of Users") +
  theme_minimal() +
  theme(legend.position = "none",
        strip.text = element_text(size = 12, face = "bold"))

The analysis of age across user archetypes reveals remarkable age
homogeneity across all four clusters, which may suggest that, within
this dataset, users’ habits are defined not by the listener’s age but
rather by psychological or lifestyle-based factors. Consequently,
marketing strategies and recommendation systems would need to rely on
behavioral targeting rather than demographic segmentation alone.
Gender Analysis
The dataset used in this study is heavily male-dominated, as
illustrated in the plot below. The behaviors associated with intense
or obsessive listening appear to be more strongly male-correlated –
the niche and heavy clusters are both more than 75% male. The cluster
with the highest proportion of women is the casual listener one,
followed by the explorer. The gender gap narrows slightly as the
listening habits become more diversified or casual. However, it is
hard to determine whether these disproportions reveal a natural
behavioral tendency of female users towards mainstream music, or
whether they merely result from a phenomenon described in the
literature as the popularity bias affecting female users, imposed
and intensified by unfairly biased recommender systems (Lesota et
al. 2021).
gender_profile <- final_data %>%
  group_by(cluster, gender) %>%
  summarise(Count = n()) %>%
  mutate(Percentage = Count / sum(Count) * 100)
## `summarise()` has grouped output by 'cluster'. You can override using the
## `.groups` argument.
# gender plot for clusters
ggplot(gender_profile, aes(x = cluster, y = Percentage, fill = gender)) +
  geom_col(position = "fill") +
  geom_text(aes(label = Count),
            position = position_fill(vjust = 0.5),
            color = "white",
            size = 3,
            fontface = "bold") +
  scale_y_continuous(labels = scales::percent) +
  labs(title = "Gender Structure in Clusters", y = "Percent") +
  theme_minimal()
Global Distribution
The map illustrating the dominant cluster in each country reveals
that music consumption habits are highly globalized – the average
users in the vast majority of the Global North behave similarly. While
niche and explorer behaviors exist, they are not the dominant trait of
large nations, but rather specific segments scattered across the
world. The scattered countries on the map dominated by the explorer,
niche and heavy user clusters may illustrate instances of
small-sample-size artifacts.
country_domination <- final_data %>%
  group_by(country, cluster) %>%
  summarise(count = n(), .groups = 'drop') %>%
  group_by(country) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup()
country_domination$region <- countrycode(sourcevar = country_domination$country,
                                         origin = "iso2c",
                                         destination = "country.name")
country_domination$region[country_domination$region == "United States"] <- "USA"
country_domination$region[country_domination$region == "United Kingdom"] <- "UK"
country_domination$region[country_domination$country == "UK"] <- "UK"
world_map <- map_data("world")
map_data_joined <- left_join(world_map, country_domination, by = "region")
map_data_joined$cluster <- as.factor(map_data_joined$cluster)
my_colors <- c("1" = "skyblue",
               "2" = "darkseagreen",
               "3" = "purple",
               "4" = "pink")
ggplot(map_data_joined, aes(x = long, y = lat, group = group)) +
  geom_polygon(aes(fill = cluster), color = "white", size = 0.1) +
  scale_fill_manual(values = my_colors, na.value = "gray",
                    name = "Dominant Cluster",
                    labels = c("C1: Explorer", "C2: Niche", "C3: Casual", "C4: Heavy")) +
  theme_void() +
  labs(title = "Dominant User Profile by Country") +
  theme(legend.position = "bottom")

The most common country across almost all of the clusters is the USA
(likely due to the dataset’s skew towards North American users), with
only one exception – the niche users’ cluster is dominated by Russian
users. This phenomenon can be interpreted through the lens of
algorithmic cultural bias – since mainstreaminess is typically
defined by global, hence Western-centric, charts, users consuming
regionally popular content are mathematically classified as
niche by the model. What appears to be niche behavior might as
well represent a distinct, most likely non-Western, cultural ecosystem
rather than obscure taste.
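The per-cluster summary below can be reproduced along these lines – a
sketch, reusing the cluster_names vector defined earlier and dropping
users with a missing country:
# most common country per cluster and its within-cluster share
# (country names may need recoding, e.g. "United States" -> "USA",
# as was done for the map above)
final_data %>%
  filter(!is.na(country)) %>%
  count(cluster, country, name = "count") %>%
  group_by(cluster) %>%
  mutate(percent = count / sum(count) * 100) %>%
  slice_max(count, n = 1, with_ties = FALSE) %>%
  ungroup() %>%
  transmute(cluster_name = cluster_names[as.character(cluster)],
            country_name = countrycode(country, "iso2c", "country.name"),
            count, percent)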
## # A tibble: 4 × 4
## cluster_name country_name count percent
## <fct> <chr> <int> <dbl>
## 1 C1: Explorer USA 580 22.7
## 2 C2: Niche Russia 264 14.6
## 3 C3: Casual USA 873 16.5
## 4 C4: Heavy USA 57 16.4
Bibliography
Anderson, C., Nissley, C., & Anderson, C. (2006, April). The long
tail.
Kużelewska, U., & Wichowski, K. (2015, June). A modified
clustering algorithm DBSCAN used in a collaborative filtering
recommender system for music recommendation. In International conference
on dependability and complex systems (pp. 245-254). Cham: Springer
International Publishing.
Lesota, O., Melchiorre, A., Rekabsaz, N., Brandl, S., Kowald, D.,
Lex, E., & Schedl, M. (2021, September). Analyzing item popularity
bias of music recommender systems: Are different genders equally
affected? In Proceedings of the 15th ACM Conference on Recommender
Systems (pp. 601-606).
Marques, A., Andrade, N., & Balby, L. (2013). Exploring the
relation between novelty aspects and preferences in music listening. In
Proc. ISMIR.
Schedl, M., & Hauger, D. (2015, August). Tailoring music
recommendations to users by considering diversity, mainstreaminess, and
novelty. In Proceedings of the 38th International ACM SIGIR Conference
on Research and Development in Information Retrieval (pp. 947-950).
Schedl, M., Brandl, S., Lesota, O., Parada-Cabaleiro, E., Penz, D.,
& Rekabsaz, N. (2022, March). LFM-2b: A dataset of enriched music
listening events for recommender systems research and fairness analysis.
In Proceedings of the 2022 Conference on Human Information Interaction
and Retrieval (pp. 337-341).
Sousa, A. M., Almeida, J. M., & Figueiredo, F. (2019, August).
Analyzing and modeling user curiosity in online content consumption: a
LastFM case study. In Proceedings of the 2019 IEEE/ACM International
Conference on Advances in Social Networks Analysis and Mining
(pp. 426-431).
Zhang, Z., & Chang, J. (2022). Clustering-based Categorization of
Music Users through Unsupervised Learning. Economics & Management
Information, 1-8.