1. Project Introduction

This project aims to analyze music playlist data to explore the “hidden associations” between different artists. In this study, each user’s playlist or listening record is treated as a “Transaction,” while the artists are treated as “Items.” By utilizing association rule algorithms (Apriori and Eclat), this research seeks to answer the following questions:

1. The “Genre Anchor” Query: Are listeners of modern alternative icons like Coldplay significantly more likely than the average user to also listen to The Killers?

2. The Overlap Mapping: Which artist clusters exhibit the highest degree of fanbase overlap, and do these clusters transcend traditional genre boundaries (e.g., Grunge vs. Metal)?

3. Algorithmic Discovery: Can the Apriori algorithm uncover “hidden bridges” between artists (such as the link between Muse and Radiohead) that simple popularity charts might overlook?

Core Objectives:

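1.1 Robust data loading

Reads the first 500,000 lines as raw text and splits them on tabs, so rows with missing or extra fields cannot break parsing.

# --- 1.1 Robust Data Loading ---
library(readr)   # read_lines()
library(tibble)  # tibble()
library(tidyr)   # separate()
library(dplyr)   # %>%, filter(), mutate(), group_by(), summarise()
library(arules)  # transactions, apriori(), eclat()

file_path <- "C:/Users/JasmineJiang/Desktop/s3/2b.USL/#2.project_association/lastfm-dataset-360K/lastfm-dataset-360K/usersha1-artmbid-artname-plays.tsv"
# Read at most the first 500,000 lines as plain text
raw_lines <- read_lines(file_path, n_max = 500000)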

# Convert to a dataframe by splitting by Tab, but handling inconsistent columns
df_cleaned <- tibble(raw = raw_lines) %>%
  separate(raw, 
           into = c("user_id", "artist_mbid", "artist_name", "plays"), 
           sep = "\t", 
           extra = "drop",  # Drop extra segments if a line has > 4 tabs
           fill = "right")  # Fill with NA if a line has < 4 tabs
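
To see how `extra = "drop"` and `fill = "right"` behave, the following minimal sketch applies the same `separate()` call to three made-up lines. The values in `toy_lines` are hypothetical and not taken from the dataset: one line is well-formed, one is missing the play count, and one carries a stray fifth field.

```r
library(tidyr)

# Hypothetical raw lines, for illustration only
toy_lines <- c(
  "u1\tm1\tmuse\t10",          # well-formed: 4 fields
  "u2\tm2\tcoldplay",          # too short: plays is missing
  "u3\tm3\tnirvana\t5\textra"  # too long: 5 fields
)

toy_df <- separate(
  data.frame(raw = toy_lines, stringsAsFactors = FALSE),
  raw,
  into  = c("user_id", "artist_mbid", "artist_name", "plays"),
  sep   = "\t",
  extra = "drop",   # the stray fifth field is discarded
  fill  = "right"   # the missing play count becomes NA
)

print(toy_df)
```

All three lines survive as rows; the short one is left with an NA play count, which a later filtering step can remove.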

1.2 Data cleaning and validation

Filters out incomplete records and ensures plays is numeric so downstream mining uses valid observations.

# --- 1.2 Data Cleaning & Validation ---
# 1. Remove rows where artist_name or user_id is NA
# 2. Convert plays to numeric (invalid ones become NA)
df_cleaned <- df_cleaned %>%
  filter(!is.na(user_id) & !is.na(artist_name) & artist_name != "") %>%
  mutate(plays = as.numeric(plays)) %>%
  filter(!is.na(plays)) # Ensure the last column was actually a number

1.3 Targeted artist selection

Identifies the Top 10 most frequently occurring artists to keep the downstream analysis focused and interpretable.

# --- 1.3 Targeted Artist Selection ---
# Focus on the 10 artists that appear in the most rows (each row is one user-artist pair)
# A manual list also works: target_artists <- c("Taylor Swift", "Drake", ...)
top_10_artists <- df_cleaned %>%
  group_by(artist_name) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) %>%
  slice(1:10) %>%
  pull(artist_name)

cat("Top 10 Artists identified:\n", paste(top_10_artists, collapse = ", "), "\n")
## Top 10 Artists identified:
##  radiohead, the beatles, coldplay, red hot chili peppers, muse, metallica, pink floyd, nirvana, the killers, linkin park

1.4 Sampling for analysis (80k cap)

Restricts the data to rows for the Top 10 artists and caps the working dataset at 80,000 rows for practicality. Note that head() keeps the first 80,000 matching rows in file order, so this is a truncation rather than a random sample.

# --- 1.4 Sampling for Analysis (the 80,000-row limit) ---
# Keep only rows for the top 10 artists, then keep at most the first 80,000 rows
final_df <- df_cleaned %>%
  filter(artist_name %in% top_10_artists) %>%
  head(80000)
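
Because `head()` keeps rows in file order, the cap above truncates the data instead of sampling it. If a random subset were preferred, one option is to sample whole users so that no playlist is cut short. This is an alternative sketched here under assumptions, not what the analysis above does; `toy_df`, the seed, and `n_users` are hypothetical.

```r
set.seed(42)  # arbitrary seed, only so the sketch is reproducible

# Hypothetical long-format data: one row per (user, artist) pair
toy_df <- data.frame(
  user_id     = c("u1", "u1", "u2", "u3", "u3", "u3", "u4", "u5", "u5"),
  artist_name = c("muse", "radiohead", "coldplay", "nirvana", "metallica",
                  "muse", "coldplay", "radiohead", "the beatles"),
  stringsAsFactors = FALSE
)

n_users <- 3  # hypothetical number of users to keep
keep_users <- sample(unique(toy_df$user_id), n_users)

# Keep every row belonging to a sampled user, so playlists stay complete
sampled_df <- toy_df[toy_df$user_id %in% keep_users, ]
print(sampled_df)
```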

1.5 Convert to transactions

Builds per-user “playlists” and converts them into transactions (market baskets), removing users with only a single unique artist.

# --- 1.5 Convert to Transactions ---
# Convert the long-format data into a list of "shopping baskets" (playlists)
playlist_list <- split(final_df$artist_name, final_df$user_id)

# Remove users who only listen to 1 artist (they don't contribute to "overlap")
playlist_list <- playlist_list[sapply(playlist_list, function(x) length(unique(x)) > 1)]

# Final conversion to arules transaction format
trans <- as(lapply(playlist_list, unique), "transactions")

# Check the result
print(trans)
## transactions in sparse format with
##  4037 transactions (rows) and
##  10 items (columns)
summary(trans)
## transactions as itemMatrix in sparse format with
##  4037 rows (elements/itemsets/transactions) and
##  10 columns (items) and a density of 0.3045578 
## 
## most frequent items:
##             radiohead           the beatles              coldplay 
##                  1725                  1702                  1598 
## red hot chili peppers                  muse               (Other) 
##                  1229                  1169                  4872 
## 
## element (itemset/transaction) length distribution:
## sizes
##    2    3    4    5    6    7    8    9   10 
## 1766 1125  648  300  122   52   19    2    3 
## 
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.000   2.000   3.000   3.046   4.000  10.000 
## 
## includes extended item information - examples:
##        labels
## 1    coldplay
## 2 linkin park
## 3   metallica
## 
## includes extended transaction information - examples:
##                              transactionID
## 1 00000c289a1829a808ac09c00daf10bc3c4e223b
## 2 0000c176103e538d5c9828e695fed4f7ae42dd01
## 3 0000ef373bbd0d89ce796abae961f2705e8c1faf
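
The basket-building logic of step 1.5 is easier to follow on a tiny table. The sketch below repeats the same `split()`, filter, and `unique()` steps in base R on hypothetical rows (`toy_df` is made up for illustration) and stops before the conversion to an arules object.

```r
# Hypothetical long-format rows: one row per (user, artist) pair
toy_df <- data.frame(
  user_id     = c("u1", "u1", "u1", "u2", "u3", "u3"),
  artist_name = c("muse", "radiohead", "muse", "coldplay", "nirvana", "metallica"),
  stringsAsFactors = FALSE
)

# One character vector of artists per user
toy_baskets <- split(toy_df$artist_name, toy_df$user_id)

# Drop users with fewer than two distinct artists (u2 here)
toy_baskets <- toy_baskets[sapply(toy_baskets, function(x) length(unique(x)) > 1)]

# De-duplicate within each basket (u1 lists muse twice)
toy_baskets <- lapply(toy_baskets, unique)
str(toy_baskets)
```

Only u1 and u3 remain, each with two distinct artists, which is the shape that the conversion to transactions expects.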

2. Analysis Method 1: Apriori Algorithm

2.1 Run Apriori

Runs the Apriori algorithm with specified support, confidence, and minimum rule length thresholds.

# 2. Analysis Method 1: Apriori Algorithm
# --- 2.1 Run Apriori Algorithm ---
# supp: minimum share of transactions (users) that contain every artist in the rule
# conf: minimum estimated probability that a user listens to the RHS given the LHS
# minlen = 2: skip rules with an empty LHS
rules_apriori <- apriori(trans, 
                         parameter = list(supp = 0.05, conf = 0.3, minlen = 2))
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.05      2
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 201 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 4037 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [59 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].

2.2 Inspect top rules

Sorts rules by lift (association strength) and inspects the top candidates.

# --- 2.2 Inspect Top Rules ---
# Sort by 'lift' which indicates the strength of the relationship
rules_sorted <- sort(rules_apriori, by = "lift", decreasing = TRUE)
inspect(head(rules_sorted,30))
##      lhs                                   rhs                     support   
## [1]  {coldplay, muse}                   => {the killers}           0.05226653
## [2]  {radiohead, the killers}           => {coldplay}              0.05523904
## [3]  {metallica}                        => {nirvana}               0.08595492
## [4]  {nirvana}                          => {metallica}             0.08595492
## [5]  {the killers}                      => {coldplay}              0.13450582
## [6]  {coldplay}                         => {the killers}           0.13450582
## [7]  {linkin park}                      => {metallica}             0.06886302
## [8]  {muse, the killers}                => {coldplay}              0.05226653
## [9]  {muse, the beatles}                => {radiohead}             0.05325737
## [10] {muse}                             => {the killers}           0.09611097
## [11] {the killers}                      => {muse}                  0.09611097
## [12] {pink floyd, radiohead}            => {the beatles}           0.06539510
## [13] {coldplay, the killers}            => {muse}                  0.05226653
## [14] {red hot chili peppers}            => {nirvana}               0.10255140
## [15] {nirvana}                          => {red hot chili peppers} 0.10255140
## [16] {pink floyd}                       => {metallica}             0.08570721
## [17] {metallica}                        => {pink floyd}            0.08570721
## [18] {metallica}                        => {red hot chili peppers} 0.09635868
## [19] {red hot chili peppers}            => {metallica}             0.09635868
## [20] {coldplay, radiohead}              => {muse}                  0.06638593
## [21] {coldplay, muse}                   => {radiohead}             0.06638593
## [22] {coldplay, radiohead}              => {the killers}           0.05523904
## [23] {radiohead, red hot chili peppers} => {coldplay}              0.05127570
## [24] {the beatles}                      => {pink floyd}            0.13846916
## [25] {pink floyd}                       => {the beatles}           0.13846916
## [26] {radiohead, the beatles}           => {pink floyd}            0.06539510
## [27] {linkin park}                      => {red hot chili peppers} 0.07579886
## [28] {muse, radiohead}                  => {coldplay}              0.06638593
## [29] {coldplay, the beatles}            => {radiohead}             0.07307406
## [30] {muse}                             => {radiohead}             0.14119396
##      confidence coverage   lift     count
## [1]  0.4279919  0.12212039 1.764865 211  
## [2]  0.5575000  0.09908348 1.408403 223  
## [3]  0.3540816  0.24275452 1.408303 347  
## [4]  0.3418719  0.25142432 1.408303 347  
## [5]  0.5546476  0.24250681 1.401197 543  
## [6]  0.3397997  0.39583849 1.401197 543  
## [7]  0.3337335  0.20634134 1.374778 278  
## [8]  0.5438144  0.09611097 1.373829 211  
## [9]  0.5858311  0.09090909 1.371014 215  
## [10] 0.3319076  0.28957146 1.368653 388  
## [11] 0.3963228  0.24250681 1.368653 388  
## [12] 0.5689655  0.11493683 1.349538 264  
## [13] 0.3885820  0.13450582 1.341921 211  
## [14] 0.3368592  0.30443399 1.339804 414  
## [15] 0.4078818  0.25142432 1.339804 414  
## [16] 0.3248826  0.26380976 1.338318 346  
## [17] 0.3530612  0.24275452 1.338318 346  
## [18] 0.3969388  0.24275452 1.303858 389  
## [19] 0.3165175  0.30443399 1.303858 389  
## [20] 0.3706777  0.17909339 1.280091 268  
## [21] 0.5436105  0.12212039 1.272206 268  
## [22] 0.3084371  0.17909339 1.271870 223  
## [23] 0.5000000  0.10255140 1.263141 207  
## [24] 0.3284371  0.42160020 1.244977 559  
## [25] 0.5248826  0.26380976 1.244977 559  
## [26] 0.3196126  0.20460738 1.211527 264  
## [27] 0.3673469  0.20634134 1.206655 306  
## [28] 0.4701754  0.14119396 1.187796 268  
## [29] 0.4924875  0.14837751 1.152563 295  
## [30] 0.4875962  0.28957146 1.141117 570
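
The support, confidence, and lift columns can be reproduced by hand, which also gives a direct answer to the Coldplay and The Killers question. The sketch below uses counts read from this report for rule [6], {coldplay} => {the killers}, and adds a one-sided Fisher's exact test from base R as one possible significance check. The choice of test is an assumption of this sketch. The baseline also applies only to the filtered sample (ten artists, users with at least two of them), so the result should be read with that restriction in mind.

```r
# Counts taken from the outputs in this report
n_trans <- 4037  # total transactions
n_lhs   <- 1598  # transactions containing coldplay
n_rhs   <- 979   # transactions containing the killers (coverage 0.24250681 x 4037)
n_both  <- 543   # transactions containing both (count column of rule [6])

support    <- n_both / n_trans               # share of all users with both artists
confidence <- n_both / n_lhs                 # share of coldplay users who also have the killers
lift       <- confidence / (n_rhs / n_trans) # confidence relative to the killers' base rate
cat("support:", support, " confidence:", confidence, " lift:", lift, "\n")

# 2x2 contingency table: rows = the killers (yes/no), columns = coldplay (yes/no)
tab <- matrix(
  c(n_both,         n_lhs - n_both,
    n_rhs - n_both, n_trans - n_lhs - n_rhs + n_both),
  nrow = 2,
  dimnames = list(killers = c("yes", "no"), coldplay = c("yes", "no"))
)

# One-sided test: do the two artists co-occur more often than independence predicts?
ft <- fisher.test(tab, alternative = "greater")
cat("odds ratio:", ft$estimate, " p-value:", ft$p.value, "\n")
```

A lift above 1 together with a small p-value indicates that the two artists co-occur more often than independent listening would predict within this sample.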

2.3 Filter rules for a specific artist

Demonstrates how to subset rules where a specific artist appears on the left-hand side (LHS).

# --- 2.3 Filter for Specific Artists ---
# Example: keep rules whose antecedent (LHS) contains 'radiohead'
radiohead_rules <- subset(rules_sorted, lhs %in% "radiohead")
inspect(head(radiohead_rules, 5))
##     lhs                                   rhs           support    confidence
## [1] {radiohead, the killers}           => {coldplay}    0.05523904 0.5575000 
## [2] {pink floyd, radiohead}            => {the beatles} 0.06539510 0.5689655 
## [3] {coldplay, radiohead}              => {muse}        0.06638593 0.3706777 
## [4] {coldplay, radiohead}              => {the killers} 0.05523904 0.3084371 
## [5] {radiohead, red hot chili peppers} => {coldplay}    0.05127570 0.5000000 
##     coverage   lift     count
## [1] 0.09908348 1.408403 223  
## [2] 0.11493683 1.349538 264  
## [3] 0.17909339 1.280091 268  
## [4] 0.17909339 1.271870 223  
## [5] 0.10255140 1.263141 207

3. Analysis Method 2: Eclat Algorithm

This section finds frequent co-occurrence itemsets ({A, B, …}), which are often more intuitive for overlap/clustering analysis.

3.1 Run Eclat

Runs Eclat to identify frequent itemsets up to a maximum length, using a minimum support threshold.

# 3. Analysis Method 2: Eclat Algorithm
# --- 3.1 Run Eclat Algorithm ---
# Focus on identifying clusters of artists often listened to together
frequent_itemsets <- eclat(trans, 
                           parameter = list(supp = 0.05, maxlen = 5))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1      5 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 201 
## 
## create itemset ... 
## set transactions ...[10 item(s), 4037 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating bit matrix ... [10 row(s), 4037 column(s)] done [0.00s].
## writing  ... [55 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].

3.2 Inspect frequent pairs and groups

Inspects the ten best-supported itemsets. At this threshold the top ten are all single artists, so co-occurring pairs and larger groups only appear further down the list.

# --- 3.2 Inspect Frequent Pairs/Groups ---
inspect(sort(frequent_itemsets, by = "support")[1:10])
##      items                   support   count
## [1]  {radiohead}             0.4272975 1725 
## [2]  {the beatles}           0.4216002 1702 
## [3]  {coldplay}              0.3958385 1598 
## [4]  {red hot chili peppers} 0.3044340 1229 
## [5]  {muse}                  0.2895715 1169 
## [6]  {pink floyd}            0.2638098 1065 
## [7]  {nirvana}               0.2514243 1015 
## [8]  {metallica}             0.2427545  980 
## [9]  {the killers}           0.2425068  979 
## [10] {linkin park}           0.2063413  833
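
Since the ten best-supported itemsets are all single artists, pairs and larger groups have to be filtered for explicitly. A minimal sketch on hypothetical baskets follows; `toy_baskets` and the 0.5 support threshold are made up for illustration. It uses the arules `size()` method to keep only itemsets with at least two artists.

```r
library(arules)

# Hypothetical baskets, for illustration only
toy_baskets <- list(
  u1 = c("muse", "radiohead"),
  u2 = c("muse", "radiohead", "coldplay"),
  u3 = c("coldplay", "the killers"),
  u4 = c("muse", "coldplay")
)
toy_trans <- as(toy_baskets, "transactions")

# Frequent itemsets of any length, with progress output switched off
toy_sets <- eclat(toy_trans,
                  parameter = list(supp = 0.5, maxlen = 5),
                  control   = list(verbose = FALSE))

# Keep only itemsets containing two or more artists
toy_groups <- toy_sets[size(toy_sets) >= 2]
inspect(sort(toy_groups, by = "support"))
```

Applied to the objects in this report, the same idea would read `frequent_itemsets[size(frequent_itemsets) >= 2]`.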

4. Algorithm Comparison: Apriori vs. Eclat

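# --- 4.1 Performance Comparison (Time) ---
# Note: minlen is not set here, so it defaults to 1 and rules with an empty LHS are included
# (63 rules below versus 59 in section 2.1). This call also overwrites rules_apriori.
# Timing Apriori
start_time <- Sys.time()
rules_apriori <- apriori(trans, parameter = list(supp = 0.05, conf = 0.3))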
## Apriori
## 
## Parameter specification:
##  confidence minval smax arem  aval originalSupport maxtime support minlen
##         0.3    0.1    1 none FALSE            TRUE       5    0.05      1
##  maxlen target  ext
##      10  rules TRUE
## 
## Algorithmic control:
##  filter tree heap memopt load sort verbose
##     0.1 TRUE TRUE  FALSE TRUE    2    TRUE
## 
## Absolute minimum support count: 201 
## 
## set item appearances ...[0 item(s)] done [0.00s].
## set transactions ...[10 item(s), 4037 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating transaction tree ... done [0.00s].
## checking subsets of size 1 2 3 done [0.00s].
## writing ... [63 rule(s)] done [0.00s].
## creating S4 object  ... done [0.00s].
end_time <- Sys.time()
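# Convert explicitly to seconds; a bare difftime picks its own units
# A single millisecond-scale run is noisy, so treat the timing as indicative only
cat("Apriori Execution Time:", as.numeric(difftime(end_time, start_time, units = "secs")), "seconds\n")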
## Apriori Execution Time: 0.008551836 seconds
# Timing Eclat
start_time <- Sys.time()
itemsets_eclat <- eclat(trans, parameter = list(supp = 0.05))
## Eclat
## 
## parameter specification:
##  tidLists support minlen maxlen            target  ext
##     FALSE    0.05      1     10 frequent itemsets TRUE
## 
## algorithmic control:
##  sparse sort verbose
##       7   -2    TRUE
## 
## Absolute minimum support count: 201 
## 
## create itemset ... 
## set transactions ...[10 item(s), 4037 transaction(s)] done [0.00s].
## sorting and recoding items ... [10 item(s)] done [0.00s].
## creating bit matrix ... [10 row(s), 4037 column(s)] done [0.00s].
## writing  ... [55 set(s)] done [0.00s].
## Creating S4 object  ... done [0.00s].
end_time <- Sys.time()
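# Convert explicitly to seconds; a bare difftime picks its own units
cat("Eclat Execution Time:", as.numeric(difftime(end_time, start_time, units = "secs")), "seconds\n")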
## Eclat Execution Time: 0.004576921 seconds
# --- 4.2 Structural Comparison ---
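# Recover the itemsets behind the Apriori rules. generatingItemsets() returns one itemset per rule,
# so only itemsets that produced a rule passing the confidence threshold are counted.
# The count is therefore not directly comparable with Eclat's full list of frequent itemsets.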
itemsets_from_apriori <- generatingItemsets(rules_apriori)

# Compare the number of unique patterns found
cat("Apriori found", length(unique(itemsets_from_apriori)), "unique frequent patterns.\n")
## Apriori found 38 unique frequent patterns.
cat("Eclat found", length(itemsets_eclat), "frequent itemsets.\n")
## Eclat found 55 frequent itemsets.
# Check if the top itemsets are the same
inspect(head(sort(itemsets_from_apriori, by="support"), 5))
##     items                    support  
## [1] {radiohead}              0.4272975
## [2] {the beatles}            0.4216002
## [3] {coldplay}               0.3958385
## [4] {red hot chili peppers}  0.3044340
## [5] {radiohead, the beatles} 0.2046074
inspect(head(sort(itemsets_eclat, by="support"), 5))
##     items                   support   count
## [1] {radiohead}             0.4272975 1725 
## [2] {the beatles}           0.4216002 1702 
## [3] {coldplay}              0.3958385 1598 
## [4] {red hot chili peppers} 0.3044340 1229 
## [5] {muse}                  0.2895715 1169
# --- 4.3 Quantitative Quality Comparison ---
# Convert Eclat itemsets into rules
rules_from_eclat <- ruleInduction(itemsets_eclat, trans, confidence = 0.3)

# Attach a Fisher's exact test p-value to each rule induced from the Eclat itemsets
quality(rules_from_eclat)$pValue <- interestMeasure(rules_from_eclat, 
                                                     measure = "fishersExactTest", 
                                                     transactions = trans)

# Compare average lift (the p-values are stored in quality() but not summarized here)
cat("Apriori Average Lift:", mean(quality(rules_apriori)$lift), "\n")
## Apriori Average Lift: 1.142236
cat("Eclat (Converted) Average Lift:", mean(quality(rules_from_eclat)$lift), "\n")
## Eclat (Converted) Average Lift: 1.151879
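
The gap between the two averages can be traced to the rule sets being different rather than to the algorithms. The sketch below is a back-of-the-envelope check that uses only figures printed in this report. It rests on two assumptions: the four extra rules in the timing run are empty-LHS rules, whose lift is 1 by definition, and rule induction from the Eclat itemsets yields the same 59 rules as Apriori with minlen = 2.

```r
n_rules_minlen2 <- 59        # rules with a non-empty LHS (section 2.1)
n_rules_minlen1 <- 63        # rules from the timing run (minlen defaults to 1)
mean_lift_eclat <- 1.151879  # reported average over rules induced from Eclat itemsets

# Assumption: the extra rules are {} => {item}, each with lift exactly 1
n_empty_lhs <- n_rules_minlen1 - n_rules_minlen2

# Weighted mean of the 59 shared rules and the empty-LHS rules
reconstructed <- (n_rules_minlen2 * mean_lift_eclat + n_empty_lhs * 1) / n_rules_minlen1
cat("Reconstructed Apriori average lift:", reconstructed, "\n")
```

Under these assumptions the reconstructed value agrees with the reported Apriori average of 1.142236 to six decimal places, which suggests that both algorithms found the same rules and that the difference comes from the minlen setting.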

5. Visualization: Fan Overlap Map

This section visualizes artist-to-artist association rules mined with the Apriori algorithm. Using arulesViz, the strongest rules are rendered in three complementary views (network graph, grouped matrix, and parallel coordinates) to highlight fan overlap and co-listening structure.

5.1 Interactive network graph

Plots the top rules as a graph using the HTML widget engine (interactive in HTML output).

# 5. Visualization: Fan Overlap Map
library(arulesViz)

# --- 5.1 Interactive Network Map ---
# Plotting the top 30 rules as a graph
plot(rules_sorted[1:30], method = "graph", engine = "htmlwidget")

5.2 Grouped matrix plot (heatmap-style)

Shows overlap density patterns between items using a grouped matrix plot.

# --- 5.2 Static Matrix Plot (Heatmap) ---
# Shows overlap density between different artists
plot(rules_sorted[1:20], method = "grouped")

5.3 Parallel coordinates plot

Visualizes rule flows across items, which can be helpful to understand fan-base transitions.

# --- 5.3 Parallel Coordinates Plot ---
# Great for visualizing the flow of fan bases
plot(rules_sorted[1:15], method = "paracoord")

6. Conclusion

Executive Summary: Association rule mining (Apriori) on a subset of the Last.fm 360K dataset, restricted to the ten most frequent artists, shows structured overlap among listener preferences. Instead of isolated fanbases, there are three distinct “Gravity Centers” where musical genres and eras overlap, although every lift is modest (at most 1.76).

Key Findings:

The Indie-Rock Trinity: Coldplay, Muse, and The Killers form a tight cluster. The Coldplay and The Killers pair reaches 13.5% support, and the highest-lift rule in the analysis is {coldplay, muse} => {the killers} (lift 1.76). Among Muse listeners, 48.8% also listen to Radiohead (lift 1.14) and 33.2% to The Killers (lift 1.37).

The Nostalgia Core: The Beatles and Pink Floyd co-occur in 13.8% of transactions, and 52.5% of Pink Floyd listeners also listen to The Beatles (lift 1.24). This is one of the best-supported pairs but not the strongest association: {radiohead, the beatles} has higher support (20.5%), and many rules have higher lift. Pink Floyd listeners also overlap with Metallica (lift 1.34), so this cluster is not confined to the 60s/70s canon.

The Grunge-Metal Bridge: Nirvana, Metallica, and Red Hot Chili Peppers overlap consistently, with pairwise lifts between 1.30 and 1.41. One plausible reading is that “energy” and “attitude” drive co-listening more than strict genre labels, linking 90s grunge to heavy metal, although the rules alone cannot confirm that interpretation.

The Radiohead Influence: Radiohead acts as the “connective tissue” of the dataset. It is the most frequent artist (42.7% of transactions) and appears as the RHS (consequent) of rules whose antecedents include Muse, Coldplay, and The Beatles, which positions it as a bridge between art-rock and mainstream alternative.