Music streaming service companies such as Spotify and Pandora have built recommendaion systems to recommend songs and artists to users based upon user infomration. These systems mainly leverage user based collaborative filtering and content based models to implement the system. They continiusly learn fron the users behvior, interest and listening pattern.
The goal of this project is to implement a musical artist recommendations. Below are the details of the system implementaion:
The datasets were generated by the Information Retrieval Group at Universidad Autónoma de Madrid (http://ir.ii.uam.es). Before using these datasets, please review the README files for the usage license and other details.
link: https://grouplens.org/datasets/hetrec-2011/
This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system. http://www.last.fm
Data statistics
artists.dat file contains information about music artists listened and tagged by the users.
# load list of last.fm artists:
artists <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/artists.dat", delim = "\t") %>% select(id, name)
dim(artists)
## [1] 16423 2
kable(head(artists,10))
id | name |
---|---|
1 | MALICE MIZER |
2 | Diary of Dreams |
3 | Carpathian Forest |
4 | Moi dix Mois |
5 | Bella Morte |
6 | Moonspell |
7 | Marilyn Manson |
8 | DIR EN GREY |
9 | Combichrist |
10 | Grendel |
user_artists.dat file contains the artists listened by each user. It also provides a listening count for each [user, artist] pair.
# loading last.fm user_artists file
ratings <- read.table("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_artists.dat", header = TRUE, sep = "", stringsAsFactors = FALSE)
dim(ratings)
## [1] 92834 3
kable(head(ratings,10))
userID | artistID | weight |
---|---|---|
2 | 51 | 13883 |
2 | 52 | 11690 |
2 | 53 | 11351 |
2 | 54 | 10300 |
2 | 55 | 8983 |
2 | 56 | 6152 |
2 | 57 | 5955 |
2 | 58 | 4616 |
2 | 59 | 4337 |
2 | 60 | 4147 |
In the below table TotalUsers column indicates the total number of last.fm users that have listened to the artist represented by the last.fm artist ID.
users_artists.stat <- ratings %>%
group_by(artistID) %>%
summarise(TotalUsers=length(unique(userID))) %>%
arrange(desc(TotalUsers))
kable(head(users_artists.stat,10))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
333 | 417 |
292 | 407 |
190 | 400 |
498 | 399 |
library(ggplot2)
ggplot(data=head(users_artists.stat,100), aes(x=artistID, y=TotalUsers)) +
geom_bar(stat="identity") +
theme_bw()
Let’s build a user-artist-rating dataframe with artist’s name populated
users.artist.rating <- merge.data.frame(artists,ratings,by.x = 'id',by.y = 'artistID')
kable(head(users.artist.rating))
id | name | userID | weight |
---|---|---|---|
1 | MALICE MIZER | 34 | 212 |
1 | MALICE MIZER | 274 | 483 |
1 | MALICE MIZER | 785 | 76 |
2 | Diary of Dreams | 2066 | 83 |
2 | Diary of Dreams | 135 | 1021 |
2 | Diary of Dreams | 1604 | 455 |
Below dataframe shows artists names that are listened most by the users.
users.artist.info<- merge.data.frame(artists,users_artists.stat,by.x = 'id',by.y = 'artistID') %>%
arrange(desc(TotalUsers))
kable(head(users.artist.info,20))
id | name | TotalUsers |
---|---|---|
89 | Lady Gaga | 611 |
289 | Britney Spears | 522 |
288 | Rihanna | 484 |
227 | The Beatles | 480 |
300 | Katy Perry | 473 |
67 | Madonna | 429 |
333 | Avril Lavigne | 417 |
292 | Christina Aguilera | 407 |
190 | Muse | 400 |
498 | Paramore | 399 |
295 | Beyoncé | 397 |
154 | Radiohead | 393 |
65 | Coldplay | 369 |
466 | Ke$ha | 362 |
701 | Shakira | 319 |
302 | P!nk | 305 |
229 | The Killers | 304 |
306 | Black Eyed Peas | 304 |
55 | Kylie Minogue | 298 |
461 | Miley Cyrus | 286 |
Let’s focus on top 1000 artists as determined by the number of listeners/ users.
artists.top1000 = users_artists.stat[1:1000,]
kable(head(artists.top1000))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
missing <- subset(artists.top1000, !(artistID %in% users.artist.info$id))
length(missing$artistID)
## [1] 182
Since 182 artists must be removed, this leaves us with a total of 818 artists to be retained within our user-artists data.
lastfm.ua <- subset(ratings, artistID %in% artists.top1000$artistID)
lastfm.ua <- subset(ratings, !(artistID %in% missing$artistID))
top_818 <- subset(artists.top1000, !(artistID %in% missing$artistID))
rm(artists.top1000)
kable(head(top_818))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
tags.dat file contains the set of tags available in the dataset. Load list of genres / tags:
lastfm.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/tags.dat", delim = "\t")
dim(lastfm.tags)
## [1] 11946 2
user_taggedartists.dat file contains the tag assignments of artists provided by each particular user. Load list of UserID / Tag pairs:
users.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_taggedartists.dat", delim = "\t") %>% select(userID, artistID, tagID)
dim(lastfm.tags)
## [1] 11946 2
kable(head(users.tags))
userID | artistID | tagID |
---|---|---|
2 | 52 | 13 |
2 | 52 | 15 |
2 | 52 | 18 |
2 | 52 | 21 |
2 | 52 | 41 |
2 | 63 | 13 |
users_artists.tag.stat <- users.tags %>%
group_by(tagID) %>%
summarise(TotalUsers=length(unique(userID))) %>%
arrange(desc(TotalUsers))
kable(head(users_artists.tag.stat))
tagID | TotalUsers |
---|---|
73 | 673 |
24 | 585 |
79 | 532 |
18 | 503 |
81 | 450 |
130 | 440 |
Considering Top 200 tags only for analysis purpose
tag_200 <- users_artists.tag.stat[1:200,]
tag_200 <- arrange(tag_200, tagID)
users.artist.tag.info<- merge.data.frame(lastfm.tags,users_artists.tag.stat,by.x = 'tagID',by.y = 'tagID') %>%
arrange(desc(TotalUsers))
kable(head(users.artist.tag.info,20))
tagID | tagValue | TotalUsers |
---|---|---|
73 | rock | 673 |
24 | pop | 585 |
79 | alternative | 532 |
18 | electronic | 503 |
81 | indie | 450 |
130 | female vocalists | 440 |
78 | alternative rock | 374 |
39 | dance | 370 |
84 | indie rock | 262 |
195 | british | 254 |
192 | classic rock | 252 |
25 | 80s | 248 |
33 | experimental | 247 |
181 | punk | 243 |
14 | ambient | 242 |
1 | metal | 237 |
72 | hard rock | 235 |
134 | singer-songwriter | 221 |
292 | folk | 211 |
191 | instrumental | 195 |
Now, let’s exclude artists not in top 818 list from the above artists analysis
user_top_tags<- subset(users.tags, tagID %in% tag_200$tagID)
user_top_tags<- subset(user_top_tags, artistID %in% top_818$artistID)
length(unique(user_top_tags$artistID))
## [1] 815
Now ensure that artistID / tagID pairs are summarized by counting the number of times any given tag has been applied to an artist:
user.topTags <- summarise(group_by(user_top_tags, artistID, tagID ),
Count = length(tagID) )
length(unique(user.topTags$artistID))
## [1] 815
So we have a mismatch: 3 of the 818 artists we retained from the user-artists data have not been genre tagged by any user via the top 200 genre tags
Which 3?
not_tagged <- subset(top_818, !(artistID %in% user_top_tags$artistID))
not_tagged
## # A tibble: 3 x 2
## artistID TotalUsers
## <int> <int>
## 1 1627 19
## 2 3378 18
## 3 4350 16
not_tagged$artistID %in% users.tags$artistID
## [1] TRUE TRUE TRUE
Remove 3 non-tagged artists from top_818
top_815 <- subset(top_818, artistID %in% user_top_tags$artistID)
length(unique(top_815$artistID))
## [1] 815
Remove artists not tagged using top 200 tags from user-artists data
lastfm.ua <- subset(lastfm.ua, artistID %in% top_815$artistID)
length(unique(lastfm.ua$userID))
## [1] 1870
1870 users retained - same as with top 818 artists
One of our goals for this project include the use of a user-based collaborative filter for purposes of generating recommendations of musical artists to last.fm users, we need to convert our reduced user-artists data to a user-item matrix
# convert to wide format
lastfm_mat <- spread(lastfm.ua, artistID, weight)
user_ids <- as.vector(lastfm_mat$userID)
lastfm_rating_mat <- as.matrix(lastfm_mat[,2:ncol(lastfm_mat)])
Calculate density / sparsity
nratings <- length(as.vector(lastfm_rating_mat))
nratings
## [1] 1524050
sum(!is.na(as.vector(lastfm_rating_mat)) ) /
1 - sum(!is.na(as.vector(lastfm_rating_mat)) ) / nratings
## [1] 51364.97
For content-based recommendation efforts for this project, we need to create artist-genre matrix and provide similar artist recommendations to users.
# convert to wide format
tmp_mat <- spread(user.topTags, tagID, Count)
ag_artistID <- as.vector(tmp_mat$artistID)
art.genre.mat <- as.matrix(tmp_mat[,2:ncol(tmp_mat)])
Validating the content of artist-genre matrix against original user-tags data
nrow(subset(users.tags, artistID == 89 & tagID == 24))
## [1] 99
nrow(subset(users.tags, artistID == 72 & tagID == 18))
## [1] 68
Calculate density / sparsity
ntags <- length(as.vector(art.genre.mat))
ntags
## [1] 163000
sum(!is.na(as.vector(art.genre.mat)) ) / ntags
## [1] 0.09107975
1 - sum(!is.na(as.vector(art.genre.mat)) ) / ntags
## [1] 0.9089202
The artist-genre matrix can be binarized as follows:
bin_art.gen.mat <- art.genre.mat
bin_art.gen.mat[,][is.na(bin_art.gen.mat[,])] <- 0
bin_art.gen.mat[,][bin_art.gen.mat[,] > 0] <- 1
The user-item matrix can be binarized as follows:
bin.ua.rating.mat <- lastfm_rating_mat
bin.ua.rating.mat[,][is.na(bin.ua.rating.mat[,])] <- 0
bin.ua.rating.mat[,][bin.ua.rating.mat[,] > 0] <- 1
userArtist_binaryMat <- as(bin.ua.rating.mat,"binaryRatingMatrix")
e_bin <- evaluationScheme(userArtist_binaryMat, method="split", train=0.8, given = 1, goodRating = 1)
n_recommended <- 10
bin_rec <- Recommender(getData(e_bin, "train"), "UBCF",
parameter = list(method = "Jaccard"))
bin_pred <- predict(bin_rec, getData(e_bin, "known"), n = n_recommended, goodRating = 1)
error_b <- calcPredictionAccuracy(bin_pred, getData(e_bin, "unknown"),
given = n_recommended, goodRating = 1)
kable(error_b, caption = "Performance Metrics")
x | |
---|---|
TP | 3.1443850 |
FP | 6.8556150 |
FN | 22.9518717 |
TN | 772.0481283 |
precision | 0.3144385 |
recall | 0.1313149 |
TPR | 0.1313149 |
FPR | 0.0087814 |
The performance metrics for the model are shown above. As we can see, the precision and recall are both relatively low, an also the true positive rate.
bin_pred@items[1:4]
## [[1]]
## [1] 254 169 184 129 128 229 140 37 172 244
##
## [[2]]
## [1] 142 128 131 37 129 132 135 169 140 338
##
## [[3]]
## [1] 78 47 524 99 518 104 244 71 267 254
##
## [[4]]
## [1] 128 129 37 140 132 23 416 561 418 13
n_recommended <- 20
bin_pred <- predict(bin_rec, userArtist_binaryMat, n = n_recommended, goodRating = 1)
bin_pred@items[20:23]
## $`20`
## [1] 78 184 658 427 256 343 369 484 428 461 641 545 370 429 71 643 534
## [18] 431 509 671
##
## $`21`
## [1] 599 47 105 311 295 519 342 216 17 349 579 653 524 450 90 206 463
## [18] 84 59 453
##
## $`22`
## [1] 132 338 145 272 173 139 233 229 270 121 134 40 289 165 172 161 505
## [18] 174 420 133
##
## $`23`
## [1] 117 139 145 131 140 169 233 127 270 141 156 272 229 168 232 172 138
## [18] 125 51 645
#create a data frame to hold 10 recommendations for each artist
u_tenRec <- data.frame(matrix(ncol = 11, nrow = length(user_ids)))
u_tenRec[,1] <- user_ids
colnames(u_tenRec) <- c("userID", "r1", "r2", "r3", "r4", "r5",
"r6", "r7", "r8", "r9", "r10")
# load the recommendations
for (i in 1:length(bin_pred@items)){
tmp_recs <- as.vector(bin_pred@items[[i]])
num_trecs <- length(tmp_recs)
user_arts <- unique(subset(lastfm.ua, userID == user_ids[i])$artistID)
new_recs <- tmp_recs[!(tmp_recs %in% user_arts) ]
num_newrecs <- length(new_recs)
if(num_newrecs < 10) {
new_recs <- sample(top_815$artistID[!(top_815$artistID %in% user_arts)], 10)
}
if (num_newrecs < 13) {
topten <- new_recs[1:10]
} else {
t_seven <- sample(new_recs[1:10], 7)
t_three <- sample(new_recs[11:length(new_recs)], 3)
topten <- c(t_seven, t_three)
}
topten <- sample(topten, 10)
u_tenRec[i,2:11 ] <- topten
}
# randomly select a user
user <- sample(user_ids, 1)
urecs <- sort(as.vector(subset(u_tenRec, userID == user)[2:11]) )
rec_names <- subset(artists, id %in% urecs)$name
kable(rec_names, col.names = "Artists You Might Like")
Artists You Might Like |
---|
Spandau Ballet |
Katie Melua |
CAKE |
Andrew Bird |
Babyshambles |
Katy Perry |
Monrose |
Aaron Carter |
Renato Carosone |
at |
Create a similarity matrix using cosine distance as metric of similarity
# calculate artist similarity matrix
artist_sim <- similarity(as(bin_art.gen.mat, "binaryRatingMatrix"), method = "cosine",
which = "users")
artist_sim <- as(artist_sim, "matrix")
artist_sim[][] <- round(artist_sim[][],3)
colnames(artist_sim) <- ag_artistID
rownames(artist_sim) <- ag_artistID
image(as.matrix(artist_sim), main = "Artists similarity")
n_recommended <- 5
artist <- sample(ag_artistID, 1)
a_name <- artists[artists$id == artist,]$name
arecs <- sort(artist_sim[as.character(artist),], decreasing = TRUE)[1:n_recommended]
arecs_IDs <- as.numeric(names(arecs))
arec_names <- artists[artists$id %in% arecs_IDs,]$name
table_head <- sprintf("Artists Similar to %s", a_name)
kable(arec_names, col.names = table_head)
Artists Similar to Cradle of Filth |
---|
Dimmu Borgir |
Paradise Lost |
Epica |
Deathstars |
Burzum |
This was a wonderful dataset to build recommendation system driven by collaborative user based filtering and content based model technques. By binarizing the data the model did work better and with the low precision and recall rate and also true positive stayed low. Also, made recommendation for a given user and similar artists successfully. In future I would like to build an intractive system using Shiny to domonstrate the implemented features.