Music streaming service companies such as Spotify and Pandora have built recommendation systems to recommend songs and artists to users based upon user information. These systems mainly leverage user based collaborative filtering and content based models to implement the system. They continously learn fron the users behvior, interest and listening pattern.
The goal of this project is to implement a musical artist recommendations. Below are the details of the system implementaion:
The datasets were generated by the Information Retrieval Group at Universidad Autónoma de Madrid (http://ir.ii.uam.es). Before using these datasets, please review the README files for the usage license and other details.
link: https://grouplens.org/datasets/hetrec-2011/
This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system. http://www.last.fm
Data statistics
artists.dat file contains information about music artists listened and tagged by the users.
# load list of last.fm artists:
artists <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/artists.dat", delim = "\t") %>% select(id, name)
dim(artists)
## [1] 16423 2
kable(head(artists,10))
id | name |
---|---|
1 | MALICE MIZER |
2 | Diary of Dreams |
3 | Carpathian Forest |
4 | Moi dix Mois |
5 | Bella Morte |
6 | Moonspell |
7 | Marilyn Manson |
8 | DIR EN GREY |
9 | Combichrist |
10 | Grendel |
user_artists.dat file contains the artists listened by each user. It also provides a listening count for each [user, artist] pair.
# loading last.fm user_artists file
ratings <- read.table("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_artists.dat", header = TRUE, sep = "", stringsAsFactors = FALSE)
dim(ratings)
## [1] 92834 3
kable(head(ratings,10))
userID | artistID | weight |
---|---|---|
2 | 51 | 13883 |
2 | 52 | 11690 |
2 | 53 | 11351 |
2 | 54 | 10300 |
2 | 55 | 8983 |
2 | 56 | 6152 |
2 | 57 | 5955 |
2 | 58 | 4616 |
2 | 59 | 4337 |
2 | 60 | 4147 |
In the below table TotalUsers column indicates the total number of last.fm users that have listened to the artist represented by the last.fm artist ID.
users_artists.stat <- ratings %>%
group_by(artistID) %>%
summarise(TotalUsers=length(unique(userID))) %>%
arrange(desc(TotalUsers))
kable(head(users_artists.stat,10))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
333 | 417 |
292 | 407 |
190 | 400 |
498 | 399 |
library(ggplot2)
ggplot(data=head(users_artists.stat,100), aes(x=artistID, y=TotalUsers)) +
geom_bar(stat="identity") +
theme_bw()
Let’s build a user-artist-rating dataframe with artist’s name populated
users.artist.rating <- merge.data.frame(artists,ratings,by.x = 'id',by.y = 'artistID')
kable(head(users.artist.rating))
id | name | userID | weight |
---|---|---|---|
1 | MALICE MIZER | 34 | 212 |
1 | MALICE MIZER | 274 | 483 |
1 | MALICE MIZER | 785 | 76 |
2 | Diary of Dreams | 2066 | 83 |
2 | Diary of Dreams | 135 | 1021 |
2 | Diary of Dreams | 1604 | 455 |
Below dataframe shows artists names that are listened most by the users. Lady Gaga and Britney Spears rank on top of the chart and are most popular artists. But we just need to remember that this dataset is from 2011 and musci industry has chnaged a lot since then and we have lot more new aritsts, songs and listeners.
users.artist.info<- merge.data.frame(artists,users_artists.stat,by.x = 'id',by.y = 'artistID') %>%
arrange(desc(TotalUsers))
kable(head(users.artist.info,20))
id | name | TotalUsers |
---|---|---|
89 | Lady Gaga | 611 |
289 | Britney Spears | 522 |
288 | Rihanna | 484 |
227 | The Beatles | 480 |
300 | Katy Perry | 473 |
67 | Madonna | 429 |
333 | Avril Lavigne | 417 |
292 | Christina Aguilera | 407 |
190 | Muse | 400 |
498 | Paramore | 399 |
295 | Beyoncé | 397 |
154 | Radiohead | 393 |
65 | Coldplay | 369 |
466 | Ke$ha | 362 |
701 | Shakira | 319 |
302 | P!nk | 305 |
229 | The Killers | 304 |
306 | Black Eyed Peas | 304 |
55 | Kylie Minogue | 298 |
461 | Miley Cyrus | 286 |
Let’s focus on top 1000 artists as determined by the number of listeners/ users.
artists.top1000 = users_artists.stat[1:1000,]
kable(head(artists.top1000))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
missing <- subset(artists.top1000, !(artistID %in% users.artist.info$id))
length(missing$artistID)
## [1] 182
Since 182 artists must be removed, this leaves us with a total of 818 artists to be retained within our user-artists data.
lastfm.ua <- subset(ratings, artistID %in% artists.top1000$artistID)
lastfm.ua <- subset(ratings, !(artistID %in% missing$artistID))
top_818 <- subset(artists.top1000, !(artistID %in% missing$artistID))
rm(artists.top1000)
kable(head(top_818))
artistID | TotalUsers |
---|---|
89 | 611 |
289 | 522 |
288 | 484 |
227 | 480 |
300 | 473 |
67 | 429 |
tags.dat file contains the set of tags available in the dataset. Load list of genres / tags:
lastfm.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/tags.dat", delim = "\t")
dim(lastfm.tags)
## [1] 11946 2
user_taggedartists.dat file contains the tag assignments of artists provided by each particular user. Load list of UserID / Tag pairs:
users.tags <- read_delim("https://raw.githubusercontent.com/niteen11/MSDS/master/DATA643/dataset/hetrec2011-lastfm-2k/user_taggedartists.dat", delim = "\t") %>% select(userID, artistID, tagID)
dim(lastfm.tags)
## [1] 11946 2
kable(head(users.tags))
userID | artistID | tagID |
---|---|---|
2 | 52 | 13 |
2 | 52 | 15 |
2 | 52 | 18 |
2 | 52 | 21 |
2 | 52 | 41 |
2 | 63 | 13 |
users_artists.tag.stat <- users.tags %>%
group_by(tagID) %>%
summarise(TotalUsers=length(unique(userID))) %>%
arrange(desc(TotalUsers))
kable(head(users_artists.tag.stat))
tagID | TotalUsers |
---|---|
73 | 673 |
24 | 585 |
79 | 532 |
18 | 503 |
81 | 450 |
130 | 440 |
Considering Top 200 tags only for analysis purpose. Rock and pop are definitely most popular genres here in the given dataset.
tag_200 <- users_artists.tag.stat[1:200,]
tag_200 <- arrange(tag_200, tagID)
users.artist.tag.info<- merge.data.frame(lastfm.tags,users_artists.tag.stat,by.x = 'tagID',by.y = 'tagID') %>%
arrange(desc(TotalUsers))
kable(head(users.artist.tag.info,20))
tagID | tagValue | TotalUsers |
---|---|---|
73 | rock | 673 |
24 | pop | 585 |
79 | alternative | 532 |
18 | electronic | 503 |
81 | indie | 450 |
130 | female vocalists | 440 |
78 | alternative rock | 374 |
39 | dance | 370 |
84 | indie rock | 262 |
195 | british | 254 |
192 | classic rock | 252 |
25 | 80s | 248 |
33 | experimental | 247 |
181 | punk | 243 |
14 | ambient | 242 |
1 | metal | 237 |
72 | hard rock | 235 |
134 | singer-songwriter | 221 |
292 | folk | 211 |
191 | instrumental | 195 |
Now, let’s exclude artists not in top 818 list from the above artists analysis
user_top_tags<- subset(users.tags, tagID %in% tag_200$tagID)
user_top_tags<- subset(user_top_tags, artistID %in% top_818$artistID)
length(unique(user_top_tags$artistID))
## [1] 815
Now ensure that artistID / tagID pairs are summarized by counting the number of times any given tag has been applied to an artist:
user.topTags <- summarise(group_by(user_top_tags, artistID, tagID ),
Count = length(tagID) )
length(unique(user.topTags$artistID))
## [1] 815
So we have a mismatch: 3 of the 818 artists we retained from the user-artists data have not been genre tagged by any user via the top 200 genre tags
Which 3?
not_tagged <- subset(top_818, !(artistID %in% user_top_tags$artistID))
not_tagged
## # A tibble: 3 x 2
## artistID TotalUsers
## <int> <int>
## 1 1627 19
## 2 3378 18
## 3 4350 16
not_tagged$artistID %in% users.tags$artistID
## [1] TRUE TRUE TRUE
Remove 3 non-tagged artists from top_818
top_815 <- subset(top_818, artistID %in% user_top_tags$artistID)
length(unique(top_815$artistID))
## [1] 815
Remove artists not tagged using top 200 tags from user-artists data
lastfm.ua <- subset(lastfm.ua, artistID %in% top_815$artistID)
length(unique(lastfm.ua$userID))
## [1] 1870
1870 users retained - same as with top 818 artists
One of the goals for this project include the use of a user-based collaborative filter for purposes of generating recommendations of musical artists to last.fm users, we need to convert our reduced user-artists data to a user-item matrix
# convert to wide format
lastfm_mat <- spread(lastfm.ua, artistID, weight)
user_ids <- as.vector(lastfm_mat$userID)
lastfm_rating_mat <- as.matrix(lastfm_mat[,2:ncol(lastfm_mat)])
Calculate density / sparsity
nratings <- length(as.vector(lastfm_rating_mat))
nratings
## [1] 1524050
sum(!is.na(as.vector(lastfm_rating_mat)) ) /
1 - sum(!is.na(as.vector(lastfm_rating_mat)) ) / nratings
## [1] 51364.97
For content-based recommendation efforts for this project, we need to create artist-genre matrix and provide similar artist recommendations to users.
# convert to wide format
tmp_mat <- spread(user.topTags, tagID, Count)
ag_artistID <- as.vector(tmp_mat$artistID)
art.genre.mat <- as.matrix(tmp_mat[,2:ncol(tmp_mat)])
Validating the content of artist-genre matrix against original user-tags data
nrow(subset(users.tags, artistID == 89 & tagID == 24))
## [1] 99
nrow(subset(users.tags, artistID == 72 & tagID == 18))
## [1] 68
Calculate density / sparsity
ntags <- length(as.vector(art.genre.mat))
ntags
## [1] 163000
sum(!is.na(as.vector(art.genre.mat)) ) / ntags
## [1] 0.09107975
1 - sum(!is.na(as.vector(art.genre.mat)) ) / ntags
## [1] 0.9089202
The artist-genre matrix can be binarized as follows:
bin_art.gen.mat <- art.genre.mat
bin_art.gen.mat[,][is.na(bin_art.gen.mat[,])] <- 0
bin_art.gen.mat[,][bin_art.gen.mat[,] > 0] <- 1
The user-item matrix can be binarized as follows:
bin.ua.rating.mat <- lastfm_rating_mat
bin.ua.rating.mat[,][is.na(bin.ua.rating.mat[,])] <- 0
bin.ua.rating.mat[,][bin.ua.rating.mat[,] > 0] <- 1
In this model building section, we will take a look at different models, algorightms and techniques. We’ll also verify that how the model real rating matrix and binary rating matrix perform.
ratings_matrix <- as(bin.ua.rating.mat, "realRatingMatrix")
ratings_matrix
## 1870 x 815 rating matrix of class 'realRatingMatrix' with 1524050 ratings.
image(ratings_matrix[1:100, 1:100], main = "Visualization of Real Ratings Matrix")
userArtist_binaryMat <- as(bin.ua.rating.mat,"binaryRatingMatrix")
userArtist_binaryMat
## 1870 x 815 rating matrix of class 'binaryRatingMatrix' with 51365 ratings.
image(userArtist_binaryMat[1:100, 1:100], main = "Visualization of Binary Ratings Matrix")
In order to recommend items to new users, collaborative filtering estimates the ratings of items that are not yet listened Then, it recommends the top-rated items. We can evaluate the model by comparing the estimated ratings with the real ones.
e <- evaluationScheme(ratings_matrix, method="split", train=0.8, k=1, given=10, goodRating=1 )
e_bin <- evaluationScheme(userArtist_binaryMat, method="split", train=0.8, given = 1, goodRating = 1)
ubcf_rec <- Recommender(getData(e, "train"), "UBCF")
ibcf_rec <- Recommender(getData(e, "train"), "IBCF")
rand_rec <- Recommender(getData(e, "train"), "RANDOM")
pop_rec <- Recommender(getData(e, "train"), "POPULAR")
svd_rec <- Recommender(getData(e, "train"), "SVD")
bin_rec <- Recommender(getData(e_bin, "train"), "UBCF", parameter = list(method = "Jaccard"))
ubcf_pred <- predict(ubcf_rec, getData(e, "known"), type="ratings")
ibcf_pred <- predict(ibcf_rec, getData(e, "known"), type="ratings")
rand_pred <- predict(rand_rec, getData(e, "known"), type="ratings")
pop_pred <- predict(pop_rec, getData(e, "known"), type="ratings")
svd_pred <- predict(svd_rec, getData(e, "known"), type="ratings")
bin_pred <- predict(bin_rec, getData(e_bin, "known"))
errs <- rbind(
ubcf = calcPredictionAccuracy(ubcf_pred, getData(e, "unknown")),
ibcf = calcPredictionAccuracy(ibcf_pred, getData(e, "unknown")),
rand = calcPredictionAccuracy(rand_pred, getData(e, "unknown")),
pop = calcPredictionAccuracy(pop_pred, getData(e, "unknown")),
svd = calcPredictionAccuracy(svd_pred, getData(e, "unknown")),
bin = calcPredictionAccuracy(bin_pred, getData(e_bin, "unknown"),given = 10, goodRating = 1)
)
## Warning in rbind(ubcf = calcPredictionAccuracy(ubcf_pred, getData(e,
## "unknown")), : number of columns of result is not a multiple of vector
## length (arg 1)
Two accuracy metrics are as follows:
Precision: This is the percentage of recommended items that have been listened. It’s the number of FP divided by the total number of positives (TP + FP).
Recall: This is the percentage of listented items that have been recommended. It’s the number of TP divided by the total number of artists listened (TP + FN). It’s also equal to the True Positive Rate.
kable(errs)
TP | FP | FN | TN | precision | recall | TPR | FPR | |
---|---|---|---|---|---|---|---|---|
ubcf | 0.2099523 | 0.0440800 | 0.1390747 | 0.2099523 | 0.0440800 | 0.1390747 | 0.2099523 | 0.0440800 |
ibcf | 0.4612241 | 0.2127277 | 0.2289113 | 0.4612241 | 0.2127277 | 0.2289113 | 0.4612241 | 0.2127277 |
rand | 0.4789870 | 0.2294285 | 0.3001085 | 0.4789870 | 0.2294285 | 0.3001085 | 0.4789870 | 0.2294285 |
pop | 0.1852863 | 0.0343310 | 0.0799117 | 0.1852863 | 0.0343310 | 0.0799117 | 0.1852863 | 0.0343310 |
svd | 0.1884342 | 0.0355075 | 0.0647944 | 0.1884342 | 0.0355075 | 0.0647944 | 0.1884342 | 0.0355075 |
bin | 3.2834225 | 6.7165775 | 22.9786096 | 772.0213904 | 0.3283422 | 0.1364956 | 0.1364956 | 0.0086011 |
True Positive Rate (TPR): This is the percentage of listented items that have been recommended. It’s the number of TP divided by the number of listented items (TP + FN).
False Positive Rate (FPR): This is the percentage of not listented items that have been recommended. It’s the number of FP divided by the number of not listented items (FP + TN)
We can clearly see how binraizing the data really making a big difference here in terms of overall performance.
algorithms <- list("Random" = list(name="RANDOM", param=NULL),
"Popular" = list(name="POPULAR", param=NULL),
"UBCF"= list(name="UBCF", param=list(nn=50)),
"IBCF"= list(name="IBCF", param=list(k=50)),
"SVD"= list(name="SVD")
#"BIN_UBCF"= list(name="UBCF")
)
results <- evaluate(e, algorithms, type = "topNList", n=c(1, 5, 10, 20, 30, 50))
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.04sec/0.97sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.45sec/2.85sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.22sec/7.27sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [14.69sec/0.26sec]
## SVD run fold/sample [model time/prediction time]
## 1 [1.36sec/1.01sec]
results_bin <- evaluate(e_bin, "UBCF", type = "topNList", n=c(1, 5, 10, 20, 30, 50))
## UBCF run fold/sample [model time/prediction time]
## 1 [0.05sec/4.12sec]
plot(results_bin, "prec/rec")
title("Binary Rating prec/rec ")
In order to evaluate the models properly, we need to test them, varying the number of items.
plot(results, "prec/rec", annotate=c(1,3), legend="bottomright")
title("Real Rating prec/rec ")
plot(results_bin, "ROC")
title("Binary Rating ROC ")
plot(results, "ROC", annotate=c(4, 3))
title("Real Rating prec/rec ")
Now checking for the suppress warnings for real rating matrix
eval_results <- suppressWarnings(evaluate(x =e , method = algorithms, n = seq(10, 100, 10)))
## RANDOM run fold/sample [model time/prediction time]
## 1 [0.01sec/0.75sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0.23sec/1.83sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0.21sec/7.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [13.03sec/0.35sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.37sec/1.04sec]
plot(eval_results, "prec/rec", annotate = T, main = "Precision-recall")
title("Precision-recall")
plot(eval_results, annotate = 1, legend = "topleft")
title("ROC curve")
n_recommended <- 10
bin_rec <- Recommender(getData(e_bin, "train"), "UBCF", parameter = list(method = "Jaccard"))
bin_pred <- predict(bin_rec, getData(e_bin, "known"), n = n_recommended, goodRating = 1)
error_b <- calcPredictionAccuracy(bin_pred, getData(e_bin, "unknown"),
given = n_recommended, goodRating = 1)
kable(error_b, caption = "Performance Metrics")
x | |
---|---|
TP | 3.2834225 |
FP | 6.7165775 |
FN | 22.9786096 |
TN | 772.0213904 |
precision | 0.3283422 |
recall | 0.1364956 |
TPR | 0.1364956 |
FPR | 0.0086011 |
The performance metrics for the model are shown above. As we can clearly see that for binarized UBCF the precision and recall are both comaparatively low, and also the true positive rate.
bin_pred@items[1:4]
## [[1]]
## [1] 326 37 140 172 254 129 458 184 185 509
##
## [[2]]
## [1] 129 37 128 135 23 338 21 421 140 173
##
## [[3]]
## [1] 99 53 362 561 300 357 352 406 399 320
##
## [[4]]
## [1] 47 196 32 505 770 15 364 28 20 11
n_recommended <- 20
bin_pred <- predict(bin_rec, userArtist_binaryMat, n = n_recommended, goodRating = 1)
bin_pred@items[12:15]
## $`12`
## [1] 129 128 338 145 117 131 133 331 231 326 156 418 458 340 165 376 328
## [18] 475 615 141
##
## $`13`
## [1] 810 99 300 357 362 561 653 408 765 599 633 406 402 47 358 450 602
## [18] 90 559 608
##
## $`14`
## [1] 196 404 349 215 505 79 92 195 53 75 402 209 211 89 88 399 97
## [18] 71 32 753
##
## $`15`
## [1] 140 128 129 338 135 132 142 23 233 145 169 229 117 458 149 130 13
## [18] 161 426 254
Le’s create a data frame to hold the 10 recommendations adn then extracts a subset of 10 artists from the list of 20 generated above
#create a data frame to hold 10 recommendations for each artist
u_tenRec <- data.frame(matrix(ncol = 11, nrow = length(user_ids)))
u_tenRec[,1] <- user_ids
colnames(u_tenRec) <- c("userID", "r1", "r2", "r3", "r4", "r5",
"r6", "r7", "r8", "r9", "r10")
# load the recommendations
for (i in 1:length(bin_pred@items)){
tmp_recs <- as.vector(bin_pred@items[[i]])
num_trecs <- length(tmp_recs)
user_arts <- unique(subset(lastfm.ua, userID == user_ids[i])$artistID)
new_recs <- tmp_recs[!(tmp_recs %in% user_arts) ]
num_newrecs <- length(new_recs)
if(num_newrecs < 10) {
new_recs <- sample(top_815$artistID[!(top_815$artistID %in% user_arts)], 10)
}
if (num_newrecs < 13) {
topten <- new_recs[1:10]
} else {
t_seven <- sample(new_recs[1:10], 7)
t_three <- sample(new_recs[11:length(new_recs)], 3)
topten <- c(t_seven, t_three)
}
topten <- sample(topten, 10)
u_tenRec[i,2:11 ] <- topten
}
Now, build a list of top ten artists that our recommender system suggests that they might like
# randomly select a user
user <- sample(user_ids, 1)
urecs <- sort(as.vector(subset(u_tenRec, userID == user)[2:11]) )
rec_names <- subset(artists, id %in% urecs)$name
kable(rec_names, col.names = "Artists You Might Like")
Artists You Might Like |
---|
Sade |
Kosheen |
The Decemberists |
Sara Tavares |
Katy Perry |
2NE1 |
Crossfade |
The Cribs |
Jethro Tull |
Watch Tower Bible and Tract Society of PA |
In order to make recommendations of artists similar to a specific artist, we need to create an artist similarity matrix using cosine distance as the metric of similarity using similarity function of the recommenderlab package. In this way, we can calculate the “similarity” of any two artists with a cosine distance function
Create a similarity matrix using cosine distance as metric of similarity
# calculate artist similarity matrix
artist_sim <- similarity(as(bin_art.gen.mat, "binaryRatingMatrix"), method = "cosine",
which = "users")
artist_sim <- as(artist_sim, "matrix")
artist_sim[][] <- round(artist_sim[][],3)
colnames(artist_sim) <- ag_artistID
rownames(artist_sim) <- ag_artistID
image(as.matrix(artist_sim), main = "Artists similarity")
n_recommended <- 5
artist <- sample(ag_artistID, 1)
a_name <- artists[artists$id == artist,]$name
arecs <- sort(artist_sim[as.character(artist),], decreasing = TRUE)[1:n_recommended]
arecs_IDs <- as.numeric(names(arecs))
arec_names <- artists[artists$id %in% arecs_IDs,]$name
table_head <- sprintf("Artists Similar to %s", a_name)
kable(arec_names, col.names = table_head)
Artists Similar to Radiohead |
---|
Coldplay |
Depeche Mode |
Placebo |
Muse |
The Beatles |
This was a wonderful dataset to build recommendation system driven by collaborative user based filtering and content based model technques. We evaluated and compared multiple techniques, models and algorithms. By binarizing the data the model did work better and with the low precision and recall rate and also true positive stayed low. We made recommendation on top 10 artists for a given user and also tacked similar artists recommendations successfully. We learned how to evaluate the performance of different models in order to choose the most accurate one. There are different ways to evaluate performances that might potentially lead to different choices. Depending on the business target, the evaluation metric is different. This project also demonstrates how data and modeling tehniques should be combined to achieve the final goal.