Recommendation system for music artists based on user preferences according to the songs listening frequencies (called weights) used as ratings
Last.fm dataset is the official song tag and song similarity dataset of the Million Song Dataset.(+940,000 matched tracks)
Last.fm provides a dataset for music recommendations. For each user in the dataset it contains a list of their top most listened to artists including the number of times those artists were played. It also includes user applied tags which can be used to build a content vector.
This dataset contains social networking, tagging, and music artist listening information from a set of 2K users from Last.fm online music system. The dataset is released in the framework of the 2nd International Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011) http://ir.ii.uam.es/hetrec2011 http://www.last.fm
Data Statistics:
1892 users
17632 artists
92834 user-listened artist relations, i.e. tuples [user, artist, listeningCount]
11946 tags
186479 tag assignments (tas), i.e. tuples [user, tag, artist]
# Loading datasets, Package Installation
artists <- read.delim('https://raw.githubusercontent.com/humbertohpgit/MSDS3rdSem_DATA612/master/artists.dat', header = TRUE, sep="\t")
artists <- artists[,1:2] # reducing to artist ID and Name
user_artists <- read.delim('https://raw.githubusercontent.com/humbertohpgit/MSDS3rdSem_DATA612/master/user_artists.dat', header = TRUE, sep="\t")
#install.packages("tidyverse")
#install.packages("recommenderlab")
#install.packages("recosystem")
library(dplyr)
library(tidyr)
library(ggplot2)
library(recommenderlab)
# Reducing data set to the most significant users and artists and Transforming weights to ratings
head(artists)
## id name
## 1 1 MALICE MIZER
## 2 2 Diary of Dreams
## 3 3 Carpathian Forest
## 4 4 Moi dix Mois
## 5 5 Bella Morte
## 6 6 Moonspell
## userID artistID weight
## 1 2 51 13883
## 2 2 52 11690
## 3 2 53 11351
## 4 2 54 10300
## 5 2 55 8983
## 6 2 56 6152
# Focus on top artists based on total listening frequency (Top 2000) and the most relevant users (with minimum listening frequency = 30)
user_artists_mat <- user_artists %>% spread(key=artistID,value=weight)
dim (user_artists_mat)
## [1] 1892 17633
artist_freq <- colSums(user_artists_mat[,-1],na.rm = TRUE)
top_user_artist <- user_artists_mat[,order(artist_freq,decreasing = TRUE)[1:2000]]
list_freq <- apply(top_user_artist,1,function(x) return(sum(!is.na(x))))
top_user_artist <- top_user_artist[list_freq > 30,]
dim(top_user_artist)
## [1] 407 2000
## 288 71 88 291 497 66 287 700 226 299 332 343 377 678 294
## 1 NA 2654 1553 NA NA 3312 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA 139 NA NA NA NA NA NA
## 6 43864 NA NA 5379 NA NA NA NA NA 1780 510 NA NA NA 3251
## 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 10 749 NA NA NA NA NA NA NA NA NA NA NA NA NA 264
## 11 3072 NA NA NA 3226 NA NA NA NA NA NA NA 59695 NA NA
## 12 NA NA NA NA NA NA NA NA NA 4 NA NA NA NA NA
## 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 16 129 NA NA NA NA NA NA 116 NA NA NA NA NA NA NA
# Transforming weights to ratings (from 1 to 5)
quantile(top_user_artist, na.rm=TRUE) #determine ratings limits
## 0% 25% 50% 75% 100%
## 1 151 343 842 128654
## 90%
## 1872
user_artist_trans <- top_user_artist
user_artist_trans[user_artist_trans < 151] <- 1
user_artist_trans[user_artist_trans >= 151 & user_artist_trans< 343] <- 2
user_artist_trans[user_artist_trans >= 343 & user_artist_trans < 842] <- 3
user_artist_trans[user_artist_trans >= 842 & user_artist_trans < 1872] <- 4
user_artist_trans[user_artist_trans >= 1872] <- 5
user_artist_trans[1:10,1:15]
## 288 71 88 291 497 66 287 700 226 299 332 343 377 678 294
## 1 NA 5 4 NA NA 5 NA NA NA NA NA NA NA NA NA
## 3 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 4 NA NA NA NA NA NA NA NA 1 NA NA NA NA NA NA
## 6 5 NA NA 5 NA NA NA NA NA 4 3 NA NA NA 5
## 9 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 10 3 NA NA NA NA NA NA NA NA NA NA NA NA NA 2
## 11 5 NA NA NA 5 NA NA NA NA NA NA NA 5 NA NA
## 12 NA NA NA NA NA NA NA NA NA 1 NA NA NA NA NA
## 14 NA NA NA NA NA NA NA NA NA NA NA NA NA NA NA
## 16 1 NA NA NA NA NA NA 1 NA NA NA NA NA NA NA
# Create the rating (sparse) matrix
user_artist_trans_mat <- as(as.matrix(user_artist_trans),"realRatingMatrix")
user_artist_trans_mat
## 407 x 2000 rating matrix of class 'realRatingMatrix' with 13909 ratings.
## ratings_vec
## 0 1 2 3 4 5
## 800091 3466 3485 3480 2086 1392
artist_plays <- data.frame( artist = names(artist_freq), plays = artist_freq)
artist_plays <- artist_plays[order(artist_plays$plays, decreasing = TRUE),]
ggplot (artist_plays[1:10,], aes(x=artist, y=plays)) + geom_bar(stat="identity") + ggtitle("Number of plays for top Artists")
#Heatmap of the most relevant artists and users
min_n_artists <- quantile(rowCounts(user_artist_trans_mat), 0.97)
min_n_users <- quantile(colCounts(user_artist_trans_mat), 0.97)
image(user_artist_trans_mat[rowCounts(user_artist_trans_mat)>min_n_artists, colCounts(user_artist_trans_mat)> min_n_users], main="Heatmap of the most relevant artists and users")
#Exploring parameters of recommendation methods in recommenderlab
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
# Create recommender model - User-based Collaborative Filtering (UCBF)
recommender_model_ubcf <- Recommender(data=getData(eval_sch,"train"),method="UBCF")
# Recommendation for the first 5 users
recom_ubcf <- predict(recommender_model_ubcf, getData(eval_sch,"known")[1:5], n=5, type = "topNList")
recom_ubcf_list <- as(recom_ubcf, "list") #convert recommenderlab object to readable list
recom_ubcf_list
## $`1`
## [1] "289" "292" "67" "229" "154"
##
## $`3`
## [1] "227" "1412" "233" "198" "707"
##
## $`17`
## [1] "498" "289" "154" "207" "292"
##
## $`24`
## [1] "707" "917" "706" "181" "503"
##
## $`25`
## [1] "707" "830" "706" "233" "813"
# Get artist names
recom_ubcf_result <- list()
for (i in c(1:5)){
recom_ubcf_result[[i]] <- artists[artists$id %in% as.vector(recom_ubcf_list[[i]]),2]
}
# Top 5 recommendations for the first 5 users
recom_ubcf_result
## [[1]]
## [1] Madonna Radiohead The Killers
## [4] Britney Spears Christina Aguilera
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[2]]
## [1] System of a Down The Beatles Nine Inch Nails Metallica
## [5] Led Zeppelin
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[3]]
## [1] Radiohead Arctic Monkeys Britney Spears
## [4] Christina Aguilera Paramore
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[4]]
## [1] Paradise Lost In Flames AC/DC Metallica Iron Maiden
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[5]]
## [1] Nine Inch Nails AC/DC Metallica As I Lay Dying
## [5] Atreyu
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
# Evaluation
# Prediction of ratings for all users
recom_ubcf <- predict(recommender_model_ubcf, getData(eval_sch,"known"), type = "ratings")
accuracy_ubcf <- calcPredictionAccuracy(x = recom_ubcf, data = getData(eval_sch, "unknown"), given=3, goodRating=3)
accuracy_ubcf
## RMSE MSE MAE
## 1.0030380 1.0060851 0.7508933
# Create recommender model - Item-based Collaborative Filtering (IBCF)
recommender_model_ibcf <- Recommender(data=getData(eval_sch,"train"),method="IBCF")
# Recommendation for the first 5 users
recom_ibcf <- predict(recommender_model_ibcf, getData(eval_sch,"known")[1:5], n=5, type = "topNList")
recom_ibcf_list <- as(recom_ibcf, "list") #convert recommenderlab object to readable list
recom_ibcf_list
## $`1`
## [1] "66" "287" "510" "162" "474"
##
## $`3`
## [1] "1933" "278" "812" "1374" "1338"
##
## $`17`
## [1] "66" "332" "202" "1411" "232"
##
## $`24`
## [1] "226" "228" "917" "181" "503"
##
## $`25`
## [1] "376" "3944" "396" "1922" "2427"
# Get artist names
recom_ibcf_result <- list()
for (i in c(1:5)){
recom_ibcf_result[[i]] <- artists[artists$id %in% as.vector(recom_ibcf_list[[i]]),2]
}
# Top 5 recommendations for the first 5 users
recom_ibcf_result
## [[1]]
## [1] Faithless God Is an Astronaut Monica
## [4] Craig David P.O.D.
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[2]]
## [1] 2Pac Dark Tranquillity HammerFall Demons & Wizards
## [5] ムック
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[3]]
## [1] Faithless CAKE Sunset Rubdown Kelly Rowland
## [5] King Crimson
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[4]]
## [1] Paradise Lost Queens of the Stone Age Kings of Leon
## [4] In Flames Iron Maiden
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
##
## [[5]]
## [1] UVERworld L'Arc~en~Ciel Rie fu Teddy Geiger
## 14035 Levels: -123 min. -OZ- -t de sangre !!! !deladap !DISTAIN ... ZZ Top
# Evaluation
# Prediction of ratings for all users
recom_ibcf <- predict(recommender_model_ibcf, getData(eval_sch,"known"), type = "ratings")
accuracy_ibcf <- calcPredictionAccuracy(x = recom_ibcf, data = getData(eval_sch, "unknown"), given=3, goodRating=3)
accuracy_ibcf
## RMSE MSE MAE
## 1.1212407 1.2571807 0.7040816
comp_methods <- rbind(accuracy_ubcf, accuracy_ibcf)
rownames(comp_methods) <- c("UBCF", "IBCF")
library(knitr)
kable(comp_methods)
RMSE | MSE | MAE | |
---|---|---|---|
UBCF | 1.003038 | 1.006085 | 0.7508933 |
IBCF | 1.121241 | 1.257181 | 0.7040816 |
# ROC & Precision & Recall
rec_methods <- list(
"UBCF" = list(name = "UBCF", param = NULL),
"IBCF" = list(name = "IBCF", param = NULL))
eval_results <- evaluate(x = eval_sch, method = rec_methods, n = c(1, 5, 10, 15, 20,25,30))
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.36sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [36.67sec/0.04sec]
UBCF recommender model demonstrates to be the best model for music artist recommendations. Although IBCF performed relatively very close in terms of ratings calculation accuracy (RMSE_UBCF = 1.00 vs RMSE_IBCF = 1.12).
ROC shows a substantial AUC for the UBCF model with the point at 10 artist recommendation as the best trade-off point between TPR and FPR. Regarding the Precision/Recall curve, it aligns with the ROC for the UBCF model, showing the best balance between Precision and Recall at the 10 to 15 artists recommendation.
In both curves, the IBCF model shows very poor perfomance.