Intro

For the final project, I used the Last.fm dataset obtained from grouplens.org. The dataset captures several aspects of a user's listening activity, such as listen counts, artists, and tag words.

Libraries

The primary packages used in this project are sparklyr, dplyr, and plotly, plus DT for displaying tables. Spark runs in a local environment with a custom configuration.

#ver2.3.0
library("sparklyr")
library("dplyr")
library("plotly")
library("DT")

my_config = spark_config()
my_config$`sparklyr.shell.driver-memory` = "5G"
my_config$`sparklyr.shell.executor-memory` = "5G"
my_config$`sparklyr.cores.local`=9


sc = spark_connect(master = "local", version = "2.3.0",config = my_config)
spark_connection_is_open(sc)
## [1] TRUE

Loading Ratings Data

The first significant step is loading the data, which is stored locally as a tab-separated file. The histograms below show that the listening weight column is highly skewed, so its logarithm is plotted alongside it.

user_artists = read.csv("C:/Users/OmegaCel/Documents/MasterDataAnalytics/643Recomender Sys/Summer2018/Project6Final/Data/lastfm/user_artists.dat", strip.white = T,stringsAsFactors = F,sep = "\t")


par(mfrow=c(1,2))
hist(as.numeric(user_artists$weight),1000,xlim = c(1,10000),main = "Listening Weight Distribution",xlab = "")
hist(log(user_artists$weight),25,main = "Log of Listening Weight",xlab = "")

The listening weights were converted into a rating from 1 to 5 by splitting them into quintiles, so the rating increases with the weight. For example, a listen count of one falls in the lowest quintile and becomes a one-star rating.

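# Bin the weights into quintiles labelled 1 to 5.
# include.lowest = TRUE keeps the minimum weight in the first bin instead of NA.
user_artists$Rating = cut(user_artists$weight,breaks = c(quantile(user_artists$weight, probs = seq(0, 1, by = 0.20))),labels = c (1:5),include.lowest= T,ordered_result=T)

# Safeguard: assign any remaining NA to the lowest rating
user_artists$Rating[is.na(user_artists$Rating)]=1

user_artists$Rating= as.numeric(user_artists$Rating)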
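
The quintile binning can be checked on a small example. The sketch below uses made-up listen counts (not values from the dataset) and only base R. It also shows that the minimum value becomes NA when `include.lowest` is not in effect, which is the case the NA replacement guards against.

```r
# Hypothetical listen counts, chosen only for illustration
weight <- c(1, 5, 20, 60, 150, 400, 900, 2500, 7000, 30000)

# Quintile boundaries: 0%, 20%, 40%, 60%, 80%, 100%
breaks <- quantile(weight, probs = seq(0, 1, by = 0.20))

# With include.lowest = TRUE the minimum falls into the first bin
rating <- as.integer(cut(weight, breaks = breaks, labels = 1:5,
                         include.lowest = TRUE))

# Without it, the minimum lies outside the first (left-open) interval
rating_without <- cut(weight, breaks = breaks, labels = 1:5)

print(rating)
print(sum(is.na(rating_without)))
```

Each rating level receives the same number of observations here because the bins are defined by quantiles rather than by fixed weight ranges.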

The code below combines the remaining datasets and derives genre indicator columns from the artist tags. Finally, the data is stored in a CSV file.

user_taggedartists = read.csv("C:/Users/OmegaCel/Documents/MasterDataAnalytics/643Recomender Sys/Summer2018/Project6Final/Data/lastfm/user_taggedartists.dat", strip.white = T,stringsAsFactors = F,sep = "\t")

tags = read.csv("C:/Users/OmegaCel/Documents/MasterDataAnalytics/643Recomender Sys/Summer2018/Project6Final/Data/lastfm/tags.dat", strip.white = T,stringsAsFactors = F,sep = "\t")

# Collapse all tags applied to an artist into a single string
artist_tags = user_taggedartists %>%
  select(userID,artistID,tagID) %>%
  left_join(tags , by = "tagID") %>%
  select(artistID,tagValue) %>%
  group_by(artistID) %>%
  summarise(tags= toString(tagValue)) %>%
  ungroup()

artist_tags$tags= gsub(",", ";", artist_tags$tags)

artists = read.csv("C:/Users/OmegaCel/Documents/MasterDataAnalytics/643Recomender Sys/Summer2018/Project6Final/Data/lastfm/artists.csv", strip.white = T,stringsAsFactors = F,sep = ",")


user_artists = user_artists %>% left_join(artists , by = c("artistID" = "id")) %>%left_join(artist_tags , by = c("artistID" = "artistID"))

user_artists$tags = tolower(user_artists$tags)

# Genre indicators: 1 if the tag string contains the pattern, otherwise 0
user_artists$Rock = as.integer(grepl("rock", user_artists$tags))
user_artists$POP = as.integer(grepl("pop", user_artists$tags))
# "electronic" already matches "electronic dance music"
user_artists$Electronic = as.integer(grepl("electronic", user_artists$tags))
user_artists$Jazz = as.integer(grepl("jazz", user_artists$tags))
user_artists$HipHop = as.integer(grepl("hip-hop", user_artists$tags))

#user_artists=user_artists %>% select(-weight)

#write.csv(user_artists,"user_artistsFinalv2.csv",row.names = FALSE)

summary(user_artists)
##      userID        artistID         weight             Rating     
##  Min.   :   2   Min.   :    1   Min.   :     1.0   Min.   :1.000  
##  1st Qu.: 502   1st Qu.:  436   1st Qu.:   107.0   1st Qu.:2.000  
##  Median :1029   Median : 1246   Median :   260.0   Median :3.000  
##  Mean   :1037   Mean   : 3331   Mean   :   745.2   Mean   :2.996  
##  3rd Qu.:1568   3rd Qu.: 4350   3rd Qu.:   614.0   3rd Qu.:4.000  
##  Max.   :2100   Max.   :18745   Max.   :352698.0   Max.   :5.000  
##      name               url             pictureURL       
##  Length:92834       Length:92834       Length:92834      
##  Class :character   Class :character   Class :character  
##  Mode  :character   Mode  :character   Mode  :character  
##                                                          
##                                                          
##                                                          
##      tags                Rock             POP           Electronic   
##  Length:92834       Min.   :0.0000   Min.   :0.0000   Min.   :0.000  
##  Class :character   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.000  
##  Mode  :character   Median :1.0000   Median :1.0000   Median :0.000  
##                     Mean   :0.6409   Mean   :0.5637   Mean   :0.335  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.000  
##       Jazz             HipHop      
##  Min.   :0.00000   Min.   :0.0000  
##  1st Qu.:0.00000   1st Qu.:0.0000  
##  Median :0.00000   Median :0.0000  
##  Mean   :0.08814   Mean   :0.1242  
##  3rd Qu.:0.00000   3rd Qu.:0.0000  
##  Max.   :1.00000   Max.   :1.0000
rm(user_taggedartists,tags,artist_tags,artists)
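
The tag collapsing and genre flagging steps can be illustrated with a toy example. This is a base R sketch under assumptions: the `tag_rows` data frame and its values are invented, and `split` with `vapply` stands in for the dplyr pipeline used above. It also shows that substring matching is broad, since "pop" matches "k-pop" as well.

```r
# Hypothetical tag assignments, one row per (artist, tag) pair
tag_rows <- data.frame(
  artistID = c(1, 1, 2, 2, 3),
  tagValue = c("Rock", "Indie Rock", "K-Pop", "Dance", "Jazz"),
  stringsAsFactors = FALSE
)

# Collapse the tags of each artist into one lower-case string
tags_by_artist <- vapply(
  split(tag_rows$tagValue, tag_rows$artistID),
  function(v) tolower(paste(v, collapse = "; ")),
  character(1)
)

# Indicator columns: 1 when the pattern occurs anywhere in the string
rock_flag <- as.integer(grepl("rock", tags_by_artist))
pop_flag  <- as.integer(grepl("pop", tags_by_artist))

print(tags_by_artist)
print(rock_flag)
print(pop_flag)
```

If narrower matching is needed, the pattern could be anchored on word boundaries, at the cost of missing compound tags.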

Top Artists

Plotting the 15 artists with the highest total play counts identifies Britney Spears, Depeche Mode, and Lady Gaga as the artists users listened to most.

user_artists %>%
  select(name,weight) %>%
  group_by(name) %>%
  summarise(Total_Plays = sum(weight)) %>%
  top_n(15, Total_Plays) %>%
  plot_ly(x = ~name, y = ~Total_Plays, type = 'bar') %>%
  config(showLink=F,collaborate = F,displaylogo=F) %>%
  layout(margin = list( b = 100,r=50),title  = 'Top 15 Most Listened Artists',xaxis = list(title = "Artist"))

Load Data into Spark

Because Spark was unable to load the data frame directly, an alternative method was used: the data was loaded from the CSV file created earlier.

#spark_user_artists = sdf_copy_to(sc,user_artists, overwrite = TRUE)

spark_user_artists = spark_read_csv(
  sc, "spark_user_artists",
  "C:/Users/OmegaCel/Documents/MasterDataAnalytics/643Recomender Sys/Summer2018/Project6Final/user_artistsFinalv2.csv",
  delimiter = ",", header = TRUE,
  # Explicit column types, so schema inference is switched off
  columns = list(userID = "integer",artistID = "integer",Rating = "integer",
                 name="character",url="character",pictureURL="character",tags="character",
                 Rock="integer",POP="integer",Electronic="integer",Jazz="integer",HipHop="integer"),
  infer_schema = FALSE,charset = "UTF-8", null_value = NULL,
  repartition = 0, memory = TRUE, overwrite = TRUE)

Confirm Spark table

src_tbls(sc)
## [1] "spark_user_artists"

ALS Model

The first Alternating Least Squares (ALS) model was fitted on the entire data set using the rating, user, and artist columns. The model uses a regularization parameter of 0.1 to reduce overfitting and runs for five iterations.

model = ml_als_factorization(spark_user_artists,rating_col ="Rating",user_col ="userID",item_col="artistID",regularization.parameter = 0.1,iter.max = 5)
summary(model)
##                         Length Class      Mode       
## uid                     1      -none-     character  
## param_map               4      -none-     list       
## rank                    1      -none-     numeric    
## recommend_for_all_items 1      -none-     function   
## recommend_for_all_users 1      -none-     function   
## item_factors            2      tbl_spark  list       
## user_factors            2      tbl_spark  list       
## user_col                1      -none-     character  
## item_col                1      -none-     character  
## prediction_col          1      -none-     character  
## .jobj                   2      spark_jobj environment
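
To make the alternating updates concrete, here is a minimal base R sketch on a made-up 4 x 3 rating matrix. It is not Spark's implementation, which is distributed and differs in its details. The matrix `R`, the rank `k`, and the helper `solve_factors` are illustrative assumptions, while `lambda = 0.1` and the five iterations mirror the settings used above.

```r
set.seed(1)

# Hypothetical ratings: rows are users, columns are artists, NA = unobserved
R <- matrix(c(5,  3, NA,
              4, NA,  1,
              NA, 2,  5,
              1,  1,  4), nrow = 4, byrow = TRUE)

k <- 2         # number of latent factors (assumed)
lambda <- 0.1  # regularization parameter
iters <- 5     # number of alternating sweeps

# Random starting factors for users (U) and artists (V)
U <- matrix(runif(nrow(R) * k), nrow(R), k)
V <- matrix(runif(ncol(R) * k), ncol(R), k)

# Ridge regression for one factor vector, using only the observed ratings
solve_factors <- function(ratings, other) {
  seen <- !is.na(ratings)
  A <- other[seen, , drop = FALSE]
  solve(crossprod(A) + lambda * diag(ncol(other)),
        crossprod(A, ratings[seen]))
}

# RMSE over the observed entries for the current factors
rmse <- function() sqrt(mean((R - U %*% t(V))^2, na.rm = TRUE))

rmse_start <- rmse()
for (i in seq_len(iters)) {
  # Fix the artist factors and solve for every user, then the reverse
  for (u in seq_len(nrow(R))) U[u, ] <- solve_factors(R[u, ], V)
  for (a in seq_len(ncol(R))) V[a, ] <- solve_factors(R[, a], U)
}
rmse_end <- rmse()

print(c(start = rmse_start, end = rmse_end))
```

Each sweep fixes one set of factors so that the other set has a closed-form least squares solution, which is where the name comes from. The product `U %*% t(V)` also fills in the unobserved cells, and those values are what a recommender ranks.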

Here the predictions are collected from Spark into R to analyze the results further.

prediction =collect(sdf_predict(spark_user_artists, model))

The predictions are clamped to the rating scale so that no value falls below 1 or above 5. The resulting predictions are displayed in the searchable table below.

prediction$prediction =ifelse(prediction$prediction > 5.0,5,prediction$prediction)

prediction$prediction =ifelse(prediction$prediction < 1.0,1,prediction$prediction)

prediction %>% select(userID,name,prediction) %>% mutate(prediction=round(prediction,2)) %>%datatable( colnames = c('USER', 'ARTIST', 'PREDICTION'),options = list(pageLength = 10,columnDefs = list(list(className = 'dt-center', targets="_all"))),rownames = FALSE)

The RMSE for the overall model, computed on the same data the model was fitted on, is:

Model_one_RMSE = sqrt(mean((prediction$Rating -prediction$prediction)^2))
Model_one_RMSE
## [1] 0.5588021
hist(prediction$prediction,25,xlab = "",main = "Prediction Histogram")
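
The clamping and RMSE steps can be verified by hand on a few numbers. The following sketch uses invented `actual` and `predicted` vectors and swaps the two `ifelse` calls for `pmin` and `pmax`, which give the same result for this purpose.

```r
# Hypothetical ratings and raw model outputs
actual    <- c(1, 3, 5, 4)
predicted <- c(0.4, 3.2, 5.6, 3.5)

# Clamp to the 1-5 rating scale
clamped <- pmin(pmax(predicted, 1), 5)

# Root mean squared error between ratings and clamped predictions
rmse <- sqrt(mean((actual - clamped)^2))

print(clamped)
print(rmse)
```

Here the errors are 0, -0.2, 0 and 0.5, so the mean squared error is 0.0725 and the RMSE is its square root, roughly 0.27.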

After various transformations, we will recommend the top 5 artists for each user and display the results in a table.

results =ml_recommend(model, type = c("items", "users"), n = 5)

results= as.data.frame(results)

# One row per artist, used to attach names to the recommendations
artist_name = user_artists %>% distinct(artistID, name)
 
results %>% left_join(artist_name,by = c("artistID" = "artistID")) %>% select(userID,name,rating) %>% mutate(rating=round(rating,2))%>%datatable( colnames = c('USER', 'ARTIST', 'RATING'),options = list(pageLength = 10,columnDefs = list(list(className = 'dt-center', targets="_all"))),rownames = FALSE)

Two more models will be built using the ALS function. The first is an ALS model fitted on a training set and evaluated on a test set, and the second uses implicit preferences.

partitions = spark_user_artists %>%  sdf_partition(training = 0.70, test = 0.30, seed = 143)

train = partitions$training
test = partitions$test

Model 2

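Model two uses the same settings as the first model but is fitted on the training set and evaluated on the test set. It also sets the cold start strategy to "drop", so test rows whose user or artist did not appear in the training data, and which would otherwise receive NaN predictions, are removed from the predictions.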

model_two = ml_als_factorization(train,rating_col ="Rating",user_col ="userID",item_col="artistID",regularization.parameter = 0.1,iter.max = 5,cold_start_strategy="drop")
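prediction_model_two =collect(sdf_predict(test,model_two))

# Clamp model two's own predictions to the 1-5 scale. The fallback value must
# come from prediction_model_two, not from model one's `prediction` data frame.
prediction_model_two$prediction =ifelse(prediction_model_two$prediction > 5.0,5,prediction_model_two$prediction)

prediction_model_two$prediction =ifelse(prediction_model_two$prediction < 1.0,1,prediction_model_two$prediction)

Model_two_RMSE = sqrt(mean((prediction_model_two$Rating -prediction_model_two$prediction)^2))

# The value printed below is from the original run, which clamped against
# model one's predictions; rerun this chunk to refresh it.
Model_two_RMSE
## [1] 1.872053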
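
The effect of the cold start setting can be sketched without Spark. In this base R illustration the `ratings` data frame and the manual split are invented, and the `%in%` checks only imitate what the "drop" strategy achieves: a test row is scored only if both its user and its artist were seen during training.

```r
# Hypothetical ratings
ratings <- data.frame(
  userID   = c(1, 1, 2, 2, 3, 4),
  artistID = c(10, 11, 10, 12, 11, 13),
  Rating   = c(5, 3, 4, 2, 1, 5)
)

# Manual split for illustration (the report uses a random 70/30 split)
train <- ratings[c(1, 2, 4, 5), ]
test  <- ratings[c(3, 6), ]

# A test row is "cold" if its user or its artist never appears in training
cold <- !(test$userID %in% train$userID) |
        !(test$artistID %in% train$artistID)

# Only the remaining rows can receive a prediction
scorable <- test[!cold, ]

print(cold)
print(scorable)
```

User 4 never appears in the training rows, so the second test row is dropped. Only the row for user 2 and artist 10 would contribute to the test RMSE.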

Model 3

Model three treats the ratings as implicit feedback, that is, as evidence of user behavior (such as whether a user listened to an artist) rather than as explicit scores. Instead of a predicted rating, the model produces a preference score for each user and artist pair. A higher number indicates higher confidence that the user likes the artist.

model_three = ml_als_factorization(train,rating_col ="Rating",user_col ="userID",item_col="artistID",regularization.parameter = 0.1,iter.max = 5,cold_start_strategy="drop",implicit_prefs=TRUE)
prediction_model_three =collect(sdf_predict(test,model_three))

prediction_model_three %>% select(userID,name,prediction) %>%head(10)
## # A tibble: 10 x 3
##    userID name            prediction
##     <int> <chr>                <dbl>
##  1    566 The Crüxshadows    0.0263 
##  2   1551 The Crüxshadows    0.0660 
##  3   1679 The Crüxshadows    0.0491 
##  4   2046 Cradle of Filth    0.253  
##  5    735 Cradle of Filth    0.110  
##  6    459 Cradle of Filth    0.270  
##  7    313 Cradle of Filth    0.215  
##  8    205 Cradle of Filth    0.116  
##  9    806 Cradle of Filth    0.0449 
## 10    280 Dawn of Ashes      0.00299
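
As a rough illustration of how implicit feedback is usually framed, the sketch below follows the formulation of Hu, Koren and Volinsky: an observed value becomes a binary preference plus a confidence weight that grows with the value. The `plays` vector is invented and `alpha = 1` is an assumption, since the model above does not set it explicitly.

```r
# Hypothetical interaction strengths for one user across four artists
plays <- c(0, 1, 3, 5)
alpha <- 1  # assumed confidence scaling factor

# Binary preference: did the user interact with the artist at all?
preference <- as.numeric(plays > 0)

# Confidence in that preference grows linearly with the observed value
confidence <- 1 + alpha * plays

print(data.frame(plays, preference, confidence))
```

The factorization then tries to reproduce the 0/1 preferences, weighting each squared error by its confidence. This is why the predictions above are small values near the 0 to 1 range rather than ratings on the 1 to 5 scale.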

Conclusion

Model one had an RMSE of 0.56, measured on the same data it was trained on, while the run reported above gave model two an RMSE of 1.87 on the held-out test set. Model three needs to be evaluated with a ROC curve and a confusion matrix, but this was not implemented because of time constraints. Additional user attributes such as country could lead to better recommendations for specific users, but this field is missing from the dataset.

The project was exciting and fun. Most of the time was spent getting Spark to work correctly. Because of the lack of time, some features I wanted to implement were not completed. Even so, I learned a great deal in the process.

For a future implementation, I found a larger Last.fm dataset with about 350k users that includes features such as sex and country. I would like to implement multiple recommender algorithms on it to produce better recommendations.

spark_disconnect(sc)
rm(list = ls())

References

When using this dataset you should cite:

- Last.fm website, http://www.lastfm.com

You may also cite HetRec’11 workshop as follows:

@inproceedings{Cantador:RecSys2011,
  author = {Cantador, Iv\'{a}n and Brusilovsky, Peter and Kuflik, Tsvi},
  title = {2nd Workshop on Information Heterogeneity and Fusion in Recommender Systems (HetRec 2011)},
  booktitle = {Proceedings of the 5th ACM conference on Recommender systems},
  series = {RecSys 2011},
  year = {2011},
  location = {Chicago, IL, USA},
  publisher = {ACM},
  address = {New York, NY, USA},
  keywords = {information heterogeneity, information integration, recommender systems},
}

Credits

This dataset was built by Ignacio Fernández-Tobías with the collaboration of Iván Cantador and Alejandro Bellogín, members of the Information Retrieval group at Universidad Autonoma de Madrid (http://ir.ii.uam.es)