The goal of this assignment is to give you practice working with accuracy and other recommender system metrics. In this assignment you're asked to do at least one, or (if you like) both, of the following:
• Work in a small group, and/or
• Choose a dataset different from the ones you worked with in your previous projects.
Deliverables
1. As in your previous assignments, compare the accuracy of at least two recommender system algorithms against your offline data.
2. Implement support for at least one business or user experience goal, such as increased serendipity, novelty, or diversity.
3. Compare and report on any change in accuracy before and after you've made the change in #2.
4. As part of your textual conclusion, discuss one or more additional experiments that could be performed and/or metrics that could be evaluated only if online evaluation were possible. Also, briefly propose how you would design a reasonable online evaluation environment.
You'll find some of the material discussed in this week's reading helpful in completing this project. You may also want to look at papers online, or search for "recsys" on YouTube or slideshare.net.
For this assignment I will be using the MovieLens ratings dataset, which provides extensive user ratings for a large set of movies. The link below contains detailed information about the dataset.
Dataset Link: https://grouplens.org/datasets/movielens/
library(dplyr)
library(tidyr)
library(recommenderlab)
library(ggplot2)

# Read the ratings, drop the timestamp, and reshape into a user x movie matrix
ratings_na <- read.csv("data/ratings.csv", sep = ",", header = TRUE) %>%
  select(-timestamp) %>% spread(movieId, rating) %>%
  select(-userId) %>% as.matrix()
movie <- ratings_na %>% as("realRatingMatrix")

Explore the movies dataset with different chart types to see how the ratings are distributed.
vector_ratings <- as.vector(movie@data)
#Clearly the distribution has many 0 or missing ratings
qplot(vector_ratings) + ggtitle("distribution of ratings")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# Table count of each rating value
table(vector_ratings)
## vector_ratings
##       0  0.5    1  1.5    2  2.5     3   3.5     4  4.5     5
## 5983282 1101 3326 1687 7271 4449 20064 10538 28750 7723 15095
vector_ratings <- vector_ratings[vector_ratings !=0]
qplot(vector_ratings) + ggtitle("Distribution of ratings without 0")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
avg_ratings_user <- rowMeans(movie@data)
qplot(avg_ratings_user) + stat_bin(binwidth = .01) + ggtitle("Distribution of average user rating")
avg_ratings_movies <- colMeans(movie@data)
qplot(avg_ratings_movies) + stat_bin(binwidth = .01) + ggtitle("Distribution of average movie rating")
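Note that movie@data is a sparse matrix in which missing ratings are stored as 0, so the user and movie averages above are pulled toward zero. As an alternative sketch (relying on recommenderlab's rowMeans()/colMeans() methods for a realRatingMatrix, which skip missing entries), the same summaries can be computed on the rating matrix directly:

# A sketch: means computed on the realRatingMatrix ignore missing ratings
avg_user_nonmissing  <- rowMeans(movie)
avg_movie_nonmissing <- colMeans(movie)
qplot(avg_movie_nonmissing) + ggtitle("Average movie rating (non-missing only)")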
Next, filter the dataset to keep only users and movies with a reasonable number of ratings: users who rated more than 30 movies, and movies rated by more than 80 users.
movies_ratings <- movie[rowCounts(movie) > 30, colCounts(movie) > 80]
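As a quick sanity check (a sketch; the exact counts depend on the MovieLens snapshot used), the matrix dimensions before and after filtering can be compared:

dim(movie)           # users x movies before filtering
dim(movies_ratings)  # users x movies after filtering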
Below is a helper function for the baseline predictor used to fill in missing values: each missing rating is estimated as the raw (global) mean plus the user bias plus the movie bias.
baseline_predictor <- function(df, train_raw_mean) {
  # User bias: mean rating of each user minus the raw mean
  user_mean <- c(rowMeans(df, na.rm = TRUE) - train_raw_mean)
  # Movie bias: mean rating of each movie minus the raw mean
  movie_mean <- c(colMeans(df, na.rm = TRUE) - train_raw_mean)
  temp_df <- data.frame()
  for (i in 1:nrow(df)) {
    # Raw mean plus this user's bias plus every movie's bias
    final_bias <- train_raw_mean + user_mean[i] + movie_mean
    temp_df <- rbind(temp_df, final_bias)
  }
  # Reset the column names
  temp_df <- setNames(temp_df, c(1:ncol(temp_df))) %>% data.frame()
  # Cells that are still NA fall back to the raw mean
  temp_df[is.na(temp_df)] <- train_raw_mean
  # Return the baseline predicted values
  return(temp_df)
}
Missing values can be imputed using the different methods implemented below.
missingval <- function(umatrix, method) {
  if (method == "mean") {
    # Replace NAs with the global mean
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix[is.na(umatrix)] <- train_raw_mean
  } else if (method == "mean_center") {
    # Replace NAs with the global mean, then center each column
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix[is.na(umatrix)] <- train_raw_mean
    umatrix <- scale(umatrix, center = TRUE, scale = FALSE) %>% as.matrix()
  } else if (method == "withna") {
    # Leave the missing values as NA
    umatrix <- umatrix
  } else if (method == "baseline") {
    # Fill NAs with the baseline predictor defined above
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix <- baseline_predictor(umatrix, train_raw_mean) %>% as.matrix()
  }
  return(umatrix)
}
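As a quick illustration of these imputation options (a sketch with a made-up toy matrix, not the MovieLens data), the helper can be applied to a small user-item matrix; the "baseline" option calls baseline_predictor() internally:

# A minimal sketch: a hypothetical 3 x 4 toy matrix with missing entries
toy <- matrix(c(5, 3, NA, 1,
                4, NA, 2, NA,
                NA, 1, 4, 5), nrow = 3, byrow = TRUE)
missingval(toy, "mean")      # fill NAs with the global mean
missingval(toy, "baseline")  # fill NAs with raw mean + user bias + movie bias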
The function below splits the data into training and test sets using one of three evaluation schemes: a single train/test split, bootstrap sampling, or k-fold cross-validation.
percentage_training <- 0.8  # fraction of users placed in the training set
items_to_keep <- 2          # items given to the recommender for each test user
rating_threshold <- 3       # ratings at or above this value count as "good"
n_eval <- 1                 # number of evaluation repetitions for split/bootstrap
n_fold <- 4                 # number of folds for cross-validation

evaluationmethod <- function(umatrix, method) {
  if (method == "split") {
    eval_sets <- evaluationScheme(data = umatrix, method = "split",
                                  train = percentage_training, given = items_to_keep,
                                  goodRating = rating_threshold, k = n_eval)
  } else if (method == "bootstrap") {
    eval_sets <- evaluationScheme(data = umatrix, method = "bootstrap",
                                  train = percentage_training, given = items_to_keep,
                                  goodRating = rating_threshold, k = n_eval)
  } else if (method == "cv") {
    eval_sets <- evaluationScheme(data = umatrix, method = "cross-validation",
                                  k = n_fold, given = items_to_keep,
                                  goodRating = rating_threshold)
  }
  return(eval_sets)
}
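As a sketch of how these scheme objects are used later on (the variable name sets here is only illustrative), recommenderlab's getData() exposes the three partitions that an evaluationScheme produces:

# A sketch: an 80/20 split scheme and the three partitions it exposes
sets <- evaluationmethod(movies_ratings, "split")
getData(sets, "train")    # ratings used to fit a model
getData(sets, "known")    # items revealed to the model at prediction time
getData(sets, "unknown")  # held-out items used to score the predictions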
Below are the different models we will evaluate in order to find the best one for this dataset.
models_to_evaluate <- list(
  IBCF_cos = list(name = "IBCF", param = list(method = "cosine", normalize = "center")),
  IBCF_cor = list(name = "IBCF", param = list(method = "pearson", normalize = "center")),
  UBCF_cos = list(name = "UBCF", param = list(method = "cosine", normalize = "center")),
  UBCF_cor = list(name = "UBCF", param = list(method = "pearson", normalize = "center")),
  SVD      = list(name = "SVD",  param = list(normalize = "center")),
  SVDF     = list(name = "SVDF", param = list(normalize = "center")),
  random   = list(name = "RANDOM",  param = NULL),
  popular  = list(name = "POPULAR", param = NULL),
  als      = list(name = "ALS",     param = NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
movies_ratings_clean <- missingval(movies_ratings, "withna")
eval_sets <- evaluationmethod(movies_ratings_clean, "cv")
list_results <- evaluate(x = eval_sets, method = models_to_evaluate,
                         n = n_recommendations, progress = FALSE)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.19sec/0.17sec]
## 2 [0.38sec/0.09sec]
## 3 [0.13sec/0.03sec]
## 4 [0.13sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.12sec/0.08sec]
## 2 [0.16sec/0.03sec]
## 3 [0.14sec/0.03sec]
## 4 [0.14sec/0.03sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.22sec]
## 2 [0.05sec/0.2sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.2sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.16sec]
## 2 [0sec/0.17sec]
## 3 [0sec/0.17sec]
## 4 [0sec/0.19sec]
## SVD run fold/sample [model time/prediction time]
## 1 [0.02sec/0.07sec]
## 2 [0.55sec/0.08sec]
## 3 [0.02sec/0.08sec]
## 4 [0.02sec/0.08sec]
## SVDF run fold/sample [model time/prediction time]
## 1 [13.54sec/2.69sec]
## 2 [9.52sec/1.06sec]
## 3 [8.66sec/1.12sec]
## 4 [9.59sec/1.11sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.06sec]
## 2 [0.03sec/0.06sec]
## 3 [0sec/0.06sec]
## 4 [0sec/0.09sec]
## POPULAR run fold/sample [model time/prediction time]
## 1 [0sec/0.29sec]
## 2 [0.05sec/0.33sec]
## 3 [0sec/0.31sec]
## 4 [0sec/0.3sec]
## ALS run fold/sample [model time/prediction time]
## 1 [0sec/9.02sec]
## 2 [0.11sec/9.06sec]
## 3 [0sec/9.03sec]
## 4 [0sec/8.8sec]
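Before turning to the curves, the fold-averaged metrics for any single model can also be inspected directly; a sketch for the POPULAR results (the list element name matches the one defined in models_to_evaluate above):

# A sketch: fold-averaged confusion-matrix metrics (TP, FP, FN, TN, precision,
# recall, TPR, FPR) for the "popular" model
avg(list_results[["popular"]])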
# ROC curve
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")

# Precision-recall curve
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
From the charts above it seems the POPULAR algorithm outperforms the other algorithms, so we will use it to predict the ratings.
items_to_recommend <- 6
eval_recommender <- Recommender(data = getData(eval_sets, "train"),
                                method = "POPULAR", parameter = NULL)
eval_prediction <- predict(object = eval_recommender,
                           newdata = getData(eval_sets, "known"),
                           n = items_to_recommend, type = "ratings")
# Prediction accuracy
(eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
                                         data = getData(eval_sets, "unknown"),
                                         byUser = FALSE))
##      RMSE       MSE       MAE
## 1.0761522 1.1581036 0.8095463
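For a direct accuracy comparison against a second algorithm, the same rating-prediction workflow can be repeated with another recommender; a sketch using UBCF (the choice of method here is illustrative, not part of the original run):

# A sketch: RMSE/MSE/MAE for a UBCF model on the same train/known/unknown partitions
ubcf_rec  <- Recommender(data = getData(eval_sets, "train"), method = "UBCF")
ubcf_pred <- predict(object = ubcf_rec, newdata = getData(eval_sets, "known"),
                     type = "ratings")
calcPredictionAccuracy(x = ubcf_pred, data = getData(eval_sets, "unknown"), byUser = FALSE)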
Comparing the results above, the RMSE obtained for the POPULAR recommender serves as the accuracy baseline against which a serendipity-, novelty-, or diversity-oriented change to the recommendations can be measured.