The goal of this assignment is to give you practice working with matrix factorization techniques.
Your task is to implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system.
You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.
SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).
• SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue.
• SVD requires that there are no missing values. There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3)
• Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You may need to create a subset of your data for SVD calculations to be successfully performed, especially on a machine with a small RAM footprint.
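To make the points above concrete, here is a minimal sketch in R of reducing a small ratings matrix to k latent features. The toy matrix, the choice of k = 2, and the use of global-mean imputation are all assumptions made for illustration; imputation is just one of the options listed above for handling missing values.

```r
# Toy ratings matrix: 4 users x 5 items, NA marks a missing rating
# (illustrative values, not taken from any real dataset).
ratings <- matrix(c( 5,  3, NA,  1,  4,
                     4, NA,  2,  1,  5,
                     1,  1,  5, NA,  2,
                    NA,  2,  4,  5,  1),
                  nrow = 4, byrow = TRUE)

# svd() cannot handle NA, so fill the gaps with the global mean first.
global_mean <- mean(ratings, na.rm = TRUE)
filled <- ratings
filled[is.na(filled)] <- global_mean

# Full decomposition: filled = U %*% diag(d) %*% t(V)
s <- svd(filled)

# Keep only the k largest singular values (k = 2 is an arbitrary choice here).
k <- 2
approx_k <- s$u[, 1:k, drop = FALSE] %*%
  diag(s$d[1:k], nrow = k) %*%
  t(s$v[, 1:k, drop = FALSE])

# Each user is now described by k latent features instead of 5 item columns.
user_features <- s$u[, 1:k, drop = FALSE] %*% diag(s$d[1:k], nrow = k)
dim(user_features)
```

The rank-k matrix `approx_k` has the same shape as the original and contains a predicted value for every user/item pair, including the cells that were missing before imputation.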
For this assignment, I will be using the Jester ratings dataset, which provides extensive ratings from users for different jokes. The link below contains detailed information about the dataset.
Dataset Link: http://eigentaste.berkeley.edu/dataset/
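Importing the data from the CSV file
# dplyr provides %>%, select() and rename(); recommenderlab provides Recommender()
library(dplyr)
library(recommenderlab)
# The value 99 marks a missing rating; the first column is dropped
jester <- read.csv('C:/Users/paperspace/Google Drive/CUNY/Courses/Archive/643 Project2/jesterfinal151cols.csv', header = FALSE, sep = ",",
                   stringsAsFactors = FALSE, na.strings = c('99')) %>% select(c(-1))

As the dataset is huge, for this project we are going to take only a fraction of the data.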
set.seed(7340)
# Shuffle the rows, then keep the first 1% as the working sample
sample_rows <- sample(nrow(jester))
jester <- jester[sample_rows, ]
count <- floor(nrow(jester) * 0.01)
jester_matrix <- jester[1:count, ]

Cleaning up the NA data in the dataset.
# Reference: rpubs.com/waltw/285262
# Drop the columns (jokes) that contain no ratings at all
allNA <- sapply(jester_matrix, function(x) all(is.na(x)))
jester_matrix <- jester_matrix[, !allNA]
# Drop the columns that have no missing values, keeping only jokes with at least one NA
noNA <- sapply(jester_matrix, function(x) all(!is.na(x)))
jester_matrix <- jester_matrix[, !noNA] %>% as.matrix()

Here we create a function to pre-process the data and perform the modelling. Below are some helper functions.
Creating a function for calculating the baseline predictor
# Project1 reference
baseline_predictor = function(df, train_raw_mean) {
  # User bias: mean of each user's ratings minus the raw mean
  user_mean = c(rowMeans(df, na.rm = TRUE) - train_raw_mean)
  # Item (joke) bias: mean of each joke's ratings minus the raw mean
  movie_mean = c(colMeans(df, na.rm = TRUE) - train_raw_mean)
  temp_df = data.frame()
  for (i in 1:nrow(df)) {
    # Baseline for user i: raw mean + user bias + bias of every joke
    final_bias <- train_raw_mean + user_mean[i] + movie_mean
    temp_df <- rbind(temp_df, final_bias)
  }
  # Set temporary column names
  temp_df = setNames(temp_df, c(1:ncol(temp_df))) %>% data.frame()
  # Fall back to the raw mean wherever a bias could not be computed
  temp_df[is.na(temp_df)] = train_raw_mean
  # Return the baseline predicted values
  return(temp_df)
}

Creating a function for calculating SVD. This function automatically picks the smallest k for which the leading singular values retain at least 80% of the total energy (the sum of the squared singular values).
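svd_val <- function(matrix) {
  # svd() requires a matrix without missing values
  s_center <- svd(matrix)
  diagonal <- s_center$d
  threshold = 0
  # Find the smallest k whose squared singular values hold at least 80% of the energy
  for (i in 1:length(diagonal)) {
    if (sum(diagonal[1:i]^2) / sum(s_center$d^2) >= .80) {
      threshold = i
      #print(paste("Threshold value:", threshold))
      break
    }
  }
  # Zero out every singular value after the first `threshold` values
  diagonal[seq_along(diagonal) > threshold] = 0
  s_svd <- s_center$u %*% diag(diagonal) %*% t(s_center$v)
  # rmse() is not defined in this document; it is assumed to come from a loaded package such as Metrics
  return(rmse(matrix, s_svd))
}

Below are the steps followed in the main function, missingval: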
Part 1:
1. Fill the missing data with the global mean
2. Fill the missing data with the global mean and center it
3. Leave the original missing data unchanged
4. Create a baseline prediction of all the values
Part 2:
1. Perform simple SVD and calculate RMSE
2. Perform SVD using recommenderlab SVD
3. Perform SVD using recommenderlab SVDF
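As a small illustration of the 80% rule used in the simple SVD step, the following sketch picks k from a made-up vector of singular values. The values in `d` are hypothetical, and `cumsum()` is used here as a compact alternative to an explicit loop; it is not the code used in the analysis itself.

```r
# Hypothetical singular values, in decreasing order as svd() returns them.
d <- c(10, 6, 3, 2, 1)

# Cumulative share of the total energy (sum of squared singular values).
energy <- cumsum(d^2) / sum(d^2)

# Smallest k whose leading singular values retain at least 80% of the energy.
k <- which(energy >= 0.80)[1]

# Zero out the remaining singular values before reconstructing the matrix.
d_trunc <- d
d_trunc[seq_along(d_trunc) > k] <- 0
```

Here the squared values are 100, 36, 9, 4 and 1 (total 150). The first value alone holds about 67% of the energy and the first two hold about 91%, so k is 2 and `d_trunc` becomes 10, 6, 0, 0, 0.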
missingval <- function(umatrix, method, type, center) {
  # Note: the `center` argument is accepted but never used in the body.
  # Part 1: handle missing values according to `method`
  if (method == "mean") {
    # Replace every NA with the global mean
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix[is.na(umatrix)] = train_raw_mean
  } else if (method == "mean_center") {
    # Replace every NA with the global mean, then center each column on its mean
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix[is.na(umatrix)] = train_raw_mean
    umatrix <- scale(umatrix, center = T, scale = F) %>% as.matrix()
  } else if (method == "withna") {
    # Leave the missing values untouched
    umatrix <- umatrix
  } else if (method == "baseline") {
    # Replace the whole matrix with baseline predictions
    train_raw_mean <- mean(umatrix, na.rm = TRUE)
    umatrix <- baseline_predictor(umatrix, train_raw_mean) %>% as.matrix()
  }
  # Part 2: factorize according to `type` and collect the RMSE
  if (type == "basesvd") {
    rmse_value = svd_val(umatrix)
    newlist <- list("matrix" = umatrix, "rmse" = rmse_value)
  } else if (type == "recommendersvd") {
    jester_realmatrix <- umatrix %>% as("realRatingMatrix")
    recommender_model <- Recommender(jester_realmatrix, "SVD", parameter = list(normalize = "center"))
    recc_predicted <- predict(object = recommender_model, newdata = jester_realmatrix, type = "ratingMatrix")
    resultset_svd <- recc_predicted@data %>% as.matrix()
    newlist <- list("matrix" = resultset_svd, "rmse" = calcPredictionAccuracy(recc_predicted, jester_realmatrix))
  } else if (type == "recommendersvdf") {
    jester_realmatrix <- umatrix %>% as("realRatingMatrix")
    recommender_model <- Recommender(jester_realmatrix, "SVDF", parameter = list(normalize = "center"))
    recc_predicted <- predict(object = recommender_model, newdata = jester_realmatrix, type = "ratingMatrix")
    resultset_svdf <- recc_predicted@data %>% as.matrix()
    newlist <- list("matrix" = resultset_svdf, "rmse" = calcPredictionAccuracy(recc_predicted, jester_realmatrix))
  }
  return(newlist)
}

Call the function with different parameters and collect the RMSE values in a data frame.
text <- c("With Na dataset and recommenderlab SVD","With Na dataset and recommenderlab SVDF","Baseline predictors with Base SVD","Global mean with Base SVD","Global mean and centered data with Base SVD")
results <- data.frame(Calculation =text)
#With Na dataset and recommenderlab SVD
results[1,2] <- missingval(jester_matrix,"withna","recommendersvd",F)$rmse[1]
#With Na dataset and recommenderlab SVDF
results[2,2] <- missingval(jester_matrix,"withna","recommendersvdf",F)$rmse[1]
#Baseline predictors with Base SVD
results[3,2] <- missingval(jester_matrix,"baseline","basesvd",F)$rmse
#Global mean with Base SVD
results[4,2] <- missingval(jester_matrix,"mean","basesvd",F)$rmse
#Global mean and centered data with Base SVD
results[5,2] <- missingval(jester_matrix,"mean_center","basesvd",F)$rmse
results <- rename(results, RMSE = V2)

Below is the final RMSE obtained for each calculation.
results
##                                   Calculation      RMSE
## 1      With Na dataset and recommenderlab SVD 3.8144245
## 2     With Na dataset and recommenderlab SVDF 3.8144245
## 3           Baseline predictors with Base SVD 1.2585646
## 4                   Global mean with Base SVD 1.3928547
## 5 Global mean and centered data with Base SVD 0.9857861
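When reading this table, note that the rows are not all measured in the same way. As the code is written, the three Base SVD rows compare the filled-in matrix with its own low-rank reconstruction, while calcPredictionAccuracy, to my understanding, scores predictions only against the ratings that are actually present. A minimal sketch of an RMSE restricted to observed ratings is shown below; `rmse_observed` is a hypothetical helper, and the two small matrices are made-up values used only to show the calculation.

```r
# Hypothetical helper: RMSE over the entries that were actually rated.
rmse_observed <- function(actual, predicted) {
  observed <- !is.na(actual)
  sqrt(mean((actual[observed] - predicted[observed])^2))
}

# Made-up example: NA marks a rating the user never gave.
actual <- matrix(c( 5, NA, 1,
                   NA,  3, 2), nrow = 2, byrow = TRUE)
predicted <- matrix(c(4, 2, 1,
                      3, 3, 4), nrow = 2, byrow = TRUE)

rmse_observed(actual, predicted)
```

In this example the four observed cells have errors of 1, 0, 0 and -2, so the result is the square root of 1.25, roughly 1.118. The two predictions made for unrated cells do not affect the score.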
Observations are as follows