Project 3 - Matrix Factorization

Project Description

The goal of this assignment is give you practice working with Matrix Factorization techniques.

Your task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alternating Least Squares (ALS)—in the context of a recommender system.

You may approach this assignment in a number of ways. You are welcome to start with an existing recommender system written by yourself or someone else. Remember as always to cite your sources, so that you can be graded on what you added, not what you found.

SVD can be thought of as a pre-processing step for feature engineering. You might easily start with thousands or millions of items, and use SVD to create a much smaller set of “k” items (e.g. 20 or 70).

Notes/Limitations:

• SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue).

• SVD requires that there are no missing values. There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3) using a more advance technique, such as stochastic gradient descent to simulate SVD in populating the factored matrices.

• Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You may need to create a subset of your data for SVD calculations to be successfully performed, especially on a machine with a small RAM footprint.

Dataset

For this assignment, I will be using Jester ratings dataset. This provides extensive ratings from users for different jokes. Below link contains detailed information about the dataset.

Dataset Link: http://eigentaste.berkeley.edu/dataset/

Importing and Cleaning

Importing the data from the csv file

jester <-  read.csv('C:/Users/paperspace/Google Drive/CUNY/Courses/Archive/643 Project2/jesterfinal151cols.csv',header=FALSE,sep=",",
            stringsAsFactors = FALSE, na.strings = c('99')) %>% select(c(-1))

Sampling data

As the dataset is huge, for this project we are going to take only a fraction of data.

set.seed(7340)
sample_rows <- sample(nrow(jester))

jester <- jester[sample_rows,] 

count = nrow(jester) * .01
jester_matrix <- jester[1:count,]

Cleanup

Cleaning up the NA data in the dataset.

#rpubs.com/waltw/285262

#Filter the NA columns and all rows
allNA <- sapply(jester_matrix, function(x) all(is.na(x)))

jester_matrix <- jester_matrix[,!allNA]

noNA <- sapply(jester_matrix, function(x) all(!is.na(x)))

jester_matrix <- jester_matrix[, !noNA] %>% as.matrix()

Pre-pross data & Modelling

Here we are creating a function to pre-process the data and perform modelling. Below are some helper functions

Baseline predictor

Creating a function for calculating baseline predictor

# Project1 reference

baseline_predictor = function(df,train_raw_mean) {

#User bias: means of each user - raw mean
user_mean =  c(rowMeans(df,na.rm=TRUE)-train_raw_mean)

#book bias: means of each book- raw mean
movie_mean = c(colMeans(df,na.rm=TRUE)-train_raw_mean) 

temp_df = data.frame()

for(i in 1:nrow(df)){
  
  #add all the user and book bias
  final_bias <- train_raw_mean+ user_mean[i] +movie_mean
  temp_df <- rbind(temp_df,final_bias)
  
}

#Set temp names
temp_df = setNames(temp_df, c(1:ncol(temp_df))) %>% data.frame()

temp_df[is.na(temp_df)] = train_raw_mean


#Return the baseline predicted value
return(temp_df)
}

Base SVD Calculation

Creating a function for calculating SVD. This function will automatically pick the best k value which is of 80%.

svd_val <- function(matrix) {

s_center <- svd(matrix)

diagonal <- s_center$d

threshold = 0

for(i in 1:length(diagonal)) {
  
  if(sum(diagonal[1:i]^2)/sum(s_center$d^2) >= .80) {
    threshold =i
    #print(paste("Threshold value:", threshold))
    break
  }
}

diagonal[which(diagonal %in% diagonal) >threshold] = 0

s_svd<- s_center$u %*% diag(diagonal) %*% t(s_center$v)

return(rmse(matrix, s_svd))
}

Calculation of entire dataset

Below are the steps which are followed in this function

Part 1: 1. Cleanup data missing data via mean 2. Fill missing data via mean and center it 3. Do no changes to original missing data 4. Create a baseline prediction of all the values

Part 2: 1. Perform simple SVD and calculate RMSE 2. Perform SVD using recommenderlab SVD 3. Perform SVD using recommenderlab SVDF

missingval <- function(umatrix, method,type, center) {
 

if(method == "mean"){

  train_raw_mean <- mean(umatrix,na.rm = TRUE)
  umatrix[is.na(umatrix)] = train_raw_mean
  
}
  
else if(method =="mean_center"){
    train_raw_mean <- mean(umatrix,na.rm = TRUE)
  umatrix[is.na(umatrix)] = train_raw_mean
  
  umatrix <- scale(umatrix ,center = T,scale=F) %>% as.matrix()
}
else if(method=="withna"){
  umatrix <- umatrix
  
}
else if (method == "baseline"){
  
    train_raw_mean <- mean(umatrix,na.rm = TRUE)
  umatrix <- baseline_predictor(umatrix,train_raw_mean) %>% as.matrix()
  
}

  
if(type =="basesvd"){
  rmse_value = svd_val(umatrix)
  newlist <- list("matrix" = umatrix, "rmse" = rmse_value)
  
}
  
else if(type =="recommendersvd"){
  
  jester_realmatrix <- umatrix %>%  as("realRatingMatrix") 
  recommender_model <- Recommender(jester_realmatrix, "SVD",parameter = list(normalize = "center"))
  recc_predicted <- predict(object = recommender_model, newdata = jester_realmatrix,type="ratingMatrix")
  
   resultset_svd <- recc_predicted@data %>% as.matrix()
   
 newlist <- list("matrix" = resultset_svd, "rmse" = calcPredictionAccuracy(recc_predicted, jester_realmatrix))
  }
  
  else if(type =="recommendersvdf"){
  
  jester_realmatrix <- umatrix %>%  as("realRatingMatrix") 
  recommender_model <- Recommender(jester_realmatrix, "SVDF",parameter = list(normalize = "center"))
  recc_predicted <- predict(object = recommender_model, newdata = jester_realmatrix,type="ratingMatrix")
  
   resultset_svdf <- recc_predicted@data %>% as.matrix()
   
 newlist <- list("matrix" = resultset_svdf, "rmse" = calcPredictionAccuracy(recc_predicted, jester_realmatrix))
  }
  

return(newlist)
  
}

Validation of dataset

Call the function with differerent parameters and get the dataset with RMSE

text <- c("With Na dataset and recommenderlab SVD","With Na dataset and recommenderlab SVDF","Baseline predictors with Base SVD","Global mean with Base SVD","Global mean and centered data with Base SVD")


results <- data.frame(Calculation =text)



#With Na dataset and recommenderlab SVD

results[1,2] <- missingval(jester_matrix,"withna","recommendersvd",F)$rmse[1]


#With Na dataset and recommenderlab SVDF

results[2,2] <- missingval(jester_matrix,"withna","recommendersvdf",F)$rmse[1]


#Baseline predictors with Base SVD

results[3,2] <- missingval(jester_matrix,"baseline","basesvd",F)$rmse


#Global mean with Base SVD

results[4,2]  <- missingval(jester_matrix,"mean","basesvd",F)$rmse


#Global mean and centered data with Base SVD

results[5,2]  <- missingval(jester_matrix,"mean_center","basesvd",F)$rmse

results <- rename(results, RMSE = V2)

Summary

Below provided is the final RMSE result obtained after different calculations

results

##                                   Calculation      RMSE
## 1      With Na dataset and recommenderlab SVD 3.8144245
## 2     With Na dataset and recommenderlab SVDF 3.8144245
## 3           Baseline predictors with Base SVD 1.2585646
## 4                   Global mean with Base SVD 1.3928547
## 5 Global mean and centered data with Base SVD 0.9857861

Observations are as follows

From the current sample, it seems centered data with base SVD outperforms.
Threshold vary for different types of NA calculation.
In this dataset SVDF and SVD are showing providing similar results.
Recommendations can be provided from the output of these functions.