DATA 612 Project 3 - Matrix Factorization Methods
Instruction
In this assignment, we will implement a matrix factorization method, such as singular value decomposition (SVD) or Alternating Least Squares (ALS), in the context of a recommender system, starting from the recommender system we built ourselves in a previous assignment.
Introduction
In this project, we will implement a matrix factorization method, singular value decomposition (SVD), in the context of a recommender system, and compare its performance with the User-Based Collaborative Filtering (UBCF) and Item-Based Collaborative Filtering (IBCF) models from our previous assignment, which were evaluated with different approaches.
From the brief explanation of SVD by Stanford University [https://youtu.be/yLdOS6xyM_Q]: in linear algebra, the singular value decomposition (SVD) is a factorization of a matrix that generalizes the eigendecomposition of a square normal matrix to any \(m \times n\) matrix via an extension of the polar decomposition; it requires an input matrix with no missing values. The singular value decomposition of an \(m \times n\) matrix \(A\) is a factorization of the form \(A = U \Sigma V^T\), where \(U\) is an \(m \times r\) orthogonal matrix, \(\Sigma\) is an \(r \times r\) diagonal matrix with the singular values \(\sigma_{ii}\) in descending order, and \(V\) is an \(n \times r\) orthogonal matrix, with \(r \le \min(m,n)\) the rank of the original matrix \(A\).
From the dimensionality reduction introduced by Stanford University [https://youtu.be/c7e-D2tmRE0], SVD gives the best low-rank approximation matrix \(B = USV^T\) of matrix \(A\) by keeping 80% to 90% of the sum of the squares of the singular values (\(\sum \sigma_{ii}^2\)), where \(S\) is an \(r \times r\) diagonal matrix with diagonal values \(s_{ii} = \sigma_{ii}\) for \(i = 1, \dots, k\) and \(s_{ii} = 0\) otherwise, which implies the rule \[\frac {\sum_{i=1}^{k}\sigma_{ii}^2} {\sum_{i=1}^{r} \sigma_{ii}^2} \approx 0.9.\]
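As a quick illustration of both ideas, the sketch below (our own toy example, using a small random matrix rather than the project data) computes an SVD with R's base svd function, verifies the exact reconstruction \(A = U \Sigma V^T\), and computes the energy ratio retained by the first \(k\) singular values:
set.seed(612)
#a small toy matrix standing in for a rating matrix
A_toy <- matrix(rnorm(12), nrow = 4, ncol = 3)
svd_toy <- svd(A_toy)
#reconstruction error; should be numerically zero
max(abs(A_toy - svd_toy$u %*% diag(svd_toy$d) %*% t(svd_toy$v)))
#energy ratio retained by the first k = 2 singular values
sum(svd_toy$d[1:2]^2) / sum(svd_toy$d^2)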
We will use one of the Jester datasets for Recommender Systems and Collaborative Filtering Research by Ken Goldberg, AUTOLab, UC Berkeley [http://eigentaste.berkeley.edu/dataset/] as our raw dataset.
Load Packages
library(recommenderlab)
library(tidyverse)
library(kableExtra)
library(formattable)
library(caTools)
library(grid)
library(knitr)
library(Matrix)
Read Data
Only the first five rows of data are displayed in order to compress the size of the output file.
data <- read_csv('https://raw.githubusercontent.com/oggyluky11/DATA-612-2020-SUMMER/master/Project%203/jester_data_1_3.csv', col_names = FALSE)
colnames(data) <- c('User_Rating_Cnt',str_c('J', c(1:100)))
head(data, 5)
Data Exploration
The dataset is a matrix with dimensions 24,938 rows x 101 columns. It contains ratings from 24,938 users who have each rated between 15 and 35 jokes, with ratings ranging from -10.00 to +10.00 and the value '99' corresponding to a null (missing) rating. One row represents one user; the first column gives the number of jokes rated by that user, and the next 100 columns give the ratings for jokes 01-100. As the dataset is relatively large, we will randomly select 10,000 rows from the original dataset for our studies below. The first 5 rows are displayed.
set.seed(1)
# data cleaning after randomly selecting 10,000 rows from the raw data
ui_mtrx <- data %>%
.[sample(nrow(data),10000),] %>%
#add row number as User_ID
mutate(User_ID = row_number()) %>%
#reshape to long format, drop the '99' placeholder ratings,
#then spread back to wide format so they become NA
gather(key = 'Joke', value = 'Rating', -User_ID,-User_Rating_Cnt) %>%
filter(Rating != 99) %>%
spread(key = 'Joke', value = 'Rating')
# remove old data to clean up memory
rm(data)
head(ui_mtrx,5) #our original dataset for this project
After replacing the '99' values with 'NA', we compute the summary statistics of User_Rating_Cnt and Rating with the 'NA' values eliminated. From the summary statistics shown below, we can see that the users have rated about 26.02 jokes on average, with range (15, 35), and that the ratings are in the range (-9.95, 10) with mean 0.2752.
ui_mtrx_long <- ui_mtrx %>%
gather(key = 'Joke', value = 'Rating', -User_ID, -User_Rating_Cnt, na.rm = TRUE)
# summary statistics
ui_mtrx_long %>% select(User_Rating_Cnt, Rating) %>% summary()
## User_Rating_Cnt Rating
## Min. :15.00 Min. :-9.9500
## 1st Qu.:21.00 1st Qu.:-4.1700
## Median :26.00 Median : 0.8300
## Mean :26.02 Mean : 0.2752
## 3rd Qu.:31.00 3rd Qu.: 4.7600
## Max. :35.00 Max. :10.0000
The histogram below shows the distribution of the ratings in our dataset; the mode rating is -0.29.
# Histogram
mode <- ui_mtrx_long %>%
group_by(Rating) %>%
summarise(Count = n()) %>%
top_n(1, Count) %>%
select(Rating) %>%
as.numeric()
ui_mtrx_long %>%
ggplot(aes(x = Rating, col = ..count..)) +
geom_bar() +
labs(title="Distribution of Ratings of Original Dataset") +
theme(plot.title = element_text(hjust=0.5)) +
annotation_custom(grobTree(textGrob(str_c('<-- Mode = ', mode %>% as.character()), x= 0.5, y = 0.85, hjust = 0)))
Handle the Missing Values
As singular value decomposition (SVD) requires an input with no missing values, we will first handle the missing values in the dataset with baseline predictors.
By calculating the overall mean, the user biases, and the item biases, we can fill in the missing values with baseline predictors and obtain a complete dataset. The first 5 rows are displayed.
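Concretely, for user \(u\) and item \(i\) the baseline predictor computed below is \[b_{ui} = \mu + b_u + b_i,\] where \(\mu\) is the mean of all non-null ratings, \(b_u = \bar r_u - \mu\) is the user bias, and \(b_i = \bar r_i - \mu\) is the item bias; the result is clipped to the rating range \([-10, 10]\).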
# create a duplicate dataset as template for baseline predictors
ui_mtrx_bp <- ui_mtrx
# mean of all non-null ratings
mean_raw <- mean(ui_mtrx_long$Rating, na.rm = TRUE)
# create a 10000x100 matrix of raw mean
mean_raw_mtrx <-matrix(mean_raw,nrow = 10000, ncol = 100)
# user biases
user_bias_tb <- ui_mtrx_long %>%
group_by(User_ID) %>%
summarise(User_Bias = mean(Rating) - mean_raw)
# create a 10000X100 matrix of user biases
user_bias_mtrx <- replicate(100,user_bias_tb$User_Bias)
# item biases
item_bias_tb <- ui_mtrx_long %>%
group_by(Joke) %>%
summarise(Item_Bias = mean(Rating) - mean_raw)
# create a 10000x100 matrix of item biases
item_bias_mtrx <- t(replicate(10000,item_bias_tb$Item_Bias))
# create a temporary baseline predictor matrix, clipped to the rating range [-10,10]
temp_mtrx_bp <- pmax(pmin(mean_raw_mtrx + user_bias_mtrx + item_bias_mtrx,10),-10) %>%
data.frame()
colnames(temp_mtrx_bp) <- colnames(ui_mtrx)[-(1:2)]
temp_mtrx_bp <- temp_mtrx_bp %>%
cbind(ui_mtrx[c(1,2)]) %>%
select(User_Rating_Cnt,User_ID, everything())
# fill in missing values: keep observed ratings, use the baseline predictor where a rating is NA
for(i in 3:102){
vect_a <- ui_mtrx_bp[i] %>% unlist()
vect_b <- temp_mtrx_bp[i] %>% unlist()
ui_mtrx_bp[i] <- ifelse(is.na(vect_a),vect_b, vect_a)
}
head(ui_mtrx_bp, 5)
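As a quick sanity check (added here as a sketch; not part of the original modelling steps), the imputed dataset should contain no remaining missing values:
#count remaining NA values in the 100 joke-rating columns; expected to be 0
sum(is.na(ui_mtrx_bp[3:102]))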
The summary statistics of the ratings in our baseline-predictor dataset are shown below.
ui_mtrx_bp_long <- ui_mtrx_bp %>%
gather(key = 'Joke', value = 'Rating', -User_ID, -User_Rating_Cnt, na.rm = TRUE)
#Summary statistics
ui_mtrx_bp_long %>% select(User_Rating_Cnt, Rating) %>% summary()
## User_Rating_Cnt Rating
## Min. :15.00 Min. :-10.000
## 1st Qu.:20.00 1st Qu.: -2.115
## Median :24.00 Median : 0.490
## Mean :24.71 Mean : 0.228
## 3rd Qu.:29.00 3rd Qu.: 2.849
## Max. :35.00 Max. : 10.000
The boxplots below compare the user ratings from the original dataset (NA values omitted) with the new dataset (NA values replaced by baseline predictors); the interquartile range of the new dataset has narrowed relative to the original. Also, judging from the histograms of user ratings, the distribution of the new dataset is closer to normal than that of the original dataset.
#boxplots comparing
par(mfrow = c(1,2))
boxplot(ui_mtrx_long["Rating"], main="Original Ratings", col="lightyellow")
boxplot(ui_mtrx_bp_long["Rating"], main="NA Ratings replaced \nby Baseline Predictors", col="lightblue")
#histograms comparing
par(mfrow = c(2,1))
hist(ui_mtrx_long$Rating, main="Original Ratings", col="lightyellow")
hist(ui_mtrx_bp_long$Rating, main="NA Ratings replaced by Baseline Predictors", col="lightblue")
Singular Value Decomposition (SVD)
The singular value decomposition (SVD) of an \(m \times n\) matrix \(A\) is a factorization of the form \(A = U \Sigma V^T\), where \(U\) is an \(m \times r\) orthogonal matrix, \(\Sigma\) is an \(r \times r\) diagonal matrix with the singular values \(\sigma_{ii}\) in descending order, and \(V\) is an \(n \times r\) orthogonal matrix, with \(r \le \min(m,n)\) the rank of the original matrix \(A\). In order to apply SVD to our dataset, we will first convert the dataset to matrix format.
#define a matrix
ui_mtrx_bp_rr <- ui_mtrx_bp %>%
select(-User_Rating_Cnt, -User_ID) %>%
as.matrix()
#get the rank of our matrix
r <- qr(ui_mtrx_bp_rr)$rank
print(str_c('The rank r of our matrix is: ', as.character(r)))
## [1] "The rank r of our matrix is: 100"
As the rank 100 equals our number of columns, we have \(m = 10,000\) and \(r = n = 100\) for our matrix \(A = U \Sigma V^T\), where \(U\) is an \(m \times r\) orthogonal matrix, \(\Sigma\) is an \(n \times n\) diagonal matrix with the \(\sigma_{ii}\) values in descending order, and \(V\) is an \(n \times n\) orthogonal matrix.
We can then calculate the SVD using the svd function. The singular values \(\sigma_{ii}\) are plotted in the scatterplot below, which shows that they are in descending order.
#calculate svd
ui_mtrx_svd <- svd(ui_mtrx_bp_rr)
#plot the singular values
plot(ui_mtrx_svd$d, main="Singular Values", xlab=expression(paste("Component ", sigma)), ylab="Singular Value", pch=19, col='deeppink')
Dimensionality Reduction
Find k
From the dimensionality reduction introduced by Stanford University [https://youtu.be/c7e-D2tmRE0], SVD gives the best low-rank approximation matrix \(B = USV^T\) of matrix \(A\) by keeping 80% to 90% of the sum of the squares of the singular values (\(\sum \sigma_{ii}^2\)), where \(S\) is an \(r \times r\) diagonal matrix with diagonal values \(s_{ii} = \sigma_{ii}\) for \(i = 1, \dots, k\) and \(s_{ii} = 0\) otherwise, which implies the rule \[\frac {\sum_{i=1}^{k}\sigma_{ii}^2} {\sum_{i=1}^{r} \sigma_{ii}^2} \approx 0.9\]
By calculation, the value \(k\) that keeps approximately 90% of the sum of the squares of the singular values is 30. The scatterplot below shows how the retained-energy ratio changes as \(k\) increases.
#find the sum of squares of all singular values
sum_sq_sv_all <- sum(ui_mtrx_svd$d^2)
#find the value k that satisfy the formula
dim_red <- NULL
k <- 0
for (i in 1:length(ui_mtrx_svd$d)){
dim_red[i] <- ((sum(ui_mtrx_svd$d[1:i]^2))/sum_sq_sv_all)
if (((sum(ui_mtrx_svd$d[1:i]^2))/sum_sq_sv_all) >= 0.9 && k==0){
k=i
}
}
if (abs((sum(ui_mtrx_svd$d[1:k]^2))/sum_sq_sv_all-0.9)>abs((sum(ui_mtrx_svd$d[1:(k-1)]^2))/sum_sq_sv_all-0.9)){
k=k-1
}
print(str_c('The value k is: ', as.character(k)))
## [1] "The value k is: 30"
#plot the ratio of the sum of the squares of the first k singular values over that of all singular values
plot(dim_red, main="Ratio of Squares of Singular Values vs k", xlab="k", ylab="Ratio of k to all Squares of Singular Values", pch=19, col='deeppink')
abline(h=0.9, v=k)
Reduce Matrices' Dimensionality
As we have the value \(k = 30\) by calculation, we will reduce the dimensionality of \(A = U \Sigma V^T\) by keeping only the first \(k\) columns of \(U\), the first \(k\) singular values of \(\Sigma\), and the first \(k\) rows of \(V^T\), producing \(B = USV^T\).
#reduce dimensionality of U
u_k <- ui_mtrx_svd$u[,1:k]
#reduce dimensionality of S
s_k <- Diagonal(n=k, x=ui_mtrx_svd$d[1:k])
#reduce dimensionality of V^T
v_t_k <- t(ui_mtrx_svd$v)[1:k,]
Best Low Rank Approximation
From the above section, we used SVD and dimensionality reduction to find the best rank-k approximation \(B = USV^T\) to \(A\), with rank(\(B\)) = k. In this part, we will generate the matrix \(B\) using the matrices u_k, s_k, and v_t_k. The dimensions of matrix \(B\) should match those of matrix \(A\): 10,000 rows x 100 columns.
ui_mtrx_svd_approx <- u_k %*% s_k %*% v_t_k %>%
as.matrix()
# limit the range of ratings to [-10,10]
ui_mtrx_svd_approx <- pmax(pmin(ui_mtrx_svd_approx,10),-10)
dim(ui_mtrx_svd_approx)
## [1] 10000 100
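We can also quantify how far the rank-k approximation departs from the imputed matrix (a check we add as a sketch; note this is the in-sample approximation error, not a measure of predictive accuracy):
#RMSE between the rank-k approximation B and the baseline-imputed matrix A
sqrt(mean((ui_mtrx_bp_rr - ui_mtrx_svd_approx)^2))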
Let's see some summary statistics and the distribution of the predicted matrix \(B\), and compare them with our original dataset (NA values omitted) and the dataset with NA values replaced by baseline predictors.
ui_mtrx_svd_approx_long <- ui_mtrx_svd_approx %>%
data.frame() %>%
gather(key = 'Joke', value = 'Rating', na.rm = TRUE)
#Summary statistics
ui_mtrx_long %>%
select(Rating) %>%
summary() %>%
data.frame() %>%
select(Freq) %>%
separate(Freq,into = c('Metrics', 'Value'),':') %>%
cbind(ui_mtrx_bp_long %>%
select(Rating) %>%
summary() %>%
data.frame() %>%
select(Freq) %>%
separate(Freq,into = c('Metrics', 'Value'),':')) %>%
cbind(ui_mtrx_svd_approx_long %>%
select(Rating) %>%
summary() %>%
data.frame() %>%
select(Freq) %>%
separate(Freq,into = c('Metrics', 'Value'),':')) %>%
kable() %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) %>%
add_header_above(c('Original Dataset' = 2, 'Dataset with Baseline Predictors' = 2, 'Predicted Matrix B' = 2)) %>%
add_header_above(c('Summary Statistics' = 6))
Metrics | Original Dataset | Dataset with Baseline Predictors | Predicted Matrix B |
---|---|---|---|
Min. | -9.9500 | -10.000 | -10.0000 |
1st Qu. | -4.1700 | -2.115 | -1.9893 |
Median | 0.8300 | 0.490 | 0.4607 |
Mean | 0.2752 | 0.228 | 0.2295 |
3rd Qu. | 4.7600 | 2.849 | 2.6916 |
Max. | 10.0000 | 10.000 | 10.0000 |
From the summary statistics and the boxplots below, it is clear that the interquartile range of the predicted matrix \(B\) has narrowed relative to the previous two datasets. Also, the distribution of the new dataset is similar to that of the baseline-predictor dataset, but slightly closer to normal.
#boxplots comparing
par(mfrow = c(1,3))
boxplot(ui_mtrx_long["Rating"], main="Original Ratings", col="lightyellow")
boxplot(ui_mtrx_bp_long["Rating"], main="NA Ratings replaced \nby Baseline Predictors", col="lightblue")
boxplot(ui_mtrx_svd_approx_long["Rating"], main="Predicted Matrix with \nSVD Dimensionality Reduction", col="deeppink")
#histograms comparing
par(mfrow = c(3,1))
hist(ui_mtrx_long$Rating, main="Original Ratings", col="lightyellow")
hist(ui_mtrx_bp_long$Rating, main="NA Ratings replaced by Baseline Predictors", col="lightblue")
hist(ui_mtrx_svd_approx_long$Rating, main="Predicted Matrix with SVD Dimensionality Reduction", col="deeppink")
SVD Accuracy Evaluation
Finally, we compare the predicted matrix from SVD dimensionality reduction against our original dataset, i.e. the sparse one that still contains NA values.
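For realRatingMatrix objects, calcPredictionAccuracy compares the two matrices only over the cells that are observed in the original dataset, reporting \[RMSE = \sqrt{\frac{1}{|R|}\sum_{(u,i) \in R}(\hat r_{ui} - r_{ui})^2},\] together with \(MSE = RMSE^2\) and the corresponding mean absolute error \(MAE\), where \(R\) is the set of observed ratings and \(\hat r_{ui}\) is the predicted rating.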
#original dataset
ui_mtrx_rr <- ui_mtrx %>%
select(-User_Rating_Cnt, -User_ID) %>%
as.matrix() %>%
as("realRatingMatrix")
#SVD dataset
ui_mtrx_svd_approx_rr <- ui_mtrx_svd_approx %>%
as.matrix() %>%
as("realRatingMatrix")
#SVD dataset accuracy
SVD_acc <- calcPredictionAccuracy(ui_mtrx_svd_approx_rr, ui_mtrx_rr)
SVD_acc_Metrics <- rbind('SVD_acc' = SVD_acc) %>%
data.frame() %>%
rownames_to_column('Model')
SVD_acc_Metrics %>%
mutate_if(is.numeric, ~round(.,6)) %>%
mutate(RMSE = cell_spec(RMSE, bold = ifelse(RMSE == min(RMSE),TRUE,FALSE)),
MSE = cell_spec(MSE, bold = ifelse(MSE == min(MSE),TRUE,FALSE)),
MAE = cell_spec(MAE, bold = ifelse(MAE == min(MAE),TRUE,FALSE))
) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) %>%
add_header_above(c('SVD-Dimensionality-Reduction Model' = 4))
Model | RMSE | MSE | MAE |
---|---|---|---|
SVD_acc | 2.371124 | 5.622229 | 0.960098 |
Build Recommendation Models with UBCF and IBCF
Using a 5-fold cross-validation scheme to split our original dataset into train and test sets, we will implement the User-Based Collaborative Filtering (UBCF) and Item-Based Collaborative Filtering (IBCF) algorithms on the original dataset. We will also use different normalization techniques (centering and Z-score) and similarity measures (Cosine distance, Pearson correlation, and Euclidean distance).
ui_mtrx_split <- evaluationScheme(data=ui_mtrx_rr, method='cross-validation', k = 5,
given=15, goodRating=0)
ui_mtrx_train <- getData(ui_mtrx_split, 'train')
ui_mtrx_test_known <- getData(ui_mtrx_split, 'known')
ui_mtrx_test_unknown <- getData(ui_mtrx_split, 'unknown')
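To see how the scheme partitions the data (a quick inspection we add here), each fold holds out a fifth of the users; for those users, given = 15 ratings per user are revealed as known input, and their remaining ratings are kept unknown for evaluation:
#dimensions of the train set and of the known/unknown parts of the test set
dim(ui_mtrx_train)
dim(ui_mtrx_test_known)
dim(ui_mtrx_test_unknown)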
User-Based Collaborative Filtering Models
We will create 6 User-Based Collaborative Filtering models by using the Recommender function from the recommenderlab package, with two normalization techniques (center and Z-score) and three similarity measures (Cosine distance, Pearson correlation, and Euclidean distance).
After restricting the predicted ratings to the boundary (-10, 10), we calculate the accuracy of the predictions against the actual ratings given by users. The results are sorted by RMSE in ascending order.
#UBCF models
model_UBCF_CC <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "center", method="Cosine"))
model_UBCF_CP <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "center", method="Pearson"))
model_UBCF_CE <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "center", method="Euclidean"))
model_UBCF_ZC <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "Z-score", method="Cosine"))
model_UBCF_ZP <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "Z-score", method="Pearson"))
model_UBCF_ZE <- Recommender(data = ui_mtrx_train, method = 'UBCF', parameter = list(normalize = "Z-score", method="Euclidean"))
suppress_rating <- function(x, min = -10, max = 10){
return(pmax(pmin(x, max), min))
}
#predictions with rating boundaries enforced via suppress_rating()
p_UBCF_CC <- predict(model_UBCF_CC, ui_mtrx_test_known, type='ratings')
p_UBCF_CC@data@x <- suppress_rating(p_UBCF_CC@data@x)
p_UBCF_CP <- predict(model_UBCF_CP, ui_mtrx_test_known, type='ratings')
p_UBCF_CP@data@x <- suppress_rating(p_UBCF_CP@data@x)
p_UBCF_CE <- predict(model_UBCF_CE, ui_mtrx_test_known, type='ratings')
p_UBCF_CE@data@x <- suppress_rating(p_UBCF_CE@data@x)
p_UBCF_ZC <- predict(model_UBCF_ZC, ui_mtrx_test_known, type='ratings')
p_UBCF_ZC@data@x <- suppress_rating(p_UBCF_ZC@data@x)
p_UBCF_ZP <- predict(model_UBCF_ZP, ui_mtrx_test_known, type='ratings')
p_UBCF_ZP@data@x <- suppress_rating(p_UBCF_ZP@data@x)
p_UBCF_ZE <- predict(model_UBCF_ZE, ui_mtrx_test_known, type='ratings')
p_UBCF_ZE@data@x <- suppress_rating(p_UBCF_ZE@data@x)
#accuracies
UBCF_Model_Metrics <- rbind(
'UBCF_CC' = calcPredictionAccuracy(p_UBCF_CC, ui_mtrx_test_unknown),
'UBCF_CP' = calcPredictionAccuracy(p_UBCF_CP, ui_mtrx_test_unknown),
'UBCF_CE' = calcPredictionAccuracy(p_UBCF_CE, ui_mtrx_test_unknown),
'UBCF_ZC' = calcPredictionAccuracy(p_UBCF_ZC, ui_mtrx_test_unknown),
'UBCF_ZP' = calcPredictionAccuracy(p_UBCF_ZP, ui_mtrx_test_unknown),
'UBCF_ZE' = calcPredictionAccuracy(p_UBCF_ZE, ui_mtrx_test_unknown)
) %>%
data.frame() %>%
rownames_to_column('Model') %>%
arrange(RMSE)
UBCF_Model_Metrics %>%
mutate_if(is.numeric, ~round(.,6)) %>%
mutate(RMSE = cell_spec(RMSE, bold = ifelse(RMSE == min(RMSE),TRUE,FALSE)),
MSE = cell_spec(MSE, bold = ifelse(MSE == min(MSE),TRUE,FALSE)),
MAE = cell_spec(MAE, bold = ifelse(MAE == min(MAE),TRUE,FALSE))
) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) %>%
add_header_above(c('Comparison of User-Based Collaborative Filtering Models' = 4))
Model | RMSE | MSE | MAE |
---|---|---|---|
UBCF_ZE | 4.611506 | 21.265991 | 3.644787 |
UBCF_CE | 4.617589 | 21.322128 | 3.668172 |
UBCF_ZP | 4.643276 | 21.560016 | 3.687265 |
UBCF_CP | 4.652924 | 21.649697 | 3.71054 |
UBCF_ZC | 4.668218 | 21.792262 | 3.703087 |
UBCF_CC | 4.680375 | 21.905906 | 3.725229 |
Item-Based Collaborative Filtering Models
We will then create 6 Item-Based Collaborative Filtering models with the same method: using the Recommender function from the recommenderlab package, with two normalization techniques (center and Z-score) and three similarity measures (Cosine distance, Pearson correlation, and Euclidean distance).
After restricting the predicted ratings to the boundary (-10, 10), we calculate the accuracy of the predictions against the actual ratings given by users. The results are sorted by RMSE in ascending order.
#IBCF models
model_IBCF_CC <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "center", method="Cosine"))
model_IBCF_CP <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "center", method="Pearson"))
model_IBCF_CE <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "center", method="Euclidean"))
model_IBCF_ZC <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "Z-score", method="Cosine"))
model_IBCF_ZP <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "Z-score", method="Pearson"))
model_IBCF_ZE <- Recommender(data = ui_mtrx_train, method = 'IBCF', parameter = list(normalize = "Z-score", method="Euclidean"))
#predictions with rating boundaries enforced via suppress_rating()
p_IBCF_CC <- predict(model_IBCF_CC, ui_mtrx_test_known, type='ratings')
p_IBCF_CC@data@x <- suppress_rating(p_IBCF_CC@data@x)
p_IBCF_CP <- predict(model_IBCF_CP, ui_mtrx_test_known, type='ratings')
p_IBCF_CP@data@x <- suppress_rating(p_IBCF_CP@data@x)
p_IBCF_CE <- predict(model_IBCF_CE, ui_mtrx_test_known, type='ratings')
p_IBCF_CE@data@x <- suppress_rating(p_IBCF_CE@data@x)
p_IBCF_ZC <- predict(model_IBCF_ZC, ui_mtrx_test_known, type='ratings')
p_IBCF_ZC@data@x <- suppress_rating(p_IBCF_ZC@data@x)
p_IBCF_ZP <- predict(model_IBCF_ZP, ui_mtrx_test_known, type='ratings')
p_IBCF_ZP@data@x <- suppress_rating(p_IBCF_ZP@data@x)
p_IBCF_ZE <- predict(model_IBCF_ZE, ui_mtrx_test_known, type='ratings')
p_IBCF_ZE@data@x <- suppress_rating(p_IBCF_ZE@data@x)
#accuracies
IBCF_Model_Metrics <- rbind(
'IBCF_CC' = calcPredictionAccuracy(p_IBCF_CC, ui_mtrx_test_unknown),
'IBCF_CP' = calcPredictionAccuracy(p_IBCF_CP, ui_mtrx_test_unknown),
'IBCF_CE' = calcPredictionAccuracy(p_IBCF_CE, ui_mtrx_test_unknown),
'IBCF_ZC' = calcPredictionAccuracy(p_IBCF_ZC, ui_mtrx_test_unknown),
'IBCF_ZP' = calcPredictionAccuracy(p_IBCF_ZP, ui_mtrx_test_unknown),
'IBCF_ZE' = calcPredictionAccuracy(p_IBCF_ZE, ui_mtrx_test_unknown)
) %>%
data.frame() %>%
rownames_to_column('Model') %>%
arrange(RMSE)
IBCF_Model_Metrics %>%
mutate_if(is.numeric, ~round(.,6)) %>%
mutate(RMSE = cell_spec(RMSE, bold = ifelse(RMSE == min(RMSE),TRUE,FALSE)),
MSE = cell_spec(MSE, bold = ifelse(MSE == min(MSE),TRUE,FALSE)),
MAE = cell_spec(MAE, bold = ifelse(MAE == min(MAE),TRUE,FALSE))
) %>%
kable(escape = FALSE) %>%
kable_styling(bootstrap_options = c('striped','bordered'), full_width = FALSE) %>%
add_header_above(c('Comparison of Item-Based Collaborative Filtering Models' = 4))
Model | RMSE | MSE | MAE |
---|---|---|---|
IBCF_ZE | 4.800295 | 23.042836 | 3.608947 |
IBCF_CP | 4.806898 | 23.106268 | 3.645505 |
IBCF_CE | 4.832951 | 23.357417 | 3.634166 |
IBCF_ZP | 4.854067 | 23.561966 | 3.674543 |
IBCF_ZC | 5.77074 | 33.301436 | 4.630818 |
IBCF_CC | 5.783684 | 33.451001 | 4.638247 |
Summary
The barplot below compares the accuracies of our SVD model, the 6 UBCF models, and the 6 IBCF models, sorted by RMSE in ascending order. The lower the RMSE value, the better the performance of the model. Although the UBCF models generally perform better than the IBCF models, our SVD-dimensionality-reduction model (with NA values replaced by baseline predictors) outperforms all 12 of them, which suggests that the SVD-dimensionality-reduction recommender system provides more accurate user rating predictions than the collaborative filtering models. One caveat: the SVD model was scored against the full original matrix rather than under the same cross-validation scheme, so the comparison is indicative rather than strictly like-for-like.
SVD_acc_Metrics %>%
rbind(UBCF_Model_Metrics) %>%
rbind(IBCF_Model_Metrics) %>%
select(Model, RMSE) %>%
ggplot(aes(x=reorder(Model, -RMSE), y=RMSE, fill=RMSE)) +
geom_text(aes(label=round(RMSE,4), hjust = 'left'))+
geom_bar(stat='identity') +
coord_flip()+
ylim(0,6)+
scale_fill_gradient(low = 'deeppink1', high = 'deeppink4') +
labs(title = 'RMSE Comparison of All Models',
x = 'MODEL',
y = 'RMSE')
#clean up memory
rm(list = ls())