library(tidyverse)
library(readr)
library(sqldf)
library(dplyr)
library(tidyr)
library(tinytex)
library(recommenderlab)
library(kableExtra)
library(gridExtra)

Project 3 - Dimensionality Reduction

Question

The task is to implement a matrix factorization method - such as singular value decomposition (SVD) or Alternating Least Squares (ALS) - In the context of a recommender system.

Singular Value Decomposition

Singular Value Decomposition is a method that takes a rectangular matrix of rank k and reduces the dimensionality of that matrix into three component matrices from the original m x n matrix. the matrices produced by this function include:

Matrix U: m x n orthogonal matrix Matrix D: n x n diagonal matrix
Matrix V: n x n orthogonal matrix
One of the assumptions associated with SVD are that the set of data points lie in a space or subspace and if given an arbitrary dataset, you cnau determine what subspace best represents the data. Dimensionalirt reduction can be thought of as given a large number of rows and columns, every column represents a different a dimension, you can take this data and try to shrink the number of columns while still preserving what the data describes at some information retainment threshold. A Rank of the matrix is represented by the number of linearly independent columns of A as described in the assignment Youtube Video.

Why Reduce Dimensions?
There are many benefits to reducing the dimensionality of a dataset including:  - Discover hidden correlations/topics  - Information that can be described as a combination rather than independently  - Remove redundant and noisy features. not all words may be relevant in a tf-idf, or not all columns may be necessary in a user-item matrix  - Interpretation and Visualization  - Easier storage and processing of the data

Matrix factorization models are superior to classic nearest-neighbor techniques for producing recommendations.Commom recommendation engines used include:

Content filtering - recommendations can be generated based on information about a user or movie. e.g. user content could include age, sex, location etc. movie content could include actors, director, genre etc.

Collaborative filtering - recommendations can be generated based on the history of a user. for example if two users liked the same movies, a recommendation can be made to user A based on a movie user B enjoyed that user A has not yet seen. This method is usually more accurate than content filtering but suffers from cold starts when lack of user data is initially sparse.

Matrix factorization models takes both users and items and places this into a joint latent factor factor space of dimensionality “f” so that user-item interations are inner products of that space.

Singular value decomposition (SVD) is a common dimensionality reduction technique that identifies latent semantic factors for retrieving information. Some of the difficulties associated with this technique can be contributed to sparse data (missing values) often present in user-item matrices. Unfortunately filling huge matrices with missing values can often be expensive or even misleading. Regularization models (penalty-based error minimizing) can assist with this.

Example SVD

The SVD package in r computes the singular-value decomposition of a rectangular matrix. The arguments of the function are shown below where x is a numeric or complex matrix that is to be decomposed. nu represents the number of left singular vectors to be computed and nv computes the number of right singular vectors to be computed.

args(svd)
## function (x, nu = min(n, p), nv = min(n, p), LINPACK = FALSE) 
## NULL

We will begin by creating a 6 x 5 matrix with random values between 1 and 5 (mimicking that of a movie rating matrix) and then computing the SVD:

set.seed(123)

rand_mat<-as.matrix(data.frame(
      a = c(sample(c(1:5),5,replace = T)),
      b =  c(sample(c(1:5),5,replace = T)),
      c =  c(sample(c(1:5),5,replace = T)),
      d =  c(sample(c(1:5),5,replace = T)),
      e =  c(sample(c(1:5),5,replace = T)),
      f =  c(sample(c(1:5),5,replace = T))))

rand_mat %>% 
  kable() %>% 
  kable_styling(full_width = FALSE,
                position = "center",
                bootstrap_options = c("hover", "condense","responsive"))
a b c d e f
3 5 5 1 2 3
3 4 3 1 1 5
2 1 3 5 3 4
2 2 1 3 4 2
3 3 4 2 1 5
SVDMat<-svd(rand_mat)

SVDMat
## $d
## [1] 16.1252620  5.2686879  2.9405817  1.8664468  0.2936115
## 
## $u
##            [,1]       [,2]       [,3]        [,4]       [,5]
## [1,] -0.4989176 -0.4120380 -0.5548928  0.50786552 -0.1243887
## [2,] -0.4639223 -0.3438072  0.2136460 -0.62028326 -0.4859802
## [3,] -0.4422518  0.6480934  0.3382763  0.37214052 -0.3625865
## [4,] -0.3228541  0.5103294 -0.5794749 -0.46593859  0.2871222
## [5,] -0.4858353 -0.1776518  0.4429756  0.04164084  0.7310558
## 
## $v
##            [,1]        [,2]        [,3]       [,4]        [,5]
## [1,] -0.3644119 -0.09179620 -0.06026390 -0.2142718  0.71908725
## [2,] -0.4276360 -0.43646960 -0.48005018 -0.2017843 -0.54847591
## [3,] -0.4638254 -0.25577705  0.02507339  0.8012680  0.14884293
## [4,] -0.3171627  0.69472823  0.16923898  0.2323938 -0.33998312
## [5,] -0.2831436  0.50108542 -0.59724061 -0.1662218  0.19422872
## [6,] -0.5370614 -0.04372565  0.61640346 -0.4355499 -0.08136437

We can see that we produced our U D and V vectors. The product of these matrices will yield our original data.

Applying SVD to Recommender System

In the previous assignment, we determined that our User-User recommendation model using jaccard and pearsons methods provided the best results for making movie recommendations. We can take those same models and see how they compare when performing recommendation using matrix factorization via the single value decomposition method.

data_package <- data(package = "recommenderlab")
data("MovieLense")


recmodels<-recommenderRegistry$get_entries(dataType = "realRatingMatrix")
kable(names(recmodels)) %>%
  kable_styling(full_width = FALSE,
                position = "center",
                bootstrap_options = c("hover", "condense","responsive")) %>% 
  add_header_above(header = "Available Recommender Models")
Available Recommender Models
x
ALS_realRatingMatrix
ALS_implicit_realRatingMatrix
IBCF_realRatingMatrix
LIBMF_realRatingMatrix
POPULAR_realRatingMatrix
RANDOM_realRatingMatrix
RERECOMMEND_realRatingMatrix
SVD_realRatingMatrix
SVDF_realRatingMatrix
UBCF_realRatingMatrix

We will re-initiate the scripts for just the pearson and jaccard UBCF method and apply the SVD model to the same training and testing dataset. The ratings for the movies are taking from the MovieLense dataset and we will remove movies with less than 100 reviews, as well as users with less than 100 ratings to avoid having severely sparse data. Out rating threshold will remain as 3 (movies rated 3 or greater are considered positive), our training percentage will be an 80/20 split of the data and we will compute 10 recommendations per user for calculating RMSE. when calculating ROC curve, we will look at different quantities of recommendations produced by the models and see how the models perform.

ratings_movies <- MovieLense[rowCounts(MovieLense) > 100,  
                             colCounts(MovieLense) > 100] 

train_percent<-0.8
kept_items<-10
rating_threshold<-3
n_eval<-8
no_recommendations<-10


eval_sets <- evaluationScheme(data = ratings_movies, method = "cross-validation", 
                              train = train_percent, given = kept_items, goodRating = rating_threshold, k = n_eval)


#used to train
recc_train<- getData(eval_sets,'train')
#used to predict
recc_known<-getData(eval_sets,'known')
#used to test
recc_unknown<-getData(eval_sets,'unknown')
UBCF_eval<-Recommender(recc_train,method = "UBCF", parameter = list(method = "pearson"))
SVD_eval<-Recommender(recc_train,method = "SVD")
ALS_eval<-Recommender(recc_train,method = "ALS")

recommendations<-function(eval){
  recc_predicted<-predict(object = eval,newdata=recc_known,n=no_recommendations)

recc_matrix <- sapply(recc_predicted@items, function(x){
  colnames(ratings_movies)[x]
})

number_of_items<-recc_matrix %>% unlist() %>% table() %>% as.data.frame()

table_top <- data.frame("names.number_of_items_top." = number_of_items$.,  
                        "number_of_items_top"= number_of_items$Freq)


table_top %>%
  top_n(20) %>% 
  ggplot(mapping = aes(x=fct_reorder(names.number_of_items_top.,-as.numeric(number_of_items_top)), y = as.numeric(number_of_items_top)))+
  geom_col(aes(fill = as.numeric(number_of_items_top)),color = 'black', alpha = 0.5)+
  theme(axis.text.x = element_text(angle = 90),
        legend.position = 'none',
        panel.grid = element_blank(),
        panel.background = element_blank())+
  labs(x = "Movie Name",
       y = "Number of recommendations",
       title = "Top 20 movie Recomendations")
}

recommendations(UBCF_eval)

recommendations(SVD_eval)

ModelErrors<-function(eval){
  
    recc_predicted<-predict(object = eval,newdata=recc_known,n=no_recommendations, type= "ratings")
    calcPredictionAccuracy(recc_predicted,recc_unknown, byUser = F)

}

rbind(
  UBCF= ModelErrors(UBCF_eval),
  SVD= ModelErrors(SVD_eval),
  ALS= ModelErrors(ALS_eval))
##           RMSE       MSE       MAE
## UBCF 0.9547423 0.9115328 0.7523733
## SVD  0.9862475 0.9726841 0.7806139
## ALS  0.9102107 0.8284835 0.7193582

We can see that our model shows ALS being the least erroneous method, and SVD does not perform better than UBCF pearson in this case.
Finally, We can look at the ROC curve and tune the models on various normalization techniques to see if this provides and adjustments to model accuracy. The normalization techniques include “center” where the data is adjusted so that the mean of each user is 0. “Z-Score” normalization allows the dataset to be adjusted by measuring the distance (number of standard deviations) a rating is from the mean for a given user.
Our ROC curve below indicates that the most accurate recommender system from the models evaluated are our Alternate Least Squares model (ALS) both with center and z-score normalization, and our our UBCF methods depending on the number of items recommended.

models_to_evaluate <- list(
  UBCF_pearson = list(name = "UBCF", param = list(method =  
                                                    "pearson")), 
  UBCF_jaccard = list(name = "UBCF", param = list(method =  
                                                    "jaccard")), 
  SVD_Z = list(name = "SVD", param = list(normalize = "Z-score")),
  SVD_C = list(name = "SVD", param = list(normalize = "center")),
  
  ALS_N = list(name = "ALS", param = list(normalize = NULL)),
  ALS_C = list(name = "ALS", param = list(normalize = "center")),
  ALS_Z = list(name = "ALS", param = list(normalize = "Z-score")),
  random = list(name = "RANDOM", param=NULL)
)


n_recommendations <- c(1, 5, seq(10, 100, 10))

list_results <- evaluate(x = eval_sets, method = models_to_evaluate, n= n_recommendations)
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.06sec] 
##   2  [0sec/0.06sec] 
##   3  [0sec/0.05sec] 
##   4  [0sec/0.06sec] 
##   5  [0sec/0.06sec] 
##   6  [0sec/0.17sec] 
##   7  [0sec/0.04sec] 
##   8  [0sec/0.06sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.05sec] 
##   2  [0sec/0.05sec] 
##   3  [0sec/0.05sec] 
##   4  [0sec/0.05sec] 
##   5  [0sec/0.05sec] 
##   6  [0sec/0.05sec] 
##   7  [0sec/0.05sec] 
##   8  [0.01sec/0.05sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.04sec/0.01sec] 
##   2  [0sec/0.01sec] 
##   3  [0.02sec/0.01sec] 
##   4  [0.03sec/0.02sec] 
##   5  [0.01sec/0.02sec] 
##   6  [0.03sec/0.01sec] 
##   7  [0.02sec/0.01sec] 
##   8  [0.03sec/0sec] 
## SVD run fold/sample [model time/prediction time]
##   1  [0.02sec/0.01sec] 
##   2  [0.02sec/0.01sec] 
##   3  [0.01sec/0.02sec] 
##   4  [0.01sec/0.02sec] 
##   5  [0.01sec/0sec] 
##   6  [0.01sec/0sec] 
##   7  [0.02sec/0.01sec] 
##   8  [0.02sec/0sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.54sec] 
##   2  [0sec/5.3sec] 
##   3  [0sec/5.57sec] 
##   4  [0sec/5.43sec] 
##   5  [0sec/5.42sec] 
##   6  [0sec/5.88sec] 
##   7  [0sec/5.51sec] 
##   8  [0sec/5.99sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.81sec] 
##   2  [0sec/5.97sec] 
##   3  [0.02sec/5.88sec] 
##   4  [0sec/5.46sec] 
##   5  [0sec/5.39sec] 
##   6  [0sec/5.63sec] 
##   7  [0sec/5.89sec] 
##   8  [0sec/5.81sec] 
## ALS run fold/sample [model time/prediction time]
##   1  [0sec/5.43sec] 
##   2  [0sec/5.28sec] 
##   3  [0sec/5.71sec] 
##   4  [0sec/5.67sec] 
##   5  [0sec/5.48sec] 
##   6  [0sec/5.38sec] 
##   7  [0sec/5.54sec] 
##   8  [0sec/5.48sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.02sec] 
##   2  [0sec/0sec] 
##   3  [0sec/0.01sec] 
##   4  [0sec/0.02sec] 
##   5  [0sec/0sec] 
##   6  [0sec/0.02sec] 
##   7  [0sec/0.01sec] 
##   8  [0sec/0.02sec]
avg_matrices <- lapply(list_results, avg)


plot(list_results, annotate = 1, legend = "topleft") 
title("ROC curve")