Proj3 MovieLense Singular

The goal of this assignment is practicing working with Matrix Factorization techniques.

Task is implement a matrix factorization method—such as singular value decomposition (SVD) or Alt ernating Least Squares (ALS)—in the context of a recommender system.

SVD can be thought of as a pre-processing step for feature engineering.

Notes/Limitations: • SVD builds features that may or may not map neatly to items (such as movie genres or news topics). As in many areas of machine learning, the lack of explainability can be an issue). • SVD requires that there are no missing values.

There are various ways to handle this, including (1) imputation of missing values, (2) mean-centering values around 0, or (3) using a more advance technique, such as stochastic gradient descent to simulate SVD in populating the factored matrices. • Calculating the SVD matrices can be computationally expensive, although calculating ratings once the factorization is completed is very fast. You

# install.packages('pander')

Data Preperation

library(recommenderlab)

## Warning: package 'recommenderlab' was built under R version 3.6.3

## Loading required package: Matrix

## Loading required package: arules

## Warning: package 'arules' was built under R version 3.6.3

## 
## Attaching package: 'arules'

## The following objects are masked from 'package:base':
## 
##     abbreviate, write

## Loading required package: proxy

## Warning: package 'proxy' was built under R version 3.6.3

## 
## Attaching package: 'proxy'

## The following object is masked from 'package:Matrix':
## 
##     as.matrix

## The following objects are masked from 'package:stats':
## 
##     as.dist, dist

## The following object is masked from 'package:base':
## 
##     as.matrix

## Loading required package: registry

## Registered S3 methods overwritten by 'registry':
##   method               from 
##   print.registry_field proxy
##   print.registry_entry proxy

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.6.3

library(tidyverse)

## Warning: package 'tidyverse' was built under R version 3.6.3

## -- Attaching packages ------------------------------------------------------------------------------ tidyverse 1.3.0 --

## v tibble  3.0.0     v dplyr   0.8.5
## v tidyr   1.0.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## v purrr   0.3.3

## Warning: package 'tibble' was built under R version 3.6.3

## Warning: package 'tidyr' was built under R version 3.6.3

## Warning: package 'readr' was built under R version 3.6.3

## Warning: package 'purrr' was built under R version 3.6.3

## Warning: package 'dplyr' was built under R version 3.6.3

## Warning: package 'stringr' was built under R version 3.6.3

## Warning: package 'forcats' was built under R version 3.6.3

## -- Conflicts --------------------------------------------------------------------------------- tidyverse_conflicts() --
## x tidyr::expand() masks Matrix::expand()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## x tidyr::pack()   masks Matrix::pack()
## x dplyr::recode() masks arules::recode()
## x tidyr::unpack() masks Matrix::unpack()

data(MovieLense)
movielense <- MovieLense # Loading the movie datset
print(paste0("The dimensions of dataset :", nrow(movielense), " Movies x ",ncol(movielense)," Users " ))

## [1] "The dimensions of dataset :943 Movies x 1664 Users "

 ## to check if there is missing data point 
is.na(movielense)

## Warning in is.na(movielense): is.na() applied to non-(list or vector) of type
## 'S4'

## [1] FALSE

Data Visualization

mvector <- as.vector(movielense@data)

print(paste0("The Unique Numbers of rating is : "))

## [1] "The Unique Numbers of rating is : "

unique(mvector)

## [1] 5 4 0 3 1 2

print(paste0("The First 9 rows preview : "))

## [1] "The First 9 rows preview : "

head(mvector, 9)

## [1] 5 4 0 0 4 4 0 0 0

## to exclude the rating of 0 out 
mvector <- mvector[mvector != 0] 
print(paste0("The Unique Numbers of rating after excluding rating 0 is : "))

## [1] "The Unique Numbers of rating after excluding rating 0 is : "

unique(mvector)

## [1] 5 4 3 1 2

print(paste0("The First 9 rows preview : "))

## [1] "The First 9 rows preview : "

head(mvector, 9)

## [1] 5 4 4 4 4 3 1 5 4

mvector <- factor(mvector)
qplot(mvector) + ggtitle("Ratings Histogram") +  xlab("Ratings") + ylab("Counts")

We can see that the rating 4 is the most common one, followed by rating 3 , then by rating 5.

movie_watched <- data.frame(
    movie_name = names(colCounts(movielense)),
    watched_times = colCounts(movielense)
)
# dim(movie_watched)
head(movie_watched, 3)

##                          movie_name watched_times
## Toy Story (1995)   Toy Story (1995)           452
## GoldenEye (1995)   GoldenEye (1995)           131
## Four Rooms (1995) Four Rooms (1995)            90

top_ten_movies <- movie_watched[order(movie_watched$watched_times, decreasing = TRUE), ][1:10, ] 

ggplot(top_ten_movies)     + 
    aes( x=reorder (movie_name, -watched_times), y=watched_times)     +  
     geom_bar(stat = "identity")    + 
     xlab("Movie") + ylab("Count")     +  
     theme(axis.text = element_text(angle = 40, hjust = 1) )   +  
    ggtitle ("Top 10 Movie Watched" )

The 1977 Star Wars is the top watched movie, followed by Contact 1997.

## look at the first few ratings of the first user
head(as(movielense[1,], "list")[[1]])

##                                     Toy Story (1995) 
##                                                    5 
##                                     GoldenEye (1995) 
##                                                    3 
##                                    Four Rooms (1995) 
##                                                    4 
##                                    Get Shorty (1995) 
##                                                    3 
##                                       Copycat (1995) 
##                                                    3 
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 
##                                                    5

Next, we are limiting the data to only heavy users and popular movies only.

movielense <- movielense [rowCounts(movielense) > 30, colCounts(movielense) > 60]

print(paste0("The dimensions of the reduced dataset is :", nrow(movielense), " Movies x ",ncol(movielense)," Users " ))

## [1] "The dimensions of the reduced dataset is :726 Movies x 529 Users "

# print(paste0("Number of Rows (Uses) after filtering : ", nrow(movielense)))
# print(paste0("Number of Columns (Movies) after filtering : ", ncol(movielense)))

Now the matrix is reduced to 675 users by 519 movies.

Split Data into Training and Testing Data Sets

set.seed(2020)#seed as year
rows <- nrow(movielense)
# Selecting sample data with probility 0.75 and 0.25 for training and testing purpose
pobality_data <- sample(x = c(TRUE, FALSE), size = rows, replace = TRUE, prob = c(0.75, 0.25))
training_dataset <- movielense[pobality_data, ] #selecting data for training
testing_dataset <- movielense[!pobality_data, ] #selecting data for testing
print(paste0("Traing data has ", nrow(training_dataset)," rows"))

## [1] "Traing data has 544 rows"

print(paste0("Testing data has ", nrow(testing_dataset)," rows"))

## [1] "Testing data has 182 rows"

Singular Value Decomposition Model

We choose Singular Value Decompostion to train our model so that we can use it to recommend movies.

training_time <- system.time({
    model <- Recommender(data = training_dataset, method = "SVD") 
})
print("Model training time : ")

## [1] "Model training time : "

print(training_time)

##    user  system elapsed 
##    0.04    0.00    0.05

print(model)

## Recommender of type 'SVD' for 'realRatingMatrix' 
## learned using 544 users.

Prediction

predicted_top_four_movies <- predict(object = model, newdata = testing_dataset, n = 4)
predicted_top_four_movies_df <- data.frame (users = sort(rep(1:length(predicted_top_four_movies@items), 
              predicted_top_four_movies@n)), ratings = unlist(predicted_top_four_movies@ratings),
              index = unlist(predicted_top_four_movies@items))

Lets display the top 4 recommendations for first 6 users

predicted_top_four_movies_df$title <- predicted_top_four_movies@itemLabels[predicted_top_four_movies_df$index]

predicted_top_four_movies_df$year <- MovieLenseMeta$year[predicted_top_four_movies_df$index]

predicted_top_four_movies_df <- predicted_top_four_movies_df %>% group_by(users) %>% top_n(4,ratings)

predicted_top_four_movies_df[predicted_top_four_movies_df$users %in% (1:6), ]

## # A tibble: 24 x 5
## # Groups:   users [6]
##    users ratings index title                            year
##    <int>   <dbl> <int> <chr>                           <dbl>
##  1     1    4.53   360 It's a Wonderful Life (1946)     1995
##  2     1    4.24   131 Wrong Trousers, The (1993)       1961
##  3     1    4.10   373 Third Man, The (1949)            1994
##  4     1    4.09   301 Close Shave, A (1995)            1997
##  5     2    3.37   134 Empire Strikes Back, The (1980)  1941
##  6     2    3.26   136 Raiders of the Lost Ark (1981)   1939
##  7     2    3.20    67 Sleepless in Seattle (1993)      1994
##  8     2    3.18    19 Braveheart (1995)                1995
##  9     3    4.14   225 Leaving Las Vegas (1995)         1996
## 10     3    3.74   345 Trainspotting (1996)             1998
## # ... with 14 more rows

Ten-folds to validate the results

evaulation_set <- evaluationScheme(data = movielense, method = "cross-validation", k = 10, given = 20, goodRating = 3)
print(evaulation_set)

## Evaluation scheme with 20 items given
## Method: 'cross-validation' with 10 run(s).
## Good ratings: >=3.000000
## Data set: 726 x 529 rating matrix of class 'realRatingMatrix' with 74956 ratings.

library(pander)

## Warning: package 'pander' was built under R version 3.6.3

error_rate <- evaluate(x = evaulation_set, method = "SVD", type = "ratings")

## SVD run fold/sample [model time/prediction time]
##   1  [0.05sec/0.03sec] 
##   2  [0.2sec/0.02sec] 
##   3  [0.07sec/0.02sec] 
##   4  [0.05sec/0.02sec] 
##   5  [0.05sec/0.01sec] 
##   6  [0.07sec/0.01sec] 
##   7  [0.03sec/0.03sec] 
##   8  [0.04sec/0.02sec] 
##   9  [0.21sec/0sec] 
##   10  [0.05sec/0.01sec]

model_computed <- as.data.frame(sapply(avg(error_rate), rbind))
model_computed <- as.data.frame(t(as.matrix(model_computed)))
colnames(model_computed) <- c("RMSE", "MSE", "MAE")
row.names(model_computed) <- NULL

pander(model_computed, caption = "Model Prediction")

Model Prediction
RMSE	MSE	MAE
1.02	1.041	0.813

# Error in pander(model_computed, caption = "Model Prediction") : 
#   could not find function "pander"