In Project 2, we implmented 2 different types of Collaborative Filtering systems:

  1. User Based Collaborative Filtering Systems (UBCF)

  2. Item Based Collaborative Filtering Systems (IBCF)

A comparative analysis was then performed on the two different approaches using these different algorithms

Note: We shall follow the spirit & methodologies as prescribed in Chaper 3 from the book:

“BUILDING_A_RECOMMENDATION_SYSTEM_WITH_R”

Reference:

Movielens Data source was downloaded from Kaggle website: https://www.kaggle.com/prajitdatta/movielens-100k-dataset?


Load libraries

library(Matrix)
library(reshape2)
library(data.table)
library(tidyr)
library(dplyr)
library(kableExtra)
library("scales")
library("recommenderlab")
library(tidytext)
library(psych)
library(knitr)
library(ggplot2)
require(ggthemes)

Loading & Merging 3 separate Data tables

url3 <-'https://raw.githubusercontent.com/ssufian/DAT-612/master/u.item.csv'

#User Table contains User (viewers) information
users <- read.table("https://raw.githubusercontent.com/ssufian/DAT-612/master/u.user.data", sep='')
colnames(users) <- c('user_id', 'age', 'sex', 'occupation', 'zip_code') #Assign column names
#u.Data contains Ratings by users and associated movie ID nformation
ratings <- read.table('https://raw.githubusercontent.com/ssufian/DAT-612/master/u.data', sep='')
colnames(ratings) <- c('user_id', 'movie_id', 'rating', 'unix_timestamp')
#u.item (csv file in Github) contains Movie information such as Titles, Movie ID, releae dates and its imdb urls
movies <- read.csv(file=url3, sep=",",na.strings = c("NA","",""),strip.white = T, stringsAsFactors = F, header=F)
colnames(movies) <- c('movie_id', 'title', 'release_date', 'imdb_url')
movie_ratings <- merge(movies, ratings, by="movie_id")
lens <- merge(movie_ratings, users, by="user_id") %>% select(-c(unix_timestamp,age,sex,occupation,zip_code,imdb_url,release_date))

Creating Utility Matrix

# Convert to matrix
UIMatrix <- lens %>% spread(title, rating)
# The utilitiy matrix is really Sparse!  That is why is comented out to Not take up space in Rpubs
#head(UIMatrix)

Observation 1:


I: Creating an Item_Item Matrix from ratings table

# Making the Item_Item Matrix into Wide Format
IMatrix <- ratings %>% 
  select(-unix_timestamp) %>% 
  spread(movie_id, rating)

Coercing the Item_Item Matrix into a class called “RealRatingMatrix”

IMatrix <- IMatrix[-c(1)]
#Forcinig the IMatrix to belong to the class called "realRatingMatrix" per Chp 3 
mMatrix <- as(as.matrix(IMatrix), "realRatingMatrix")

mMatrix
## 943 x 1682 rating matrix of class 'realRatingMatrix' with 100000 ratings.
#checking the class out
methods(class = class(mMatrix))
##  [1] [                      [<-                    binarize              
##  [4] calcPredictionAccuracy coerce                 colCounts             
##  [7] colMeans               colSds                 colSums               
## [10] denormalize            dim                    dimnames              
## [13] dimnames<-             dissimilarity          evaluationScheme      
## [16] getData.frame          getList                getNormalize          
## [19] getRatingMatrix        getRatings             getTopNLists          
## [22] image                  normalize              nratings              
## [25] Recommender            removeKnownRatings     rowCounts             
## [28] rowMeans               rowSds                 rowSums               
## [31] sample                 show                   similarity            
## see '?methods' for accessing help and source code

Observation 2:


object.size(mMatrix)
## 1316744 bytes
object.size(as(mMatrix, "matrix"))
## 12797096 bytes

Exploring the values & range of the raw data: values of original ratings

vector_ratings <- as.vector(mMatrix@data)
unique(vector_ratings)
## [1] 5 4 0 3 1 2
table_ratings <- table(vector_ratings)
table_ratings
## vector_ratings
##       0       1       2       3       4       5 
## 1486126    6110   11370   27145   34174   21201
vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings,fill = vector_ratings)+ ggtitle("Fig1: Distribution of ratings > 0")+theme_economist()

Observation 3:


Data preparation

Because of these 2 reasons:

  1. user ratings with 100 movies or and movies that have been viewed at least 100 times

  2. Top quantile movie viewers and movies viewed; top 2% percentile


Selecting the most relevant data

#Data relevancy setting min thresholds of 100 respectively
ratings_movies <- mMatrix[rowCounts(mMatrix) > 100,colCounts(mMatrix) > 100] 

# Filtering on top quantile of movies viewed and movie viewers; in this case top 2 percent viewers and moives that were viewed
min_movies <- quantile(rowCounts(ratings_movies), 0.98)
min_users <- quantile(colCounts(ratings_movies), 0.98)

#picking only the top 2% and see how their ratings looks like
top_quantile_ratings_movies <-ratings_movies[rowCounts(ratings_movies) > min_movies,colCounts(ratings_movies) > min_users]
top_quantile_ratings_per_user <- rowMeans(top_quantile_ratings_movies )

#average ratings of relevant raters
average_ratings_per_user <- rowMeans(ratings_movies )
qplot(average_ratings_per_user,colour = "red") + stat_bin(binwidth = 0.1) + 
ggtitle("Fig2a: Distribution of movies ratings & movies viewed > 100 times")+theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

qplot(top_quantile_ratings_per_user,fill="yellow" ) + stat_bin(binwidth = .01) +
ggtitle("Fig2b: Distribution of the top 2% percentile ratings per user")+theme_economist()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Observation 4:


Normalizing the data

#The prebuilt normalize function does it automatically
# This is to "unbiased"  extremely low or high ratings by a few users
ratings_movies_norm <- normalize(ratings_movies)
avg <- round(rowMeans(ratings_movies_norm),5)
table(avg)
## avg
##   0 
## 361

Here’s a view of non-normalized vs. normalized datasets

minItems <- quantile(rowCounts(ratings_movies), 0.95)
minUsers <- quantile(colCounts(ratings_movies), 0.95)

image(ratings_movies[rowCounts(ratings_movies) > minItems, 
                 colCounts(ratings_movies) > minUsers],main = "Fig3a:Heatmap of relevant Users & Movies (Non-Normalized)")

image(ratings_movies_norm [rowCounts(ratings_movies_norm ) > minItems, 
                 colCounts(ratings_movies_norm ) > minUsers], 
      main = "Fig3b: Heatmap of relevant Users & Movies (Normalized)")

Observation 5:

The first difference is the colors, and this is because the data is continuous. Previously, the rating was an integer between 1 and 5. After the normalization, the rating can ranged between -5 and +5


II: Item-Item Collaborative Filtering (IBCF)

set.seed(1)
which_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies),replace = TRUE, prob = c(0.8, 0.2))

movieTrain <- ratings_movies[which_train, ]
movieTest <- ratings_movies[!which_train, ]
movieTrain
## 297 x 334 rating matrix of class 'realRatingMatrix' with 37357 ratings.
movieTest
## 64 x 334 rating matrix of class 'realRatingMatrix' with 8100 ratings.

Parameters of the IBCF method

recc_model <-Recommender(movieTrain, method = "IBCF" ,parameter=list(normalize = "Z-score",method="Jaccard",k=30)) 

model_details <- getModel(recc_model)$sim

wMax <- order(colSums(model_details > 0), decreasing = TRUE)[1:10]

topMovies <- as.data.frame(as.integer(rownames(model_details)[wMax]))
colnames(topMovies) <- c("movie_id")


data <- inner_join(topMovies,movies, by = "movie_id") %>% select(Movie = "title")

kable(data , format = "pandoc", digits = 3,align= "l",caption = "Fig3a- IBCF: Top 10 movies that are similar to other moviess",font_size=12)
Fig3a- IBCF: Top 10 movies that are similar to other moviess
Movie
Benny & Joon (1993)
Father of the Bride Part II (1995)
Happy Gilmore (1996)
Michael (1996)
Peacemaker, The (1997)
Down Periscope (1996)
Craft, The (1996)
Grumpier Old Men (1995)
Multiplicity (1996)
Singin’ in the Rain (1952)

Applying the recommender model on the test set

n_recommended <- 10#number of reccomendations

recc_predicted <- predict(object = recc_model, newdata = movieTest, n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 64 users.

Recommendation for the 1st Viewer

# Recommendations for the first user
recommended <- recc_predicted@itemLabels[recc_predicted@items[[1]]]
recommended <- as.data.frame(as.integer(recommended))
colnames(recommended) <- c("movie_id")
data2 <- recommended %>% inner_join(movies, by = "movie_id") %>% select(Movie = "title")
kable(data2 , format = "pandoc", digits = 3,align= "l",caption = "Fig3c-IBCF: : Movies recommended to 1st viewer",font_size=12)
Fig3c-IBCF: : Movies recommended to 1st viewer
Movie
Singin’ in the Rain (1952)
Benny & Joon (1993)
Saint, The (1997)
Father of the Bride Part II (1995)
Happy Gilmore (1996)
Michael (1996)
Peacemaker, The (1997)
Down Periscope (1996)
Craft, The (1996)
Grumpier Old Men (1995)

Observation 6:

The Top 10 similar movies from training data set matches about 70% to the recommended Top 10 movies predicted on the test data set. Not sure if this was “random luck” or the training went really well and was able to generalize to the testing data set?


III: User_User Collaborative Filtering (UBCF)

  1. Measure how similar each user is to the new one. Like IBCF, popular similarity measures are correlation and cosine
(recc_model_UBCF <- Recommender(movieTrain, method = "UBCF",parameter=list(normalize ="Z-score",method="Jaccard",nn=10)))
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 297 users.
n_recommended1 <- 10
recc_predicted_UBCF <- predict(object = recc_model_UBCF,newdata = movieTest, n = n_recommended1) 

recc_predicted_UBCF
## Recommendations as 'topNList' with n = 10 for 64 users.

Repeating the same process

recommended2 <- recc_predicted_UBCF@itemLabels[recc_predicted_UBCF@items[[1]]]
recommended2 <- as.data.frame(as.integer(recommended2))
colnames(recommended2) <- c("movie_id")
data4 <- recommended2 %>% inner_join(movies, by = "movie_id") %>% select(Movie = "title")

kable(data4 , format = "pandoc", digits = 3,align= "l",caption = "Fig4-UBCF: Movies recommended to 1st viewer",font_size=12)
Fig4-UBCF: Movies recommended to 1st viewer
Movie
Close Shave, A (1995)
Lone Star (1996)
Postino, Il (1994)
Toy Story (1995)
Big Night (1996)
Wrong Trousers, The (1993)
This Is Spinal Tap (1984)
Philadelphia Story, The (1940)
Cinema Paradiso (1988)
Mr. Holland’s Opus (1995)

Evalauation of IBCF vs. UBCF Filtering on ratings = 5

xval_split <- evaluationScheme(ratings_movies, method = "split", train = 0.8,given=3,goodRating = 5)

eval_prediction_IBCF_split <- predict(recc_model, getData(xval_split, "known"), type = "ratings")
eval_prediction_UBCF_split <- predict(recc_model_UBCF, getData(xval_split, "known"), type = "ratings")

err_eval_split <- rbind(
  IBCF = calcPredictionAccuracy(eval_prediction_IBCF_split, getData(xval_split , "unknown")),
  UBCF = calcPredictionAccuracy(eval_prediction_UBCF_split , getData(xval_split , "unknown")))

err_eval_split
##          RMSE      MSE       MAE
## IBCF 1.316658 1.733588 0.9713193
## UBCF 1.090818 1.189885 0.8708852
xval_kfold <- evaluationScheme(ratings_movies, method = "cross-validation", k=5,given=15,goodRating = 5)

eval_prediction_IBCF <- predict(recc_model, getData(xval_kfold , "known"), type = "ratings")
eval_prediction_UBCF <- predict(recc_model_UBCF, getData(xval_kfold , "known"), type = "ratings")

err_eval <- rbind(
  IBCF = calcPredictionAccuracy(eval_prediction_IBCF , getData(xval_kfold , "unknown")),
  UBCF = calcPredictionAccuracy(eval_prediction_UBCF  , getData(xval_kfold , "unknown")))

err_eval
##          RMSE      MSE       MAE
## IBCF 1.350568 1.824033 1.0098408
## UBCF 1.007187 1.014426 0.7975767

Summary of Results

This project focused on collaborative filtering because its one of the most basic and popular techniques of recommendation systems.  Also, its the only one that is supported by recommendrlab which makes it very convenient for beginners like myself to learn about the technique via its pre-built functionalities.  The project followed very closely the spirit of chp 3 in the book: "BUILDING_A_RECOMMENDATION_SYSTEM_WITH_R".

The data preparation process converted a highly sparse matrix to a more palatable "compact and dense" matrix by filtering only on the relevant data subset of viewers and rated movies greater than 100 times.  This was followed by normalization which made the ratings less biased.  

Item Based Collabortive Filtering (IBCF):

- Training on IBCF was based on a 80/20 split ratio
- Recommender Model were based on K=30 (similar items), Jaccard similarity function and normalized on z-scores
- For first viewer only, the Top 10 similar movies matches quite well (70%) with the predicted (recommended) movies in the test set
-Based on ratings of 5, the overall RSME using split ratio of 80/20 and K-fold of 5 were 1.38 & 1.35 respectively

User Based Collabortive Filtering (UBCF):

- Training on IBCF was based on a 80/20 split ratio
- Recommender Model were based on K=30 (similar items), Jaccard similarity function and normalized on z-scores
- For first viewer only, the Top 10 movies predicted (recommended) were nothing like in the training set 
-Based on ratings of 5, the overall RSME using split ratio of 80/20 and K-fold of 5 were 1.10 & 0.98 respectively

Conclusion

This project followed the steps as prescribe from chp 3 in the book. Recommenderlab was a really handy tool to visualize the inner workings of the IBCF and UBCF techniques. According to the book, UBCF’s accuracy is proven to be slightly more accurate than IBCF. However, the draw back of UBCF is in the presence of large ratings matrix and the trade-offs between computational intensities of the 2 methods. In this instance, the overall RSME in UBCF was better than IBCF but it’s worth to expand on this and perform addtional work to see if that kind of generialization is the case. Therefore, it is important to note that this is a learning exercise. More development and testing would be needed to really ascertain the reliability of the recommended results.