Data 612 Final Project

Overview

For this project, the goal was to develop and deploy a Collaborative Filtering Recommender System (CFR) for Movie Recommendations.

A Collaborative Filtering approach consists of only the User’s Preferences, therefore, does not factors in the values or features of the particular variable being recommended.

In addition, the Movie Lens Dataset was used to gerenate values for this Recommendation System.

After the Trained CFR Model was successfully implemented, this system was deployed in a Shiny R Application.

The Data

The data for this project is the MovieLens Dataset, which can be found here here.

Containing 105339 ratings and 6138 tag applications, across 10329 movies, rated by 668 users.

The zipfile downloaded from the above link contained four files: links.csv, movies.csv, ratings.csv and tags.csv.

This system implements the use of the files movies.csv and ratings.csv.

movies <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/final_project/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/final_project/ratings.csv")

Decscription of movies file:

head(movies)

Summary of of ratings file:

summary(ratings$rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   0.500   3.000   3.500   3.517   4.000   5.000

head(ratings)

In for this to function properly, both the usersId and movieId will have to be changed from their data types of integers to factors.

In addition, the genres of the movies will ne to be reformatted.

Data Processing

First, the movie’s genres will have to be converted into a one-hot encoding format.

This will serve as the backbone of how users will be able to search for the movies they have watched within specific genres in the long format.

Extract a list of genres

g <- as.data.frame(movies$g, stringsAsFactors=FALSE)
library(data.table)
g2 <- as.data.frame(tstrsplit(g[,1], '[|]', 
                                   type.convert=TRUE), 
                         stringsAsFactors=FALSE)
colnames(g2) <- c(1:10)

gl <- c("Action", "Adventure", "Animation", "Children", 
                "Comedy", "Crime","Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery","Romance",
                "Sci-Fi", "Thriller", "War", "Western") # we have 18 genres in total

gm <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
gm[1,] <- gl #set first row to genre list
colnames(gm) <- gl #set column names to genre list

#iterate through matrix
for (i in 1:nrow(g2)) {
  for (c in 1:ncol(g2)) {
    genmat_col = which(gm[1,] == g2[i,c])
    gm[i+1,genmat_col] <- 1
  }
}

#convert into dataframe
gm2 <- as.data.frame(gm[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(gm2)) {
  gm2[,c] <- as.integer(gm2[,c])
} #convert from characters to integers

head(gm2)

Matrix of Movies and their Genres

The creation of a search matrix will act as a database of a movie by their genre(s).

search_movies <- cbind(movies[,1:2], g2)
head(search_movies)

From here we can see the data begin to grow in size and sparcity.

Now each movie will correspond to one or more genres.

realRatingMatrix Creation

This project utilizes the use of the recommenderlab R package.

In order to build a recommendation engine within recommenderlab, the conversion of the newly created ratings matrix into a Sparse Matrix known as a realRatingMatrix.

rm <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
rm <- as.matrix(rm[,-1])

rm <- as(rm, "realRatingMatrix")
rm

## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.

Recommendation Models

The recommenderlab package contains preconstructed models for the use of recommendation algorithms:

recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)

##  [1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
##  [3] "IBCF_realRatingMatrix"         "LIBMF_realRatingMatrix"       
##  [5] "POPULAR_realRatingMatrix"      "RANDOM_realRatingMatrix"      
##  [7] "RERECOMMEND_realRatingMatrix"  "SVD_realRatingMatrix"         
##  [9] "SVDF_realRatingMatrix"         "UBCF_realRatingMatrix"

lapply(recommender_models, "[[", "description")

## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."

This project will utilize both the IBCF and UBCF Models for comparison and performance.

recommender_models$IBCF_realRatingMatrix$parameters

## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

recommender_models$UBCF_realRatingMatrix$parameters

## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"

Collaborative Filtering is based on the measuring between the similarity of users or between items.

Within recommenderlab, the supported methods to computate similarities are cosine, pearson, and jaccard.

Exploration of Similarity Data

Next, determing how similar the first four users are with each other.

similarity_users <- similarity(rm[1:4, ], 
                               method = "cosine", 
                               which = "users")
as.matrix(similarity_users)

##           1         2         3         4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000

image(as.matrix(similarity_users), main = "User similarity")

Using the same approach, for the first four movies.

similarity_items <- similarity(rm[, 1:4], method =
                                 "cosine", which = "items")
as.matrix(similarity_items)

##           1         2         3         4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000

image(as.matrix(similarity_items), main = "Movies similarity")

Data Exploration Continued

Now, exploring the second data file’s values of ratings.

rating_values <- as.vector(rm@data)
unique(rating_values) # what are unique values of ratings

##  [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5

count_ratings <- table(rating_values) # what is the count of each rating value
count_ratings

## rating_values
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5 
## 6791761    1198    3258    1567    7943    5484   21729   12237   28880    8187 
##       5 
##   14856

There are 11 unique rating values.

Distribution of Ratings

Arating equal to 0 represents a missing value, therefore, them from the dataset before visualizing the results.

rating_values <- rating_values[rating_values != 0]
rating_values <- factor(rating_values)

qplot(rating_values) + 
  ggtitle("Ratings Distrubtions")

The most common rating is \(4\).

The majority of movies are rated with a score of 3 or higher.

Number of Views ~ Top Movies

views_per_movie <- colCounts(rm)

count_views <- data.frame(movie = names(views_per_movie),
                          views = views_per_movie)
count_views <- count_views[order(count_views$views, 
                                 decreasing = TRUE), ]
count_views$title <- NA

for (i in 1:10325){
  count_views[i,3] <- as.character(subset(movies, 
                                         movies$movieId == count_views[i,1])$title)
}

count_views[1:6,]

ggplot(count_views[1:6, ], aes(x = title, y = views)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
ggtitle("Views of Top Movies")

“Pulp Fiction (1994)” is the most watched Movie, with “Forrest Gump (1994)” being the second.

Distribution of the Average Ratings

To find the Top-Rated Movies, the average rating for each was calculated.

average_ratings <- colMeans(rm)

qplot(average_ratings) + 
  stat_bin(binwidth = 0.1) +
  ggtitle("Distribution of the average movie rating")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

average_ratings_relevant <- average_ratings[views_per_movie > 50] 
qplot(average_ratings_relevant) + 
  stat_bin(binwidth = 0.1) +
  ggtitle(paste("Distribution of Relevant Average Ratings"))

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The first graph represents the distribution of the average movie rating. The highest value is ~ 3, with a few movies whose rating is either 1 or 5.

This is most likely due to the fact that these movies received a rating from only a few users, thererfore we should exclude these ratings.

The movies where number of views is below the defined threshold of 50 we removed. This creates a more narrow subset of the most relevant movies.

The second graph represents the distribution of the relevant average ratings. The rankings are between 2.16 and 4.45. The highest value changes, and now it is ~ 4.

Data Preparation

Part 1 - Revevant Data

In order to select relevant data, defining the minimum number of users per rated movie and the minimum views per movie as 50

movie_ratings <- rm[rowCounts(rm) > 50,
                             colCounts(rm) > 50]
movie_ratings

## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.

#rm

The previous rating-matrix had 668 users and 10325 movies, now the newly create most relevant rating-matrix contains 420 users and 447 movies.

average_ratings_per_user <- rowMeans(movie_ratings)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
  ggtitle("Distribution of Average Ratings, per User")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Part 2 - Normalization

When dealing with a user pool who rates at a high or low ratings can result in bias.

In an effort to circumvent this problem, the normalization of the data was need

movie_ratings_norm <- normalize(movie_ratings)
sum(rowMeans(movie_ratings_norm) > 0.00001)

## [1] 0

Now, I visualize the normalized matrix for the top movies. It is colored now because the data is continuous:

min_movies <- quantile(rowCounts(movie_ratings), 0.95)
min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_norm[rowCounts(movie_ratings) > min_movies,
                          colCounts(movie_ratings) > min_users], 
main = "Heatmap - Top Users & Movies")

There are still some lines that seem to be more blue or more red.

This is due to the above chart is visualizing only the top movies.

However, the average rating is still 0 for each user.

Part 3 - Convert Data to Binary

In order for the recommendation models to work well with the data we already having, conversion to binary data, will be useful. This is done by defining a matrixta encompassing 0’s and 1’s.

The 0’s will be either treated as missing values or as bad ratings.

Set-up

Define a matrix having 1 if the user rated the movie, and 0 otherwise.
- In this case, the information about the rating is lost.
Define a matrix having 1 if the rating is above or equal to a definite threshold (for example, 3), and 0 otherwise.
- In this case, giving a bad rating to a movie is equivalent to not having rated it.

Depending on the context, one option might be more suited to the needs of the models, depending on the context.

Next, defining of two matrices following the two different “set-up” options which visualize 5% portion of each of newly created binary matrices.

Option 1: Define a matrix equal to 1, if the movie has been watched

movie_ratings_watched <- binarize(movie_ratings, minRating = 1)
boolean_min_movies <- quantile(rowCounts(movie_ratings), 0.95)
boolean_min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_watched[rowCounts(movie_ratings) > boolean_min_movies,
                             colCounts(movie_ratings) > boolean_min_users], 
main = "Heatmap - Top Users & Movies")

Option 2: Define a matrix equal to 1 if the cell has a rating above the threshold

movie_ratings_good <- binarize(movie_ratings, minRating = 3)
image(movie_ratings_good[rowCounts(movie_ratings) > boolean_min_movies, 
colCounts(movie_ratings) > boolean_min_users], 
main = "Heatmap - Top Users & Movies")

In the second heatmap, there are more white cells which means that there are more movies with no or bad ratings than movies that were not viewed.

ITEM-BASED Collaborative Filtering Model

For this type of model, we will first need to create a rating-matrix, which rows corresponds to users and columns corresponds to items.

This approach is based on:

For each two items, measure similar ratings by similar users
For each item, identify the k most similar items
For each user, identify the items that are most similar to the user’s ratings or reviews

Train & Test Sets

Built the model using \(80%\) of the total dataset as a training set, and \(20%\) as a test set.

which_train <- sample(x = c(TRUE, FALSE), 
                      size = nrow(movie_ratings),
                      replace = TRUE, 
                      prob = c(0.8, 0.2))

movie_train <- movie_ratings[which_train, ]
movie_test <- movie_ratings[!which_train, ]

Build Model

Let’s have a look at the default parameters of IBCF model. Here, k is the number of items to compute the similarities among them in the first step. After, for each item, the algorithm identifies its k most similar items and stores the number. method is a similarity funtion, which is Cosine by default, may also be pearson. I create the model using the default parameters of method = Cosine and k=30.

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters

## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

movie_model <- Recommender(data = movie_train, 
                          method = "IBCF",
                          parameter = list(k = 30))

movie_model

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 340 users.

class(movie_model)

## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"

Exploring Model

results_model <- getModel(movie_model)

class(results_model$sim)

## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"

dim(results_model$sim)

## [1] 447 447

n_items_top <- 20
image(results_model$sim[1:n_items_top, 1:n_items_top],
      main = "Heatmap - First Rows and Columns")

row_sums <- rowSums(results_model$sim > 0)
table(row_sums)

## row_sums
##  30 
## 447

col_sums <- colSums(results_model$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of Column Count")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This newly created dgCMatrix similarity matrix has dimensions are 447 x 447, which is equal to the number of items.

The Heatmap shows 20 first items show that many values are equal to 0.

This is due to each row contains only \(k = 30\) elements that are greater than 0.

The number of non-null elements for each column depends on the amount the corresponding movie was included in the top k of another movie.

Therfore, this matrix is not simmetric, which is also the same in our model.

The chart of the distribution of the number of elements respresents, by column, that there are a few movies that are similar to others.

IBCF Recommendation System Implementation

n_recommended <- 10 

model_predictions <- predict(object = movie_model, 
                          newdata = movie_test, 
                          n = n_recommended)

model_predictions

## Recommendations as 'topNList' with n = 10 for 80 users.

Let’s explore the results of the recommendations for the First User

user_recommendation_1 <- model_predictions@items[[1]] 
movies_user_1 <- model_predictions@itemLabels[user_recommendation_1]
movies_user_2 <- movies_user_1

for (i in 1:10){
  movies_user_2[i] <- as.character(subset(movies, 
                                         movies$movieId == movies_user_1[i])$title)
}

movies_user_2

##  [1] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
##  [2] "Wedding Singer, The (1998)"                                                 
##  [3] "Clueless (1995)"                                                            
##  [4] "Austin Powers: The Spy Who Shagged Me (1999)"                               
##  [5] "Star Trek: Generations (1994)"                                              
##  [6] "Bridget Jones's Diary (2001)"                                               
##  [7] "Austin Powers: International Man of Mystery (1997)"                         
##  [8] "Robin Hood: Men in Tights (1993)"                                           
##  [9] "Leaving Las Vegas (1995)"                                                   
## [10] "GoldenEye (1995)"

matrix_recommendation <- sapply(model_predictions@items, 
                      function(x){ as.integer(colnames(movie_ratings)[x]) }) 
matrix_recommendation[,1:4]

##       [,1] [,2]  [,3] [,4]
##  [1,]  750   25  8665    1
##  [2,] 1777   32  2424    2
##  [3,]   39  104 48780    3
##  [4,] 2683  110  5816    5
##  [5,]  329  161  3703    7
##  [6,] 4246  165  2542   10
##  [7,] 1517  223 60069   11
##  [8,]  520  231  3114   16
##  [9,]   25  317   784   17
## [10,]   10  318  3421   19

Here, the columns represent the first 4 users, and the rows are the movieId values of recommended 10 movies.

Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:

num_items <- factor(table(matrix_recommendation))

chart_title <- "Distribution of Items for IBCF"
qplot(num_items) + ggtitle(chart_title)

num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)),
                       num_items_top)

for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}

colnames(table_top) <- c("Movie Title", "# Items")
head(table_top)

Most of the movies have been recommended only a few times, and a few movies have been recommended more than 5 times.

IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it’s built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data.

In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.

USER-BASED Collaborative Filtering Model

Now, I will use the user-based approach. According to this approach, given a new user, its similar users are first identified. Then, the top-rated items rated by similar users are recommended.

For each new user, these are the steps:

Measure how similar each user is to the new one. Like IBCF, popular similarity measures are correlation and cosine.
Identify the most similar users. The options are:
- Take account of the top k users (k-nearest_neighbors)
- Take account of the users whose similarity is above a defined threshold
Rate the movies rated by the most similar users. The rating is the average rating among similar users and the approaches are:
- Average rating
- Weighted average rating, using the similarities as weights
Pick the top-rated movies.

Build Model

Again, let’s first check the default parameters of UBCF model. Here, nn is a number of similar users, and method is a similarity function, which is cosine by default. I build a recommender model leaving the parameters to their defaults and using the training set.

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters

## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"

movie_model <- Recommender(data = movie_train, method = "UBCF")
movie_model

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 340 users.

results_model <- getModel(movie_model)

results_model$data

## 340 x 447 rating matrix of class 'realRatingMatrix' with 30415 ratings.
## Normalized using center on rows.

UBCF Recommendation System Implementation

In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.

n_recommended <- 10
model_predictions <- predict(object = movie_model,
                          newdata = movie_test, 
                          n = n_recommended) 
model_predictions

## Recommendations as 'topNList' with n = 10 for 80 users.

Explore results

First Four Users

matrix_recommendation <- sapply(model_predictions@items, 
                      function(x){ as.integer(colnames(movie_ratings)[x]) })

matrix_recommendation[, 1:4]

##        [,1] [,2] [,3] [,4]
##  [1,]  1276  296  318  593
##  [2,]  1250  356   50  318
##  [3,]  1252  260   11  356
##  [4,]  1288  527  357   47
##  [5,]     6  318  265  110
##  [6,]   778   34 1193   11
##  [7,] 51662  778 1183  595
##  [8,]  2617 1210  497  480
##  [9,]  2692   50   17  527
## [10,]  1247  150  141 2571

The matrix above contains contain the movieId of each recommended movie, rows, for the first four users, cloumns, in the test dataset.

Frequency Histogram

num_items <- factor(table(matrix_recommendation))

chart_title <- "Distribution of Items for UBCF"
qplot(num_items) + ggtitle(chart_title)

Compared with the IBCF, the distribution of some movies that are recommended much more often than the others.

The maximum is more than 30, compared to \(10\) for IBCF.

Top Movie Titles

num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)), num_items_top)

for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie Titles", "# Items")
head(table_top)

Comparison of the outcomes of both the UBCF with the IBCF, aids in finding useful insight on different methods.

The UBCF needs to access the initial data. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix.

In addition, building the similarity matrix requires a lot of computing power and time.

However, UBCF’s accuracy is proven to be slightly more accurate than IBCF (I will also discuss it in the next section), so it’s a good option if the dataset is not too big.

Evaluatiion of the Recommender Systems

In order to compare the models’ performances and choose the best suited model:

Prepare the data to evaluate performance
Evaluate the performance of some models
Choose the best performing models
Optimize model parameters

Data Preparation for the data to Evaluate Models

Splitting the data into Training and Test Sets
Bootstrapping Data
K-Fold Approach

Splitting Data

Splitting the data into Training and Test Sets at a 80/20 proportion.

train_percent <- 0.8

First, for each user in the test set, we need to define how the number of moviess to use, in order to generate recommendations.

To achieve this we need to check the minimum number of movies rated by users to be sure there will be no users with no movie to test.

min(rowCounts(movie_ratings))

## [1] 8

keep <- 5 
best_rating <- 3 
n_eval <- 1

eval_sets <- evaluationScheme(data = movie_ratings, 
                              method = "split",
                              train = train_percent, 
                              given = keep, 
                              goodRating = best_rating, k = n_eval)

eval_sets

## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.

getData(eval_sets, "train")

## 336 x 447 rating matrix of class 'realRatingMatrix' with 30369 ratings.

getData(eval_sets, "known")

## 84 x 447 rating matrix of class 'realRatingMatrix' with 420 ratings.

getData(eval_sets, "unknown")

## 84 x 447 rating matrix of class 'realRatingMatrix' with 7552 ratings.

qplot(rowCounts(getData(eval_sets, "unknown"))) + 
  geom_histogram(binwidth = 10) + 
  ggtitle("unknown moviess by the users")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

This represents the unknown moviess by the users, varying quite a lot.

Bootstrapping

The same user can be sampled more than once during bootstraping.

When the training set has the same size previously, this will be more users in the test set.

eval_sets <- evaluationScheme(data = movie_ratings, 
                              method = "bootstrap", 
                              train = train_percent, 
                              given = keep,
                              goodRating = best_rating, 
                              k = n_eval)

table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) + 
  ggtitle("Repetitions in the Training Set")

Respresents that most of the users that have sampled lower than four times.

K-Fold Cross-Validation Approach

K-Fold Cross-Validation yields is the most accurate, however, is the most resource intensive.

n_total <- 4
eval_sets <- evaluationScheme(data = movie_ratings, 
                              method = "cross-validation",
                              k = n_total, 
                              given = keep, 
                              goodRating = best_rating)
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets

## [1] 315 315 315 315

When using the K-Fold approach, results in four sets of the same size 315.

Evaluation of Ratings

K-Fold Approach is used for evaluation of the results:

First, re-defining the evaluation sets.
Build IBCF model
Creating a matrix with predicted ratings.

eval_sets <- evaluationScheme(data = movie_ratings, 
                              method = "cross-validation",
                              k = n_total, 
                              given = keep, 
                              goodRating = best_rating)

evaluation_for_model <- "IBCF"
model_parameters <- NULL

evaluate_model <- Recommender(data = getData(eval_sets, "train"),
                                method = evaluation_for_model, parameter = model_parameters)

ri <- 10
evaluate_prediction <- predict(object = evaluate_model, 
                           newdata = getData(eval_sets, "known"), 
                           n = ri, 
                           type = "ratings")

qplot(rowCounts(evaluate_prediction)) + 
  geom_histogram(binwidth = 10) +
  ggtitle("Distribution of Movies, per user")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Represents the distribution of movies per user in the matrix of predicted ratings.

Compute Accuracy

Most of the RMSEs (Root mean square errors) are in the range of 0.5 to 1.8:

evaluate_accuracy <- calcPredictionAccuracy(x = evaluate_prediction, 
                                        data = getData(eval_sets, "unknown"), 
                                        byUser = TRUE)
head(evaluate_accuracy)

##           RMSE       MSE       MAE
## [1,] 1.0399991 1.0815981 0.7794960
## [2,] 2.0816660 4.3333333 2.0000000
## [3,] 1.0787390 1.1636779 0.7935005
## [4,] 1.1072313 1.2259611 0.8281737
## [5,] 0.6859943 0.4705882 0.4705882
## [6,] 0.6653993 0.4427563 0.5689712

qplot(evaluate_accuracy[, "RMSE"]) + 
  geom_histogram(binwidth = 0.1) +
  ggtitle("Distribution of  RMSE, by user")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

## Warning: Removed 3 rows containing non-finite values (stat_bin).

## Warning: Removed 3 rows containing non-finite values (stat_bin).

evaluate_accuracy <- calcPredictionAccuracy(x = evaluate_prediction, 
                                        data = getData(eval_sets, "unknown"), 
                                        byUser = FALSE) 
evaluate_accuracy

##     RMSE      MSE      MAE 
## 1.415940 2.004886 1.038826

Evaluation of Recommendations

results <- evaluate(x = eval_sets, 
                    method = evaluation_for_model, 
                    n = seq(10, 100, 10))

## IBCF run fold/sample [model time/prediction time]
##   1  [0.25sec/0.03sec] 
##   2  [0.26sec/0.04sec] 
##   3  [0.3sec/0.03sec] 
##   4  [0.34sec/0.05sec]

head(getConfusionMatrix(results)[[1]])

##          TP        FP       FN       TN precision     recall        TPR
## 10 1.676190  8.038095 78.42857 353.8571 0.1725490 0.01897097 0.01897097
## 20 3.571429 15.857143 76.53333 346.0381 0.1838235 0.04110615 0.04110615
## 30 5.209524 23.742857 74.89524 338.1524 0.1798257 0.06062756 0.06062756
## 40 6.733333 31.428571 73.37143 330.4667 0.1763070 0.07755275 0.07755275
## 50 8.266667 38.704762 71.83810 323.1905 0.1764674 0.09480993 0.09480993
## 60 9.695238 45.542857 70.40952 316.3524 0.1766989 0.11216576 0.11216576
##           FPR
## 10 0.02244192
## 20 0.04419473
## 30 0.06604743
## 40 0.08737263
## 50 0.10741809
## 60 0.12653896

sun_col <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, sun_col]
head(indices_summed)

##           TP        FP       FN       TN
## 10  6.752381  32.12381 295.3810 1433.743
## 20 13.190476  64.28571 288.9429 1401.581
## 30 19.552381  95.61905 282.5810 1370.248
## 40 25.380952 126.08571 276.7524 1339.781
## 50 30.780952 155.29524 271.3524 1310.571
## 60 35.676190 183.01905 266.4571 1282.848

ROC

plot(results, annotate = TRUE, main = "ROC Curve")

plot(results, "prec/rec", annotate = TRUE, main = "Precision/Recall")

From the above graphs, if a small percentage of rated movies is recommended, the precision decreases.

At the same time, the higher percentage of rated movies that are recommended, the higher the recall.

Model Comparisons

In order to compare different models, create a basline measure out of the following list:

Item-Based Collaborative Filtering - using the Cosine as the distance function
Item-based Collaborative Filtering - using the Pearson correlation as the distance function
User-based Collaborative Filtering - using the Cosine as the distance function
User-based Collaborative Filtering - using the Pearson correlation as the distance function
Random Recommendations

base_line <- list(
IBCF_cosine = list(name = "IBCF", 
                param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF", 
                param = list(method = "pearson")),
UBCF_cosine = list(name = "UBCF", 
                param = list(method = "cosine")),
UBCF_pearson = list(name = "UBCF", 
                param = list(method = "pearson")),
Random = list(name = "RANDOM", param=NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets, 
                         method = base_line, 
                         n = n_recommendations)

## IBCF run fold/sample [model time/prediction time]
##   1  [0.31sec/0.03sec] 
##   2  [0.25sec/0.04sec] 
##   3  [0.2sec/0.03sec] 
##   4  [0.22sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.29sec/0.01sec] 
##   2  [0.33sec/0.03sec] 
##   3  [0.33sec/0.08sec] 
##   4  [0.33sec/0.03sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.17sec] 
##   2  [0sec/0.15sec] 
##   3  [0sec/0.13sec] 
##   4  [0sec/0.14sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.16sec] 
##   2  [0sec/0.18sec] 
##   3  [0sec/0.15sec] 
##   4  [0sec/0.14sec] 
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec] 
##   2  [0sec/0.03sec] 
##   3  [0sec/0.04sec] 
##   4  [0sec/0.03sec]

sapply(results_list, class) == "evaluationResults"

##  IBCF_cosine IBCF_pearson  UBCF_cosine UBCF_pearson       Random 
##         TRUE         TRUE         TRUE         TRUE         TRUE

avg_matrices <- lapply(results_list, avg)
head(avg_matrices$IBCF_cos[, 5:8])

##    precision      recall         TPR         FPR
## 1  0.2052959 0.002714384 0.002714384 0.002080115
## 5  0.1794248 0.012236246 0.012236246 0.010943537
## 10 0.1738647 0.022505419 0.022505419 0.022026967
## 20 0.1706899 0.042833868 0.042833868 0.044138414
## 30 0.1705764 0.063677949 0.063677949 0.065537575
## 40 0.1685407 0.081774511 0.081774511 0.086342530

Plot - Best Fit Model

ROC curves & Precision/Recall Curves.

plot(results_list, annotate = 1, legend = "topleft") 
title("ROC Curve")

plot(results_list, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-Recall")

The UBCF with cosine distance performs the best out of all the models.

Optimiziation

IBCF takes in consideration the k-nearest neighbor.

Tuning parameters ranging between 5 and 40.

k <- c(5, 10, 20, 30, 40)
base_line <- lapply(k, function(k){
  list(name = "IBCF",
       param = list(method = "cosine", k = k))
})
names(base_line) <- paste0("IBCF_k_", k)

Construct IBCF/Cosine Models with different values of the k-nearest neighbor.

n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets, 
                         method = base_line, 
                         n = n_recommendations)

## IBCF run fold/sample [model time/prediction time]
##   1  [0.22sec/0.01sec] 
##   2  [0.25sec/0.03sec] 
##   3  [0.22sec/0.02sec] 
##   4  [0.22sec/0.03sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.23sec/0.02sec] 
##   2  [0.22sec/0.02sec] 
##   3  [0.22sec/0.02sec] 
##   4  [0.24sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.2sec/0.04sec] 
##   2  [0.23sec/0.05sec] 
##   3  [0.24sec/0.03sec] 
##   4  [0.22sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.22sec/0.02sec] 
##   2  [0.23sec/0.04sec] 
##   3  [0.21sec/0.04sec] 
##   4  [0.25sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.22sec/0.03sec] 
##   2  [0.23sec/0.02sec] 
##   3  [0.22sec/0.03sec] 
##   4  [0.24sec/0.01sec]

plot(results_list, annotate = 1, legend = "topleft") 
title("ROC Curve")

plot(results_list, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-Recall")

The ROC Curve’s Plot shows k having the biggest \(AUC = 10\).

In additon. another good candidate is \(5\), because it can never have a high TPR.

Furthermore, the IBCF with k = 5 only recommends a few movies similar to the ratings.

Therefore, it cannot be the model used for recommendations.

Based on the Precision/Recall Plot, \(k\) should equal \(10\) to achieve the highest recall.

In that event, if we are more interested in the Precision, we set \(k\) to \(5\).

Conslusion

User-Based Collaborative Filtering

Strengths

User-Based Collaborative Filtering provides recommendations that are complimentary to the item the user are observing.
This prvides stronger Recommendations than an Item-Based Recommender.

Weaknesses

User-based Collaborative Filtering is a memeory intensive Collaborative Filtering that uses all user data in the database to create Recommendations.
However, comparing the pairwise correlation of every user in your dataset is not scalable. The more users the more time it would take to compute the Recommendations.

Shiny Application

For the project, the developed and evaluated a Collaborative Filtering Recommender System was implemented for the Recommendation of Movies.

The deployment of a Shiny R Application demonstrates the User-Based Collaborative Filtering Approach for the Recommendation Model.

Implementation

What2Watch

Appendix

Source Code

Data 612 Final Project

What2Watch - Recommendation System for Movies

Joseph Simone

May 19, 2020

Overview

The Data

Data Processing

Extract a list of genres

Matrix of Movies and their Genres

realRatingMatrix Creation

Recommendation Models

Exploration of Similarity Data

Data Exploration Continued

Distribution of Ratings

Number of Views ~ Top Movies

Distribution of the Average Ratings

Data Preparation

Part 1 - Revevant Data

Part 2 - Normalization

Part 3 - Convert Data to Binary

ITEM-BASED Collaborative Filtering Model

Train & Test Sets

Build Model

Exploring Model

IBCF Recommendation System Implementation

USER-BASED Collaborative Filtering Model

Build Model

UBCF Recommendation System Implementation

Explore results

Frequency Histogram

Top Movie Titles

Evaluatiion of the Recommender Systems

Data Preparation for the data to Evaluate Models

Splitting Data

Bootstrapping

K-Fold Cross-Validation Approach

Evaluation of Ratings

Compute Accuracy

Evaluation of Recommendations

ROC

Model Comparisons

Plot - Best Fit Model

Optimiziation

Conslusion

Shiny Application

Appendix