In this project, I develop a collaborative filtering recommender (CFR) system for recommending movies.
The collaborative filtering approach considers only user preferences and does not take into account the features or contents of the items (books or movies) being recommended. In order to recommend movies, I will therefore use a large set of user preferences from a publicly available movie rating dataset.
Once the CFR is tuned, I will develop a Shiny app that implements the trained model.
The dataset comes from MovieLens and is publicly available at http://grouplens.org/datasets/movielens/latest. It contains 105339 ratings and 6138 tag applications across 10329 movies, created by 668 users. The data are contained in four files: links.csv, movies.csv, ratings.csv and tags.csv. I only use movies.csv and ratings.csv to build the recommendation system.
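The original report does not show the package loading step; based on the functions used below (tstrsplit, dcast, qplot, and the realRatingMatrix class), something like the following set of packages is assumed:
library(recommenderlab) # realRatingMatrix, Recommender, evaluationScheme, similarity, ...
library(ggplot2)        # qplot, ggplot
library(data.table)     # tstrsplit
library(reshape2)       # dcast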
movies <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/final/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/final/ratings.csv")
A summary of movies is given below:
summary(movies)
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
A summary of ratings:
summary(ratings)
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
head(ratings)
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
Both userId and movieId are stored as integers and should be changed to factors. The movie genres are not easily usable in their pipe-delimited format.
Some pre-processing of the available data is therefore required before creating the recommendation system.
First, I will reorganize the movie genre information into a long format so that users can later search for the movies they like within specific genres.
I use a one-hot encoding to create a matrix of corresponding genres for each movie.
genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]',
type.convert=TRUE),
stringsAsFactors=FALSE)
colnames(genres2) <- c(1:10)
genre_list <- c("Action", "Adventure", "Animation", "Children",
"Comedy", "Crime","Documentary", "Drama", "Fantasy",
"Film-Noir", "Horror", "Musical", "Mystery","Romance",
"Sci-Fi", "Thriller", "War", "Western") # we have 18 genres in total
genre_matrix <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
genre_matrix[1,] <- genre_list #set first row to genre list
colnames(genre_matrix) <- genre_list #set column names to genre list
#iterate through matrix
for (i in 1:nrow(genres2)) {
for (c in 1:ncol(genres2)) {
genmat_col = which(genre_matrix[1,] == genres2[i,c])
genre_matrix[i+1,genmat_col] <- 1
}
}
#convert into dataframe
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(genre_matrix2)) {
genre_matrix2[,c] <- as.integer(genre_matrix2[,c]) #convert from characters to integers
}
head(genre_matrix2)
## Action Adventure Animation Children Comedy Crime Documentary Drama
## 1 0 1 1 1 1 0 0 0
## 2 0 1 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0 0
## Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
## Western
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Now, I create a search matrix which allows an easy search of a movie by any of its genres. Here the data really begin to grow in size and sparsity.
search_matrix <- cbind(movies[,1:2], genre_matrix2)
head(search_matrix)
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
We can see that each movie can belong to one or more genres.
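For example, a quick genre lookup might look like this (an illustrative sketch, not part of the original analysis):
# Illustrative: titles of the first few Comedy movies in the search matrix
head(search_matrix[search_matrix$Comedy == 1, "title"])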
In order to use the ratings data for building a recommendation engine with recommenderlab, I convert the rating matrix into a sparse matrix of type realRatingMatrix.
#Create ratings matrix. Rows = userId, Columns = movieId
ratingmat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingmat <- as.matrix(ratingmat[,-1]) #remove userIds
#Convert rating matrix into a recommenderlab sparse matrix
ratingmat <- as(ratingmat, "realRatingMatrix")
ratingmat
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
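As a rough sparsity check (a sketch; the exact figure is not reported in the original), the share of empty user-movie cells can be computed as:
# Illustrative: fraction of user-movie cells with no rating
1 - nratings(ratingmat) / (nrow(ratingmat) * ncol(ratingmat))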
The recommenderlab package contains some options for the recommendation algorithm:
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
I will use the IBCF and UBCF models and compare their performance. Let's first check the default parameters of these two models.
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
Collaborative filtering algorithms are based on measuring the similarity between users or between items. For this purpose, recommenderlab contains the similarity function. The supported methods to compute similarities are cosine, pearson, and jaccard.
Next, I determine how similar the first four users are to each other by creating and visualizing a similarity matrix based on the cosine method:
similarity_users <- similarity(ratingmat[1:4, ],
method = "cosine",
which = "users")
as.matrix(similarity_users)
## 1 2 3 4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000
image(as.matrix(similarity_users), main = "User similarity")
In the given matrix, each row and each column corresponds to a user, and each off-diagonal cell gives the cosine similarity between two users; all four users appear quite similar to one another (values above 0.93). The diagonal is shown as 0 only because as.matrix() does not fill in the self-similarities, which is why it stands out in the heatmap.
Using the same approach, I compute similarity between the first four movies.
similarity_items <- similarity(ratingmat[, 1:4], method =
"cosine", which = "items")
as.matrix(similarity_items)
## 1 2 3 4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000
image(as.matrix(similarity_items), main = "Movies similarity")
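For comparison, the same similarities could also be computed with the Pearson method, which recommenderlab supports as well (a sketch, not run in the original):
# Illustrative: user similarity using Pearson correlation instead of cosine
similarity_users_pearson <- similarity(ratingmat[1:4, ],
                                       method = "pearson",
                                       which = "users")
as.matrix(similarity_users_pearson)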
Now, I explore values of ratings.
vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
table_ratings <- table(vector_ratings) # what is the count of each rating value
table_ratings
## vector_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4
## 6791761 1198 3258 1567 7943 5484 21729 12237 28880
## 4.5 5
## 8187 14856
There are 11 unique values: the half-point scale from 0.5 to 5, plus 0.
According to the documentation, a rating equal to 0 represents a missing value, so I remove them from the dataset before visualizing the results.
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) +
ggtitle("Distribution of the ratings")
As we can see, low scores (below 3) are relatively rare; the majority of ratings are 3 or higher, and the most common rating is 4.
Now, let’s see what are the most viewed movies.
views_per_movie <- colCounts(ratingmat) # count views for each movie
table_views <- data.frame(movie = names(views_per_movie),
views = views_per_movie) # create dataframe of views
table_views <- table_views[order(table_views$views,
decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
for (i in 1:10325){
table_views[i,3] <- as.character(subset(movies,
movies$movieId == table_views[i,1])$title)
}
table_views[1:6,]
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:6, ], aes(x = title, y = views)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of views of the top movies")
We see that “Pulp Fiction (1994)” is the most viewed movie, exceeding the second-most-viewed “Forrest Gump (1994)” by 14 views.
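As a side note, the row-by-row subset() loop used above to attach titles is slow for a catalog of this size; a vectorized lookup (a sketch that produces the same column) would be more idiomatic:
# Illustrative alternative: vectorized title lookup instead of the for loop
table_views$title <- movies$title[match(as.character(table_views$movie),
                                        as.character(movies$movieId))]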
Now I identify the top-rated movies by computing the average rating of each of them.
average_ratings <- colMeans(ratingmat)
qplot(average_ratings) +
stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average movie rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
average_ratings_relevant <- average_ratings[views_per_movie > 50]
qplot(average_ratings_relevant) +
stat_bin(binwidth = 0.1) +
ggtitle(paste("Distribution of the relevant average ratings"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The first image above shows the distribution of the average movie rating. The highest value is around 3, and there are a few movies whose rating is either 1 or 5. Probably, the reason is that these movies received a rating from a few people only, so we shouldn’t take them into account.
I remove the movies whose number of views is below a defined threshold of 50, creating a subset of only relevant movies. The second image above shows the distribution of these relevant average ratings. All the ratings now lie between 2.16 and 4.45; as expected, the extremes were removed, and the peak of the distribution shifts to around 4.
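As an illustration (not shown in the original run), the titles behind the right tail of the second distribution could be listed directly:
# Illustrative: top-5 average ratings among movies with more than 50 views
top_avg <- sort(average_ratings_relevant, decreasing = TRUE)[1:5]
movies$title[match(names(top_avg), as.character(movies$movieId))]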
I visualize the whole matrix of ratings by building a heat map whose colors represent the ratings. Each row of the matrix corresponds to a user, each column to a movie, and each cell to its rating.
image(ratingmat, main = "Heatmap of the rating matrix") # hard to read-too many dimensions
image(ratingmat[1:20, 1:25], main = "Heatmap of the first 20 rows and 25 columns")
Since there are too many users and items, the first chart is hard to read. The second chart is built zooming in on the first rows and columns.
Some users have seen more movies than others. So, instead of displaying some random users and items, I select only the most relevant ones: the users who have seen many movies and the movies that have been seen by many users. To identify and select them, I follow these steps:
min_n_movies <- quantile(rowCounts(ratingmat), 0.99)
min_n_users <- quantile(colCounts(ratingmat), 0.99)
print("Minimum number of movies per user:")
## [1] "Minimum number of movies per user:"
min_n_movies
## 99%
## 1198.17
print("Minimum number of users per movie:")
## [1] "Minimum number of users per movie:"
min_n_users
## 99%
## 115
image(ratingmat[rowCounts(ratingmat) > min_n_movies,
colCounts(ratingmat) > min_n_users],
main = "Heatmap of the top users and movies")
Let's focus on the users who have watched the most movies. Most of them have seen all the top movies, which is not surprising. Some columns of the heatmap are darker than others, meaning that these columns represent the highest-rated movies. Conversely, darker rows represent users giving higher ratings. Because of this, it might be useful to normalize the data, which I will do in the following steps.
The data preparation process consists of three steps: selecting the relevant data, normalizing the data, and binarizing the data.
In order to select the most relevant data, I define the minimum number of users per rated movie as 50 and the minimum views number per movie as 50:
ratings_movies <- ratingmat[rowCounts(ratingmat) > 50,
colCounts(ratingmat) > 50]
ratings_movies
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
#ratingmat
This selection of the most relevant data contains 420 users and 447 movies, compared to the 668 users and 10325 movies in the full dataset.
Using the same approach as previously, I visualize the top 2 percent of users and movies in the new matrix of the most relevant data:
min_movies <- quantile(rowCounts(ratings_movies), 0.98)
min_users <- quantile(colCounts(ratings_movies), 0.98)
image(ratings_movies[rowCounts(ratings_movies) > min_movies,
colCounts(ratings_movies) > min_users],
main = "Heatmap of the top users and movies")
average_ratings_per_user <- rowMeans(ratings_movies)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the heatmap, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. The distribution of the average rating per user across all the users varies a lot, as the second chart above shows.
Having users who give high (or low) ratings to all their movies might bias the results. In order to remove this effect, I normalize the data in such a way that the average rating of each user is 0. As a quick check, I calculate the average rating by users, and it is equal to 0, as expected:
ratings_movies_norm <- normalize(ratings_movies)
sum(rowMeans(ratings_movies_norm) > 0.00001)
## [1] 0
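The check above only looks for positive deviations from zero; a symmetric version of the check (a small sketch) would be:
# Illustrative: also catch negative deviations from a zero row mean
sum(abs(rowMeans(ratings_movies_norm)) > 0.00001)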
Now, I visualize the normalized matrix for the top users and movies. The colors now span negative and positive values because the data are continuous and centered on 0:
image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies,
colCounts(ratings_movies_norm) > min_users],
main = "Heatmap of the top users and movies")
There are still some lines that seem to be more blue or more red. The reason is that I am visualizing only the top movies. I have already checked that the average rating is 0 for each user.
Some recommendation models work on binary data, so it might be useful to binarize the data, that is, define a table containing only 0s and 1s. The 0s will be either treated as missing values or as bad ratings.
In our case, I can either define a matrix with 1 if the user rated the movie and 0 otherwise, or define a matrix with 1 if the rating is at least 3 and 0 otherwise.
Depending on the context, one choice may be more appropriate than the other.
As a next step, I define two matrices following these two approaches and visualize the top 5 percent of users and movies of each binarized matrix.
ratings_movies_watched <- binarize(ratings_movies, minRating = 1)
min_movies_binary <- quantile(rowCounts(ratings_movies), 0.95)
min_users_binary <- quantile(colCounts(ratings_movies), 0.95)
image(ratings_movies_watched[rowCounts(ratings_movies) > min_movies_binary,
colCounts(ratings_movies) > min_users_binary],
main = "Heatmap of the top users and movies")
ratings_movies_good <- binarize(ratings_movies, minRating = 3)
image(ratings_movies_good[rowCounts(ratings_movies) > min_movies_binary,
colCounts(ratings_movies) > min_users_binary],
main = "Heatmap of the top users and movies")
There are more white cells in the second heatmap: a movie a user watched but rated below 3 is white there, whereas in the first heatmap only unwatched movies are white.
Collaborative filtering is a branch of recommendation systems that takes into account information about many users.
The starting point is a rating matrix in which rows correspond to users and columns correspond to items. The core IBCF algorithm is based on these steps: for every pair of items, measure how similar they are in terms of having received similar ratings from similar users; for each item, identify its k most similar items; and, for each user, identify the items most similar to the ones the user has rated.
I build the model using 80% of the whole dataset as a training set and the remaining 20% as a test set.
which_train <- sample(x = c(TRUE, FALSE),
size = nrow(ratings_movies),
replace = TRUE,
prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
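Note that this split is random; for a reproducible split one would fix the random seed before calling sample() (not done in the original run), e.g.:
# Illustrative: fixing the seed (any value works) before the sample() call above makes the split reproducible
set.seed(123)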
Let's have a look at the default parameters of the IBCF model. Here, k is the number of most similar items to keep for each item: after computing the similarities, the algorithm identifies, for each item, its k most similar items and stores only those. method is the similarity function, which is Cosine by default; Pearson is also available. I create the model using the default parameters method = Cosine and k = 30.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recc_model <- Recommender(data = recc_data_train,
method = "IBCF",
parameter = list(k = 30))
recc_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 334 users.
class(recc_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
Exploring the recommender model:
model_details <- getModel(recc_model)
class(model_details$sim) # this contains a similarity matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(model_details$sim)
## [1] 447 447
n_items_top <- 20
image(model_details$sim[1:n_items_top, 1:n_items_top],
main = "Heatmap of the first rows and columns")
row_sums <- rowSums(model_details$sim > 0)
table(row_sums)
## row_sums
## 30
## 447
col_sums <- colSums(model_details$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of the column count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The model stores its similarity matrix as a sparse dgCMatrix. Its dimensions are 447 x 447, equal to the number of items. The heatmap of the first 20 items shows that many values are equal to 0; the reason is that each row contains only the k (30) elements that are greater than 0. The number of non-null elements in each column depends on how many times the corresponding movie was included in the top k of another movie, so the matrix is not necessarily symmetric, and indeed it is not symmetric in our model.
The chart of the distribution of the number of elements by column shows there are a few movies that are similar to many others.
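As an illustration (a sketch, not part of the original output), the similarity matrix can also be queried directly, for instance to list the items most similar to the first movie in the model:
# Illustrative: the 5 items most similar to the first item in the IBCF model
sim_first <- model_details$sim[1, ]
head(sort(sim_first, decreasing = TRUE), 5)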
Now, it is possible to recommend movies to the users in the test set. I set n_recommended to 10, the number of movies to recommend to each user.
For each user, the algorithm extracts the movies that user has rated. For each of these movies, it looks up the similar items in the similarity matrix. It then ranks each candidate item by using the user's ratings of the related movies as weights and taking the weighted sum of the similarities.
Finally, the algorithm identifies the top 10 recommendations:
n_recommended <- 10 # the number of items to recommend to each user
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s explore the results of the recommendations for the first user:
recc_user_1 <- recc_predicted@items[[1]] # recommendation for the first user
movies_user_1 <- recc_predicted@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
movies_user_2[i] <- as.character(subset(movies,
movies$movieId == movies_user_1[i])$title)
}
movies_user_2
## [1] "Toy Story (1995)"
## [2] "Grumpier Old Men (1995)"
## [3] "Get Shorty (1995)"
## [4] "Seven (a.k.a. Se7en) (1995)"
## [5] "Crimson Tide (1995)"
## [6] "French Kiss (1995)"
## [7] "Little Women (1994)"
## [8] "Natural Born Killers (1994)"
## [9] "Clear and Present Danger (1994)"
## [10] "Four Weddings and a Funeral (1994)"
It’s also possible to define a matrix with the recommendations for each user. I visualize the recommendations for the first four users:
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) }) # matrix with the recommendations for each user
#dim(recc_matrix)
recc_matrix[,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 1 54286 47 150
## [2,] 3 2542 70 163
## [3,] 21 2424 145 552
## [4,] 47 8665 165 923
## [5,] 161 3471 288 1036
## [6,] 236 1275 329 1089
## [7,] 261 3703 745 1221
## [8,] 288 784 919 1263
## [9,] 349 3421 1047 1544
## [10,] 357 594 1204 1721
Here, the columns represent the first 4 users, and the rows contain the movieId values of the 10 recommended movies.
Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
## Movie title No of items
## 19 Ace Ventura: When Nature Calls (1995) 13
## 5 Father of the Bride Part II (1995) 12
## 3 Grumpier Old Men (1995) 11
## 7 Sabrina (1995) 11
Most of the movies have been recommended only a few times, and a few movies have been recommended more than 5 times.
IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it’s built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data.
In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
Now, I will use the user-based approach. According to this approach, given a new user, the most similar users are identified first; then, the top-rated items among those rated by the similar users are recommended.
For each new user, the steps are: measure how similar each existing user is to the new user; identify the most similar users, either as the nn nearest neighbors or as all users above a similarity threshold; rate the movies rated by these similar users, for example by averaging (or weight-averaging) their ratings; and pick the top-rated movies as recommendations.
Again, let's first check the default parameters of the UBCF model. Here, nn is the number of similar users (nearest neighbors) to consider, and method is the similarity function, which is cosine by default. I build a recommender model leaving the parameters at their defaults and using the training set.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
recc_model <- Recommender(data = recc_data_train, method = "UBCF")
recc_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 334 users.
model_details <- getModel(recc_model)
#names(model_details)
model_details$data
## 334 x 447 rating matrix of class 'realRatingMatrix' with 30549 ratings.
## Normalized using center on rows.
In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.
n_recommended <- 10
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s take a look at the first four users:
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) })
#dim(recc_matrix)
recc_matrix[, 1:4]
## [,1] [,2] [,3] [,4]
## [1,] 356 318 296 356
## [2,] 589 50 356 457
## [3,] 62 1183 318 587
## [4,] 590 141 260 11
## [5,] 50 2268 50 316
## [6,] 377 265 593 1230
## [7,] 5010 1704 47 1641
## [8,] 3578 497 34 6
## [9,] 5418 17 457 480
## [10,] 2571 1193 1193 141
The above matrix contains the movieId of each recommended movie (rows) for the first four users (columns) in our test set.
I also calculate how many times each movie got recommended and build the related frequency histogram:
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)
Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others. The maximum is more than 30, compared to 10-ish for IBCF.
Let’s take a look at the top titles:
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)), number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
## Movie title No of items
## 318 Shawshank Redemption, The (1994) 33
## 858 Godfather, The (1972) 31
## 2571 Matrix, The (1999) 28
## 50 Usual Suspects, The (1995) 25
Comparing the results of UBCF and IBCF gives some useful insight into the two algorithms. UBCF needs to access the initial data at prediction time: since it keeps the entire rating database in memory, it does not work well with a big rating matrix, and computing the similarities to every user requires a lot of computing power and time.
However, UBCF tends to be slightly more accurate than IBCF (as the evaluation in the next section also suggests), so it's a good option if the dataset is not too big.
There are a few options to choose from when building a recommendation engine. In order to compare their performance and choose the most appropriate model, I follow these steps: prepare the data for evaluation, evaluate the performance of several models, choose the best-performing model, and optimize its parameters.
We need training and test data to evaluate the models. There are several ways to create them: 1) splitting the data into training and test sets, 2) bootstrapping, and 3) k-fold cross-validation.
Splitting the data into training and test sets is often done using an 80/20 proportion.
percentage_training <- 0.8
For each user in the test set, we need to define how many items to use to generate recommendations. For this, I first check the minimum number of items rated by users to be sure there will be no users with no items to test.
min(rowCounts(ratings_movies))
## [1] 8
items_to_keep <- 5 #number of items to generate recommendations
rating_threshold <- 3 # threshold with the minimum rating that is considered good
n_eval <- 1 #number of times to run evaluation
eval_sets <- evaluationScheme(data = ratings_movies,
method = "split",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
eval_sets
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
getData(eval_sets, "train") # training set
## 336 x 447 rating matrix of class 'realRatingMatrix' with 30858 ratings.
getData(eval_sets, "known") # set with the items used to build the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 420 ratings.
getData(eval_sets, "unknown") # set with the items used to test the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 7063 ratings.
qplot(rowCounts(getData(eval_sets, "unknown"))) +
geom_histogram(binwidth = 10) +
ggtitle("unknown items by the users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above image displays the number of unknown items per user, which varies a lot.
Bootstrapping is another approach to splitting the data: the same user can be sampled more than once, so if the training set is the same size as before, there will be more users in the test set.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "bootstrap",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) +
ggtitle("Number of repetitions in the training set")
The above chart shows that most of the users have been sampled fewer than four times.
The k-fold cross-validation approach is the most accurate one, although it’s computationally heavier.
Using this approach, we split the data into some chunks, take a chunk out as the test set, and evaluate the accuracy. Then, we can do the same with each other chunk and compute the average accuracy.
n_fold <- 4
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets
## [1] 315 315 315 315
Using the 4-fold approach, we get four training sets of the same size, 315 users each.
I use the k-fold approach for evaluation.
First, I re-define the evaluation sets, build the IBCF model, and create a matrix with predicted ratings.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data = getData(eval_sets, "train"),
method = model_to_evaluate,
parameter = model_parameters)
items_to_recommend <- 10
eval_prediction <- predict(object = eval_recommender,
newdata = getData(eval_sets, "known"),
n = items_to_recommend,
type = "ratings")
qplot(rowCounts(eval_prediction)) +
geom_histogram(binwidth = 10) +
ggtitle("Distribution of movies per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above image displays the distribution of movies per user in the matrix of predicted ratings.
Now, I compute the accuracy measures for each user. Most of the RMSEs (Root mean square errors) are in the range of 0.5 to 1.8:
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = TRUE)
head(eval_accuracy)
## RMSE MSE MAE
## [1,] 1.0000000 1.000000 0.500000
## [2,] 1.0000000 1.000000 1.000000
## [3,] 0.7071068 0.500000 0.400000
## [4,] 0.3872983 0.150000 0.300000
## [5,] 1.5813559 2.500687 1.265576
## [6,] 0.5994789 0.359375 0.406250
qplot(eval_accuracy[, "RMSE"]) +
geom_histogram(binwidth = 0.1) +
ggtitle("Distribution of the RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (stat_bin).
## Warning: Removed 5 rows containing non-finite values (stat_bin).
In order to have a performance index for the whole model, I specify byUser as FALSE and compute the average indices:
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = FALSE)
eval_accuracy
## RMSE MSE MAE
## 1.3017464 1.6945436 0.9429728
The measures of accuracy are useful to compare the performance of different models on the same data.
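For a direct comparison, the same accuracy computation could be repeated with a UBCF recommender trained on the same evaluation scheme (a sketch; the resulting numbers will vary by run and are not part of the original output):
# Illustrative: overall RMSE/MSE/MAE for UBCF on the same evaluation sets
ubcf_recommender <- Recommender(data = getData(eval_sets, "train"),
                                method = "UBCF")
ubcf_prediction <- predict(object = ubcf_recommender,
                           newdata = getData(eval_sets, "known"),
                           n = items_to_recommend,
                           type = "ratings")
calcPredictionAccuracy(x = ubcf_prediction,
                       data = getData(eval_sets, "unknown"),
                       byUser = FALSE)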
Another way to measure accuracy is to compare the recommendations with the items that actually received a positive rating. For this, I use the prebuilt evaluate() function of the recommenderlab library. The function evaluates the recommender's performance for different numbers n of items recommended to each user; I use the sequence n = seq(10, 100, 10). The first rows of the resulting performance matrix are presented below:
results <- evaluate(x = eval_sets,
method = model_to_evaluate,
n = seq(10, 100, 10))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.64sec/0.03sec]
## 2 [0.56sec/0.04sec]
## 3 [0.67sec/0.03sec]
## 4 [1.18sec/0.04sec]
head(getConfusionMatrix(results)[[1]])
## TP FP FN TN precision recall TPR
## 10 1.628571 8.142857 71.91429 360.3143 0.1679612 0.02247695 0.02247695
## 20 3.352381 16.133333 70.19048 352.3238 0.1742718 0.04660752 0.04660752
## 30 4.752381 24.219048 68.79048 344.2381 0.1671920 0.06631919 0.06631919
## 40 6.057143 31.933333 67.48571 336.5238 0.1629849 0.08352235 0.08352235
## 50 7.285714 39.219048 66.25714 329.2381 0.1598182 0.09892825 0.09892825
## 60 8.657143 45.685714 64.88571 322.7714 0.1620503 0.11759670 0.11759670
## FPR
## 10 0.02215015
## 20 0.04382082
## 30 0.06584012
## 40 0.08687656
## 50 0.10690669
## 60 0.12465163
In order to look at all the folds at the same time, I sum up the TP, FP, FN and TN columns of the confusion matrices:
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
head(indices_summed)
## TP FP FN TN
## 10 6.380952 32.88571 295.8667 1432.867
## 20 12.971429 65.27619 289.2762 1400.476
## 30 18.866667 97.49524 283.3810 1368.257
## 40 24.895238 128.52381 277.3524 1337.229
## 50 30.295238 158.63810 271.9524 1307.114
## 60 35.542857 186.17143 266.7048 1279.581
Finally, I plot the ROC and the precision/recall curves:
plot(results, annotate = TRUE, main = "ROC curve")
plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")
As the number of recommended movies grows, recall increases (a larger share of the movies the user liked is recovered), while precision slowly decreases.
In order to compare different models, I define them in the following list:
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF",
param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF",
param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF",
param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF",
param = list(method = "pearson")),
random = list(name = "RANDOM", param=NULL)
)
Then, I define a set of values for the number of recommended movies (n_recommendations <- c(1, 5, seq(10, 100, 10))), and run and evaluate the models:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.75sec/0.03sec]
## 2 [0.68sec/0.04sec]
## 3 [0.63sec/0.03sec]
## 4 [0.63sec/0.05sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.8sec/0.05sec]
## 2 [0.8sec/0.05sec]
## 3 [0.81sec/0.03sec]
## 4 [0.71sec/0.06sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.22sec]
## 2 [0.02sec/0.2sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.21sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.22sec]
## 2 [0.01sec/0.21sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.21sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.06sec]
## 2 [0sec/0.08sec]
## 3 [0.02sec/0.08sec]
## 4 [0sec/0.07sec]
sapply(list_results, class) == "evaluationResults"
## IBCF_cos IBCF_cor UBCF_cos UBCF_cor random
## TRUE TRUE TRUE TRUE TRUE
As an example, the following table presents the first rows of the averaged performance matrix for IBCF with cosine distance:
avg_matrices <- lapply(list_results, avg)
head(avg_matrices$IBCF_cos[, 5:8])
## precision recall TPR FPR
## 1 0.1767177 0.002480753 0.002480753 0.002253165
## 5 0.1704724 0.010838140 0.010838140 0.011158374
## 10 0.1628164 0.020968060 0.020968060 0.022549614
## 20 0.1663171 0.041658531 0.041658531 0.044589703
## 30 0.1635622 0.060454829 0.060454829 0.066570514
## 40 0.1642683 0.080014074 0.080014074 0.087718032
I compare the models by building a chart displaying their ROC curves and Precision/recall curves.
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
A good performance index is the area under the ROC curve (AUC). Even without computing it explicitly, the chart shows that UBCF with cosine distance has the largest AUC, so it is the best-performing technique.
UBCF with cosine distance is also the top model on the precision/recall chart. Depending on the main purpose of the system, an appropriate number of items to recommend should then be chosen.
IBCF takes account of the k-closest items. I will explore more values, ranging between 5 and 40, in order to tune this parameter:
vector_k <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF",
param = list(method = "cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
Now I build and evaluate the same IBCF/cosine models with different values of the k-closest items:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.77sec/0.03sec]
## 2 [0.67sec/0.03sec]
## 3 [0.71sec/0.03sec]
## 4 [0.59sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.69sec/0.04sec]
## 2 [0.64sec/0.03sec]
## 3 [0.67sec/0.04sec]
## 4 [0.53sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.63sec/0.05sec]
## 2 [0.61sec/0.03sec]
## 3 [0.72sec/0.03sec]
## 4 [0.67sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.66sec/0.03sec]
## 2 [0.56sec/0.04sec]
## 3 [0.78sec/0.05sec]
## 4 [0.67sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.69sec/0.05sec]
## 2 [0.68sec/0.04sec]
## 3 [0.64sec/0.05sec]
## 4 [0.58sec/0.03sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
Based on the ROC curve’s plot, the k having the biggest AUC is 10. Another good candidate is 5, but it can never have a high TPR. This means that, even if we set a very high n value, the algorithm won’t be able to recommend a big percentage of items that the user liked. The IBCF with k = 5 recommends only a few items similar to the purchases. Therefore, it can’t be used to recommend many items.
Based on the precision/recall plot, k should be set to 10 to achieve the highest recall. If we are more interested in the precision, we set k to 5.
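If IBCF were the chosen model, the tuned value of k could then be used to refit it on the full set of relevant ratings (an illustrative sketch, not part of the original analysis):
# Illustrative: refit IBCF with the tuned neighborhood size
ibcf_tuned <- Recommender(data = ratings_movies,
                          method = "IBCF",
                          parameter = list(method = "cosine", k = 10))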
To see the implementation of the model visit the following URL:
[Recommovie app](https://sortega78.shinyapps.io/final/)
In this project, I have developed and evaluated a collaborative filtering recommender (CFR) system for recommending movies. The online app was created to demonstrate the user-based collaborative filtering recommendation model.
Let’s discuss the strengths and weaknesses of the User-based Collaborative Filtering approach in general.
Strengths: User-based Collaborative Filtering gives recommendations that can complement the item the user was interacting with. These may be stronger recommendations than an item-based recommender can provide, since users might not be looking for direct substitutes for a movie they have just viewed or previously watched.
Weaknesses: User-based Collaborative Filtering is a type of memory-based collaborative filtering that uses all user data in the database to create recommendations. Computing the pairwise correlation of every user in the dataset is not scalable; with millions of users, this computation would be very time-consuming. Possible ways around this are to apply some form of dimensionality reduction, such as Principal Component Analysis, or to use a model-based algorithm instead. Also, user-based collaborative filtering relies on past user choices to make future recommendations. The implication is that it assumes a user's tastes and preferences remain more or less constant over time, which might not be true, and this also makes it difficult to pre-compute user similarities offline.
The point of being so prescriptive in this final project was to demonstrate that, despite all the ready-made routines, there are a number of parameters to take care of and evaluate; even when you think your model is optimal, tuning remains a constant need.