Introduction

The purpose of this project was to build a recommender system and produce quality recommendations by extracting insights from a large dataset. In this project, I developed a collaborative filtering recommender system for movies. To recommend movies, I use a large set of user preferences drawn from a publicly available movie rating dataset.

Dataset

The data was collected through the MovieLens web site (movielens.umn.edu). This dataset contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users.
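The analysis below relies on several R packages that were presumably loaded in a setup chunk not shown here; a minimal sketch of the assumed setup (the package choices are my assumption, based on the functions used throughout):

library(recommenderlab) # realRatingMatrix, Recommender(), evaluate()
library(data.table)     # tstrsplit()
library(reshape2)       # dcast()
library(ggplot2)        # ggplot(), qplot()
library(knitr)          # kable()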

movies <- read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/ratings.csv")
summary(movies)
##     movieId          title              genres         
##  Min.   :     1   Length:10329       Length:10329      
##  1st Qu.:  3240   Class :character   Class :character  
##  Median :  7088   Mode  :character   Mode  :character  
##  Mean   : 31924                                        
##  3rd Qu.: 59900                                        
##  Max.   :149532
kable(head(movies))
movieId title genres
1 Toy Story (1995) Adventure|Animation|Children|Comedy|Fantasy
2 Jumanji (1995) Adventure|Children|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama|Romance
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller
summary(ratings)
##      userId         movieId           rating        timestamp        
##  Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.286e+08  
##  1st Qu.:192.0   1st Qu.:  1073   1st Qu.:3.000   1st Qu.:9.711e+08  
##  Median :383.0   Median :  2497   Median :3.500   Median :1.115e+09  
##  Mean   :364.9   Mean   : 13381   Mean   :3.517   Mean   :1.130e+09  
##  3rd Qu.:557.0   3rd Qu.:  5991   3rd Qu.:4.000   3rd Qu.:1.275e+09  
##  Max.   :668.0   Max.   :149532   Max.   :5.000   Max.   :1.452e+09
kable(head(ratings))
userId movieId rating timestamp
1 16 4.0 1217897793
1 24 1.5 1217895807
1 32 4.0 1217896246
1 47 4.0 1217896556
1 50 4.0 1217896523
1 110 4.0 1217896150

Data Wrangling

Let’s define a list of genres and reorganize the movie genres so that future users can search for the movies they like within specific genres.

genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)

genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]', 
                                   type.convert=TRUE), 
                         stringsAsFactors=FALSE)
colnames(genres2) <- c(1:10) # the longest genre string in this dataset splits into 10 parts

genre_list <- c("Action", "Adventure", "Animation", "Children", 
                "Comedy", "Crime","Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery","Romance",
                "Sci-Fi", "Thriller", "War", "Western") 

genre_matrix <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
genre_matrix[1,] <- genre_list #set first row to genre list
colnames(genre_matrix) <- genre_list #set column names to genre list

#iterate through the split genres and flag each matching genre column with 1 (NA entries match nothing and are skipped)
for (i in 1:nrow(genres2)) {
  for (c in 1:ncol(genres2)) {
    genmat_col = which(genre_matrix[1,] == genres2[i,c])
    genre_matrix[i+1,genmat_col] <- 1
  }
}

#convert into dataframe
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(genre_matrix2)) {
  genre_matrix2[,c] <- as.integer(genre_matrix2[,c])  #convert from characters to integers
} 

head(genre_matrix2)
##   Action Adventure Animation Children Comedy Crime Documentary Drama
## 1      0         1         1        1      1     0           0     0
## 2      0         1         0        1      0     0           0     0
## 3      0         0         0        0      1     0           0     0
## 4      0         0         0        0      1     0           0     1
## 5      0         0         0        0      1     0           0     0
## 6      1         0         0        0      0     1           0     0
##   Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1       1         0      0       0       0       0      0        0   0
## 2       1         0      0       0       0       0      0        0   0
## 3       0         0      0       0       0       1      0        0   0
## 4       0         0      0       0       0       1      0        0   0
## 5       0         0      0       0       0       0      0        0   0
## 6       0         0      0       0       0       0      0        1   0
##   Western
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0

Let’s create a search matrix that allows an easy search of a movie by any of its genres (a quick usage example follows the output below).

search_matrix <- cbind(movies[,1:2], genre_matrix2)
head(search_matrix)
##   movieId                              title Action Adventure Animation
## 1       1                   Toy Story (1995)      0         1         1
## 2       2                     Jumanji (1995)      0         1         0
## 3       3            Grumpier Old Men (1995)      0         0         0
## 4       4           Waiting to Exhale (1995)      0         0         0
## 5       5 Father of the Bride Part II (1995)      0         0         0
## 6       6                        Heat (1995)      1         0         0
##   Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1        1      1     0           0     0       1         0      0       0
## 2        1      0     0           0     0       1         0      0       0
## 3        0      1     0           0     0       0         0      0       0
## 4        0      1     0           0     1       0         0      0       0
## 5        0      1     0           0     0       0         0      0       0
## 6        0      0     1           0     0       0         0      0       0
##   Mystery Romance Sci-Fi Thriller War Western
## 1       0       0      0        0   0       0
## 2       0       0      0        0   0       0
## 3       0       1      0        0   0       0
## 4       0       1      0        0   0       0
## 5       0       0      0        0   0       0
## 6       0       0      0        1   0       0
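As a quick illustration of how the search matrix can be used (the genre combination below is arbitrary), the following sketch pulls out a few animated comedies:

# example lookup: titles flagged as both Animation and Comedy
head(subset(search_matrix, Animation == 1 & Comedy == 1)$title)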

In order to use the ratings data for building a recommendation engine with recommenderlab, I convert the rating matrix into a sparse realRatingMatrix.

#Create ratings matrix. Rows = userId, Columns = movieId

Sparsemat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
Sparsemat <- as.matrix(Sparsemat[,-1]) #remove userIds

#Convert rating matrix into a recommenderlab sparse matrix
Sparsemat <- as(Sparsemat, "realRatingMatrix")
Sparsemat
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.

Exploring Parameters of Recommendation Models

#Let’s explore which recommender methods recommenderlab provides for real-valued rating matrices.
recommender_models <- recommenderRegistry$get_entries(dataType= "realRatingMatrix")
names(recommender_models)
##  [1] "HYBRID_realRatingMatrix"       "ALS_realRatingMatrix"         
##  [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"        
##  [5] "LIBMF_realRatingMatrix"        "POPULAR_realRatingMatrix"     
##  [7] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
##  [9] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [11] "UBCF_realRatingMatrix"

Exploring Similarity Data

Collaborative filtering algorithms are based on measuring the similarity between users or between items. For this purpose, I create a similarity matrix using the cosine distance:
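As a reminder of what the cosine measure computes, here is a small illustrative helper (a sketch only; it is not how recommenderlab is implemented internally) that returns the cosine similarity of two rating vectors over their co-rated items:

# cosine similarity of two numeric rating vectors, dropping NAs pairwise
cosine_sim <- function(a, b) {
  keep <- !is.na(a) & !is.na(b)                    # co-rated items only
  sum(a[keep] * b[keep]) /
    (sqrt(sum(a[keep]^2)) * sqrt(sum(b[keep]^2)))
}
cosine_sim(c(4, 3, NA, 5), c(5, NA, 2, 4))         # toy example, about 0.98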

similar_users <- similarity(Sparsemat[1:4, ], 
                               method = "cosine", 
                               which = "users")
as.matrix(similar_users)
##           1         2         3         4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000
image(as.matrix(similar_users), main = "User similarity")

Each row and each column corresponds to a user, and each cell corresponds to the similarity between two users. The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself.

# Similarity between the first four movies.
similarity_items <- similarity(Sparsemat[, 1:4], method =
                                 "cosine", which = "items")
as.matrix(similarity_items)
##           1         2         3         4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000
image(as.matrix(similarity_items), main = "Movies similarity")

Top movies

views_per_movie <- colCounts(Sparsemat) 

table_views <- data.frame(movie = names(views_per_movie),
                          views = views_per_movie) # create dataframe of views
table_views <- table_views[order(table_views$views, 
                                 decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
for (i in 1:10325){
  table_views[i,3] <- as.character(subset(movies, 
                                         movies$movieId == table_views[i,1])$title)
}

table_views[1:6,]
##     movie views                                     title
## 296   296   325                       Pulp Fiction (1994)
## 356   356   311                       Forrest Gump (1994)
## 318   318   308          Shawshank Redemption, The (1994)
## 480   480   294                      Jurassic Park (1993)
## 593   593   290          Silence of the Lambs, The (1991)
## 260   260   273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:10, ], aes(x = title, y = views)) +
  geom_bar(stat="identity") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
ggtitle("Number of views of the top movies")

As we can see, “Pulp Fiction (1994)” is the most viewed movie.

Data Preparation

To work with the most relevant data, I keep only users who have rated more than 50 movies and only movies that have been rated by more than 50 users.

movies_rat <- Sparsemat[rowCounts(Sparsemat) > 50,
                             colCounts(Sparsemat) > 50]
movies_rat
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.

This new selection contains 420 users and 447 movies, compared with 668 users and 10325 movies in the full dataset.
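Besides the dimensions, it is worth checking how much denser the filtered matrix is; a short sketch of the rating density before and after filtering, using functions already used above:

# share of user-movie cells that actually contain a rating
density_full     <- sum(rowCounts(Sparsemat))  / (nrow(Sparsemat)  * ncol(Sparsemat))
density_filtered <- sum(rowCounts(movies_rat)) / (nrow(movies_rat) * ncol(movies_rat))
c(full = density_full, filtered = density_filtered)
# roughly 1.5% of cells are rated in the full matrix vs about 20% after filtering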

Normalizing data

To avoid biases, I normalize the data so that the average rating of each user is 0.

movies_rat_norm <- normalize(movies_rat)
sum(rowMeans(movies_rat_norm) > 0.00001)
## [1] 0

Because the normalized ratings are continuous, the heatmap of the top users and movies shows a full range of colors:

min_movies <- quantile(rowCounts(movies_rat), 0.98)
min_users <- quantile(colCounts(movies_rat), 0.98)
image(movies_rat_norm[rowCounts(movies_rat_norm) > min_movies,
                          colCounts(movies_rat_norm) > min_users], 
main = "Heatmap of the top users and movies")

ITEM-based Collaborative Filtering Model

Splitting train and test sets

I build the model using 80% of the whole dataset as a training set and the remaining 20% as a test set.

which_train <- sample(x = c(TRUE, FALSE), 
                      size = nrow(movies_rat),
                      replace = TRUE, 
                      prob = c(0.8, 0.2))

movie_train <- movies_rat[which_train, ]
movie_test <- movies_rat[!which_train, ]
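Because which_train is drawn at random, the exact split changes from run to run; a minimal sketch of making it reproducible (the seed value is arbitrary):

set.seed(101) # any fixed seed makes the 80/20 split reproducible
which_train <- sample(x = c(TRUE, FALSE),
                      size = nrow(movies_rat),
                      replace = TRUE,
                      prob = c(0.8, 0.2))
table(which_train) # roughly 20%/80% of the 420 users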

Building the recommendation model

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE
IBCF_model <- Recommender(data = movie_train, 
                          method = "IBCF",
                          parameter = list(k = 30))

IBCF_model
## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 326 users.
class(IBCF_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"

Implementation of IBCF Model

n_recommended <- 10 # the number of items to recommend to each user

ibcf_preds <- predict(object = IBCF_model, 
                          newdata = movie_test, 
                          n = n_recommended)
ibcf_preds
## Recommendations as 'topNList' with n = 10 for 94 users.

Let’s explore the results of the recommendations for the first user:

recc_user_1 <- ibcf_preds@items[[1]] # recommendation for the first user
movies_user_1 <- ibcf_preds@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
  movies_user_2[i] <- as.character(subset(movies, 
                                         movies$movieId == movies_user_1[i])$title)
}
movies_user_2
##  [1] "Wizard of Oz, The (1939)"                                                   
##  [2] "Blade (1998)"                                                               
##  [3] "Mrs. Doubtfire (1993)"                                                      
##  [4] "Crying Game, The (1992)"                                                    
##  [5] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
##  [6] "Witness (1985)"                                                             
##  [7] "Home Alone (1990)"                                                          
##  [8] "Broken Arrow (1996)"                                                        
##  [9] "Game, The (1997)"                                                           
## [10] "Congo (1995)"
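Besides a top-N list, recommenderlab’s predict() can also return the predicted ratings themselves, which is useful for seeing how strongly each movie is recommended; a short sketch:

# predicted ratings (instead of a top-N list) for the test users
ibcf_ratings <- predict(object = IBCF_model, newdata = movie_test, type = "ratings")
as(ibcf_ratings, "matrix")[1:4, 1:4] # small corner; NA where no prediction is made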

Visualization of the results

Now, let’s identify the most recommended movies. The following chart shows, for IBCF, the distribution of the number of times each movie is recommended:

recc_matrix <- sapply(ibcf_preds@items, 
                      function(x){ as.integer(colnames(movies_rat)[x]) }) 
number_of_items <- factor(table(recc_matrix))

chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)

number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
                       number_of_items_top)

for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}

colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
##                            Movie title No of items
## 261                Little Women (1994)          14
## 5   Father of the Bride Part II (1995)          13
## 3              Grumpier Old Men (1995)          12
## 168                First Knight (1995)          12

IBCF recommends items on the basis of the similarity matrix. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data. In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
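This can be checked directly in the fitted model: the similarity matrix it stores is item-by-item, and each row keeps at most k = 30 non-zero entries. A short illustrative check:

ibcf_details <- getModel(IBCF_model)
dim(ibcf_details$sim)                 # item x item similarity matrix
table(rowSums(ibcf_details$sim > 0))  # non-zero neighbours per item, at most k = 30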

USER-based Collaborative Filtering Model

Building the recommendation system:

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $weighted
## [1] TRUE
## 
## $normalize
## [1] "center"
## 
## $min_matching_items
## [1] 0
## 
## $min_predictive_items
## [1] 0
UBCF_model <- Recommender(data = movie_train, method = "UBCF")
UBCF_model
## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 326 users.
model_details <- getModel(UBCF_model)
#names(model_details)
model_details$data
## 326 x 447 rating matrix of class 'realRatingMatrix' with 28870 ratings.
## Normalized using center on rows.

Implementing the recommender model on the test set

In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.

n_recommended <- 10
ubcf_preds <- predict(object = UBCF_model,
                          newdata = movie_test, 
                          n = n_recommended) 
ubcf_preds
## Recommendations as 'topNList' with n = 10 for 94 users.

Visualization of the results

recc_matrix <- sapply(ubcf_preds@items, 
                      function(x){ as.integer(colnames(movies_rat)[x]) })
number_of_items <- factor(table(recc_matrix))

chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)

Compared with IBCF, the distribution has a longer tail. This means that some movies are recommended much more often than the others. The maximum number of times a single movie is recommended is more than 30, compared to roughly 10 for IBCF.
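The most frequently recommended titles under UBCF can be listed the same way as for IBCF; a short sketch reusing the objects computed above:

# top 4 most frequently recommended movies under UBCF
top_ubcf <- head(sort(table(recc_matrix), decreasing = TRUE), 4)
data.frame(title = movies$title[match(as.integer(names(top_ubcf)), movies$movieId)],
           times_recommended = as.integer(top_ubcf))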

Evaluating the Recommender Systems

There are a few options to choose from when building a recommendation engine. In order to compare their performance and choose the most appropriate model, I follow these steps:

  • Prepare the data to evaluate performance
  • Evaluate and compare the models
  • Identify the most suitable model
  • Optimize the model parameters

Preparing the data

In order to define the number of items to use for each user when generating recommendations, I need to check the minimum number of items rated by any user.

min(rowCounts(movies_rat)) 
## [1] 8
keep <- 5 
threshold <- 3
n_eval <- 1 #number of times to run evaluation

evaluation <- evaluationScheme(data = movies_rat, 
                              method = "split",
                               train = 0.8, #Splitting the data into training and test sets is often done with an 80/20 proportion
                              given = keep, 
                              goodRating = threshold, 
                              k = n_eval) 
evaluation
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")

Comparing models

models_ev <- list(
IBCF_cosine = list(name = "IBCF", 
                param = list(method = "cosine")),#IBCF using the Cosine as the distance function
IBCF_pearson = list(name = "IBCF", 
                param = list(method = "pearson")),#IBCF using the Pearson correlation as the distance function
UBCF_cosine = list(name = "UBCF", 
                param = list(method = "cosine")),#UBCF  using the Cosine as the distance function
UBCF_pearson = list(name = "UBCF", 
                param = list(method = "pearson")),#UBCF using the Pearson correlation as the distance function
Random = list(name = "RANDOM", param=NULL)#Random recommendations to have a base line
)

Then, I define a set of candidate numbers of recommended movies (n_recommendations <- c(1, 5, seq(10, 100, 10))) and run and evaluate the models:

n_recommendations <- c(1, 5, seq(10, 100, 10))
mod_results <- evaluate(x = evaluation, 
                         method = models_ev, 
                         n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.32sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.37sec/0.02sec] 
## UBCF run fold/sample [model time/prediction time]
##   1  [0sec/0.12sec] 
## UBCF run fold/sample [model time/prediction time]
##   1
## Timing stopped at: 0.02 0 0.03
## Error in neighbors[, x] : incorrect number of dimensions
## RANDOM run fold/sample [model time/prediction time]
##   1  [0sec/0.03sec]
## Warning in .local(x, method, ...): 
##   Recommender 'UBCF_pearson' has failed and has been removed from the results!
sapply(mod_results, class) == "evaluationResults"
##  IBCF_cosine IBCF_pearson  UBCF_cosine       Random 
##         TRUE         TRUE         TRUE         TRUE

The following table presents, as an example, the first rows of the performance evaluation matrix for the IBCF with cosine distance:

avg_matrices <- lapply(mod_results, avg)
head(avg_matrices$IBCF_cosine[, 5:8])
##    precision      recall         TPR         FPR
## 1  0.2530120 0.003823288 0.003823288 0.002058619
## 5  0.2024096 0.014229671 0.014229671 0.011065612
## 10 0.1783133 0.024691577 0.024691577 0.022706521
## 20 0.1649598 0.043287610 0.043287610 0.045825795
## 30 0.1694606 0.066064346 0.066064346 0.067333567
## 40 0.1700283 0.088997435 0.088997435 0.088916393

Identifying the most suitable model

plot(mod_results, annotate = 1, legend = "topleft") 
title("ROC curve")

plot(mod_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")

A good performance index is the area under the ROC curve (AUC). Even without computing it, the chart shows that the highest curve belongs to UBCF with cosine distance, so it is the best-performing technique.
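The AUC can also be approximated numerically from the averaged confusion matrices by applying the trapezoidal rule to the (FPR, TPR) points; a small sketch (approx_auc is an illustrative helper, not a recommenderlab function):

# rough AUC: the curve is only sampled at the chosen n_recommendations values
approx_auc <- function(res) {
  m <- avg(res)
  fpr <- c(0, m[, "FPR"])
  tpr <- c(0, m[, "TPR"])
  sum(diff(fpr) * (head(tpr, -1) + tail(tpr, -1)) / 2)
}
sapply(mod_results, approx_auc)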

The precision-recall plot confirms that UBCF with cosine distance is the top model. Depending on the main purpose of the system, an appropriate number of items to recommend should be defined.

Optimizing a numeric parameter

IBCF takes account of the k-closest items. I will explore more values, ranging between 5 and 40, in order to tune this parameter:

vector_k <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
  list(name = "IBCF",
       param = list(method = "cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)

Now I build and evaluate the same IBCF/cosine models with different values of the k-closest items:

n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = evaluation, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
##   1  [0.44sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.29sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.31sec/0.02sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.32sec/0.01sec] 
## IBCF run fold/sample [model time/prediction time]
##   1  [0.28sec/0.01sec]
plot(list_results, annotate = 1, legend = "topleft") 
title("ROC curve")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")

Based on the ROC curve, the k with the largest AUC is 10. Another good candidate is 5, but its TPR never gets high: even if we set a very high n, the algorithm won’t be able to recommend a large share of the items the user liked. The IBCF with k = 5 recommends only a few items similar to those already rated, so it can’t be used to recommend many items.

Based on the precision/recall plot, k should be set to 10 to achieve the highest recall. If we are more interested in the precision, we set k to 5.
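The same approx_auc helper sketched in the evaluation section can be applied to list_results to check this numerically:

sapply(list_results, approx_auc) # AUC per k; the plots above suggest it peaks at k = 10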

Conclusion

User-based collaborative filtering is a type of memory-based collaborative filtering that uses all user data in the database to create recommendations. If there were millions of users, this computation would be very time consuming. UBCF needs access to the initial data: since it must keep the entire database in memory, it doesn’t work well with a big rating matrix, and building the similarity matrix requires a lot of computing power and time.

However, UBCF has proven to be slightly more accurate than IBCF, so it’s a good option when the dataset is not too big. The evaluation showed that user-based collaborative filtering with cosine distance is the best of the models compared (the Pearson-based UBCF variant failed during evaluation and was removed from the results).
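Putting the conclusion into practice, a final UBCF model with cosine similarity can be trained on the full filtered matrix and used to produce top-10 lists; a minimal sketch using the same recommenderlab calls as above (nn = 25 is the default shown earlier):

final_model <- Recommender(data = movies_rat, method = "UBCF",
                           parameter = list(method = "cosine", nn = 25))
final_preds <- predict(final_model, newdata = movies_rat, n = 10)
final_preds@itemLabels[final_preds@items[[1]]] # movieIds recommended to the first user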

Shiny App

To see the implementation of the model, visit the following URL:

https://erinda.shinyapps.io/Movie_Recommender/