Building a Movie Recommendation System

Project Overview

In this project, we developed a collaborative filtering recommender (CFR) system for recommending movies.

The basic idea of CFR systems is that if two users shared the same interests in the past, e.g. they liked the same book or the same movie, they will also have similar tastes in the future. If, for example, user A and user B have a similar purchase history and user A recently bought a book that user B has not yet seen, the system proposes this book to user B. To recommend movies in this project, we use a large set of users' preferences for movies from a publicly available movie rating dataset.

Used Libraries

The following libraries were used in this project:

library(recommenderlab)
library(ggplot2)
library(data.table)
library(reshape2)

Dataset

The dataset comes from MovieLens and is publicly available at http://grouplens.org/datasets/movielens/latest. To keep the recommender simple, I used the smallest dataset available (ml-latest-small.zip), which at the time of download contained 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.

The data are contained in four files: links.csv, movies.csv, ratings.csv and tags.csv. We only use the files movies.csv and ratings.csv to build a recommendation system.
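As a minimal sketch of the loading step, the snippet below reads a movies table with read.csv. The inline text stands in for the first rows of movies.csv so that the snippet runs on its own; the commented file paths are assumptions about where the unzipped files live.

```r
# Inline stand-in for the first rows of movies.csv (same columns as the real file).
movies_csv <- 'movieId,title,genres
1,Toy Story (1995),Adventure|Animation|Children|Comedy|Fantasy
2,Jumanji (1995),Adventure|Children|Fantasy
3,Grumpier Old Men (1995),Comedy|Romance'

# stringsAsFactors = FALSE keeps title and genres as character columns.
movies <- read.csv(text = movies_csv, stringsAsFactors = FALSE)

# With the real files the calls would look like this (paths are assumptions):
# movies  <- read.csv("movies.csv", stringsAsFactors = FALSE)
# ratings <- read.csv("ratings.csv")

summary(movies)
head(movies)
```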

A summary of movies is given below, together with the first several rows of the data frame:

##     movieId          title              genres         
##  Min.   :     1   Length:10329       Length:10329      
##  1st Qu.:  3240   Class :character   Class :character  
##  Median :  7088   Mode  :character   Mode  :character  
##  Mean   : 31924                                        
##  3rd Qu.: 59900                                        
##  Max.   :149532

##   movieId                              title
## 1       1                   Toy Story (1995)
## 2       2                     Jumanji (1995)
## 3       3            Grumpier Old Men (1995)
## 4       4           Waiting to Exhale (1995)
## 5       5 Father of the Bride Part II (1995)
## 6       6                        Heat (1995)
##                                        genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2                  Adventure|Children|Fantasy
## 3                              Comedy|Romance
## 4                        Comedy|Drama|Romance
## 5                                      Comedy
## 6                       Action|Crime|Thriller

Here is a summary and the first rows of ratings:

##      userId         movieId           rating        timestamp        
##  Min.   :  1.0   Min.   :     1   Min.   :0.500   Min.   :8.286e+08  
##  1st Qu.:192.0   1st Qu.:  1073   1st Qu.:3.000   1st Qu.:9.711e+08  
##  Median :383.0   Median :  2497   Median :3.500   Median :1.115e+09  
##  Mean   :364.9   Mean   : 13381   Mean   :3.517   Mean   :1.130e+09  
##  3rd Qu.:557.0   3rd Qu.:  5991   3rd Qu.:4.000   3rd Qu.:1.275e+09  
##  Max.   :668.0   Max.   :149532   Max.   :5.000   Max.   :1.452e+09

##   userId movieId rating  timestamp
## 1      1      16    4.0 1217897793
## 2      1      24    1.5 1217895807
## 3      1      32    4.0 1217896246
## 4      1      47    4.0 1217896556
## 5      1      50    4.0 1217896523
## 6      1     110    4.0 1217896150

Data Pre-processing

Some pre-processing of the data available is required before creating the recommendation system.

We re-organize the information of movie genres in such a way that allows future users to search for the movies they like within specific genres. From the design perspective, this is much easier for the user compared to selecting a movie from a single very long list of all the available movies.

Extract a list of genres

We use one-hot encoding to create a genre matrix with one row per movie and one column per genre.

##   Action Adventure Animation Children Comedy Crime Documentary Drama
## 1      0         1         1        1      1     0           0     0
## 2      0         1         0        1      0     0           0     0
## 3      0         0         0        0      1     0           0     0
## 4      0         0         0        0      1     0           0     1
## 5      0         0         0        0      1     0           0     0
## 6      1         0         0        0      0     1           0     0
##   Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1       1         0      0       0       0       0      0        0   0
## 2       1         0      0       0       0       0      0        0   0
## 3       0         0      0       0       0       1      0        0   0
## 4       0         0      0       0       0       1      0        0   0
## 5       0         0      0       0       0       0      0        0   0
## 6       0         0      0       0       0       0      0        1   0
##   Western
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
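The encoding step can be sketched in base R as follows. This is one possible way to do it, not necessarily the code that produced the output above; the three-row movies data frame and the names genre_lists, all_genres and genre_matrix are illustrative assumptions.

```r
# Small stand-in for the movies data frame (columns as in movies.csv).
movies <- data.frame(
  movieId = c(1, 3, 6),
  title   = c("Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"),
  genres  = c("Adventure|Animation|Children|Comedy|Fantasy",
              "Comedy|Romance",
              "Action|Crime|Thriller"),
  stringsAsFactors = FALSE
)

# Split each genres string on the literal "|" separator.
genre_lists <- strsplit(movies$genres, "|", fixed = TRUE)

# Column set: every genre that occurs at least once, sorted.
all_genres <- sort(unique(unlist(genre_lists)))

# One row per movie, one column per genre; 1 if the movie has that genre.
genre_matrix <- t(sapply(genre_lists, function(g) as.integer(all_genres %in% g)))
colnames(genre_matrix) <- all_genres
genre_matrix
```

Passing fixed = TRUE matters here: without it, "|" is read as a regular-expression alternation and the strings are split into single characters.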

Create a matrix to search for a movie by genre

Now, I create a search matrix that allows an easy search for a movie by any of its genres.

##   movieId                              title Action Adventure Animation
## 1       1                   Toy Story (1995)      0         1         1
## 2       2                     Jumanji (1995)      0         1         0
## 3       3            Grumpier Old Men (1995)      0         0         0
## 4       4           Waiting to Exhale (1995)      0         0         0
## 5       5 Father of the Bride Part II (1995)      0         0         0
## 6       6                        Heat (1995)      1         0         0
##   Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1        1      1     0           0     0       1         0      0       0
## 2        1      0     0           0     0       1         0      0       0
## 3        0      1     0           0     0       0         0      0       0
## 4        0      1     0           0     1       0         0      0       0
## 5        0      1     0           0     0       0         0      0       0
## 6        0      0     1           0     0       0         0      0       0
##   Mystery Romance Sci-Fi Thriller War Western
## 1       0       0      0        0   0       0
## 2       0       0      0        0   0       0
## 3       0       1      0        0   0       0
## 4       0       1      0        0   0       0
## 5       0       0      0        0   0       0
## 6       0       0      0        1   0       0

We can see that each movie belongs to one or more genres.
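A search matrix of this shape can be sketched by binding the movie identifiers and titles to the genre indicator columns. The toy data and the names genre_flags and comedy_romance below are assumptions for illustration only.

```r
# Toy movie list and matching genre indicator columns (illustrative values).
movies <- data.frame(
  movieId = c(1, 3, 6),
  title   = c("Toy Story (1995)", "Grumpier Old Men (1995)", "Heat (1995)"),
  stringsAsFactors = FALSE
)
genre_flags <- data.frame(
  Action  = c(0, 0, 1),
  Comedy  = c(1, 1, 0),
  Romance = c(0, 1, 0)
)

# Bind identifiers and titles to the genre flags, row by row.
search_matrix <- cbind(movies, genre_flags)

# Example search: titles that are tagged as both Comedy and Romance.
comedy_romance <- search_matrix$title[search_matrix$Comedy == 1 &
                                      search_matrix$Romance == 1]
comedy_romance
```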

Converting the ratings matrix into a proper format

In order to use the ratings data for building a recommendation engine with recommenderlab, I convert the rating matrix into a sparse matrix of type realRatingMatrix.

## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
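One way to perform this conversion, sketched on a toy ratings table, is to reshape the long table into a user-by-movie matrix with dcast from reshape2 and then coerce it. The five-row ratings data frame is an assumption; the real input would be the contents of ratings.csv.

```r
library(recommenderlab)
library(reshape2)

# Toy ratings in long format (same columns as ratings.csv, minus timestamp).
ratings <- data.frame(
  userId  = c(1, 1, 2, 2, 3),
  movieId = c(16, 24, 16, 32, 24),
  rating  = c(4.0, 1.5, 3.0, 4.0, 5.0)
)

# Users in rows, movies in columns; unrated user/movie pairs become NA.
wide <- dcast(ratings, userId ~ movieId, value.var = "rating")

# Drop the userId column and keep it as row names instead.
ratingmat <- as.matrix(wide[, -1])
rownames(ratingmat) <- wide$userId

# NA entries are treated as missing ratings in the sparse representation.
ratingmat <- as(ratingmat, "realRatingMatrix")
ratingmat
```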

Exploring Parameters of Recommendation Models

The recommenderlab package provides several recommendation algorithms for data of type realRatingMatrix:

## [1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix"         "POPULAR_realRatingMatrix"     
## [5] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
## [7] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
## [9] "UBCF_realRatingMatrix"

## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
## 
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
## 
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
## 
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
## 
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
## 
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
## 
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
## 
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
## 
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
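Listings like the two above can be obtained from the recommenderlab registry. The sketch below assumes only that recommenderlab is installed; the exact set of entries depends on the package version.

```r
library(recommenderlab)

# Look up every algorithm registered for realRatingMatrix data.
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")

# Names of the available models.
names(recommender_models)

# Short description of each model.
lapply(recommender_models, "[[", "description")
```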

We will use the IBCF and UBCF models. Let us check the parameters of these two models.

recommender_models$IBCF_realRatingMatrix$parameters

## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

recommender_models$UBCF_realRatingMatrix$parameters

## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"
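Both models default to cosine similarity. To make that parameter concrete, here is a minimal base R sketch of cosine similarity between two users, computed over the items both have rated. The helper cosine_sim and the toy vectors are hypothetical and not part of recommenderlab, whose own implementation may differ in details such as how missing ratings are handled.

```r
# Cosine similarity restricted to positions where both vectors are non-missing.
# Hypothetical helper for illustration only.
cosine_sim <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)  # no co-rated items, similarity undefined
  sum(x[both] * y[both]) /
    (sqrt(sum(x[both]^2)) * sqrt(sum(y[both]^2)))
}

# Toy ratings of three users for the same four movies (NA = not rated).
user_a <- c(4, NA, 3, 5)
user_b <- c(4, 2, NA, 5)
user_c <- c(1, 5, 4, NA)

cosine_sim(user_a, user_b)  # identical ratings on the co-rated movies
cosine_sim(user_a, user_c)  # less similar
```

With normalize set to "center", each user's mean rating is subtracted before similarities are computed, so that users who rate systematically high or low become comparable.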

Further data exploration

Now, we explore values of ratings.

vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings

##  [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5

table_ratings <- table(vector_ratings) # what is the count of each rating value
table_ratings

## vector_ratings
##       0     0.5       1     1.5       2     2.5       3     3.5       4 
## 6791761    1198    3258    1567    7943    5484   21729   12237   28880 
##     4.5       5 
##    8187   14856

There are 11 unique values: ten rating scores from 0.5 to 5 in steps of 0.5, plus 0. Lower values mean lower ratings and vice versa.

Distribution of the ratings

According to the documentation, a rating equal to 0 represents a missing value, so we remove these zeros from the dataset before visualizing the results.

As we see, there are fewer low rating scores (below 3); the majority of ratings (85889 of 105339, about 82%) are 3 or higher. The most common rating is 4.
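A bar chart of this distribution can be sketched with ggplot2 directly from the counts in the table above, with the zeros already left out. The data frame name rating_counts and the plot labels are assumptions.

```r
library(ggplot2)

# Counts per rating value, copied from the table above (zeros excluded).
rating_counts <- data.frame(
  rating = c(0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, 5),
  count  = c(1198, 3258, 1567, 7943, 5484, 21729, 12237, 28880, 8187, 14856)
)

# One bar per rating value; factor() keeps the x axis discrete.
p <- ggplot(rating_counts, aes(x = factor(rating), y = count)) +
  geom_col() +
  labs(x = "Rating", y = "Count", title = "Distribution of the ratings")
p

# Share of all ratings that are 3 or higher.
share_high <- sum(rating_counts$count[rating_counts$rating >= 3]) /
  sum(rating_counts$count)
share_high
```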

Number of views of the top movies

Now, let’s see which movies are the most viewed.

##     movie views                                     title
## 296   296   325                       Pulp Fiction (1994)
## 356   356   311                       Forrest Gump (1994)
## 318   318   308          Shawshank Redemption, The (1994)
## 480   480   294                      Jurassic Park (1993)
## 593   593   290          Silence of the Lambs, The (1991)
## 260   260   273 Star Wars: Episode IV - A New Hope (1977)

We see that “Pulp Fiction (1994)” is the most viewed movie, exceeding the second-most-viewed “Forrest Gump (1994)” by 14 views.
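Counting views amounts to counting the non-missing ratings in each movie column. The base R sketch below does this on a toy user-by-movie matrix; the matrix values and the names ratings_matrix and table_views are assumptions. For a realRatingMatrix, colCounts(ratingmat) from recommenderlab returns the same kind of per-movie counts.

```r
# Toy user x movie matrix; columns are movie ids, NA means "not rated".
ratings_matrix <- matrix(
  c(5, 4, 5, 3,      # movie 296
    4, NA, 5, NA,    # movie 356
    NA, 5, 5, 4),    # movie 318
  nrow = 4,
  dimnames = list(paste0("u", 1:4), c("296", "356", "318"))
)

# Number of users who rated each movie.
views_per_movie <- colSums(!is.na(ratings_matrix))

# Sort movies from most to least viewed.
table_views <- data.frame(
  movie = names(views_per_movie),
  views = as.vector(views_per_movie),
  stringsAsFactors = FALSE
)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]
table_views
```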

Distribution of the average movie rating

We identify the top-rated movies by computing the average rating of each of them.
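The average rating per movie is the column mean over the non-missing entries. A minimal base R sketch on a toy matrix follows; the values are assumptions. colMeans can also be applied directly to a realRatingMatrix.

```r
# Toy user x movie matrix; columns are movie ids, NA means "not rated".
ratings_matrix <- matrix(
  c(5, 4, 5, 3,      # movie 296
    4, NA, 5, NA,    # movie 356
    NA, 5, 5, 4),    # movie 318
  nrow = 4,
  dimnames = list(paste0("u", 1:4), c("296", "356", "318"))
)

# Mean rating per movie, ignoring missing ratings.
average_ratings <- colMeans(ratings_matrix, na.rm = TRUE)

# Movies ordered from highest to lowest average rating.
top_rated <- sort(average_ratings, decreasing = TRUE)
top_rated
```

An average based on only a handful of ratings can be misleading, so it is worth checking how many users rated a movie before trusting its position in such a ranking.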