In this project, we developed a collaborative filtering recommender (CFR) system for recommending movies.
The basic idea of CFR systems is that, if two users share the same interests in the past, e.g. they liked the same book or the same movie, they will also have similar tastes in the future. If, for example, user A and user B have a similar purchase history and user A recently bought a book that user B has not yet seen, the basic idea is to propose this book to user B. In this project, in order to recommend movies we will use a large set of users preferences towards the movies from a publicly available movie rating dataset.
The following libraries were used in this project:
library(recommenderlab)
library(ggplot2)
library(data.table)
library(reshape2)
The dataset used was from MovieLens, and is publicly available at http://grouplens.org/datasets/movielens/latest. In order to keep the recommender simple, I used the smallest dataset available (ml-latest-small.zip), which at the time of download contaied 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users between April 03, 1996 and January 09, 2016. This dataset was generated on January 11, 2016.
The data are contained in four files: links.csv, movies.csv, ratings.csv and tags.csv. We only use the files movies.csv and ratings.csv to build a recommendation system.
A summary of movies is given below, togeher with several first rows of a dataframe:
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
here is a summary and a head of ratings:
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
Some pre-processing of the data available is required before creating the recommendation system.
We re-organize the information of movie genres in such a way that allows future users to search for the movies they like within specific genres. From the design perspective, this is much easier for the user compared to selecting a movie from a single very long list of all the available movies.
We use a one-hot encoding to create a matrix of corresponding genres for each movie.
## Action Adventure Animation Children Comedy Crime Documentary Drama
## 1 0 1 1 1 1 0 0 0
## 2 0 1 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0 0
## Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
## Western
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Now, I create a search matrix which allows an easy search of a movie by any of its genre.
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
We can see that each movie can correspond to either one or more than one genre.
In order to use the ratings data for building a recommendation engine with recommenderlab, I convert rating matrix into a sparse matrix of type realRatingMatrix.
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
The recommenderlab package contains some options for the recommendation algorithm:
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
we will use IBCF and UBCF models. Check the parameters of these two models.
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
Now, we explore values of ratings.
vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
table_ratings <- table(vector_ratings) # what is the count of each rating value
table_ratings
## vector_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4
## 6791761 1198 3258 1567 7943 5484 21729 12237 28880
## 4.5 5
## 8187 14856
There are 11 unique score values. The lower values mean lower ratings and vice versa.
According to the documentation, a rating equal to 0 represents a missing value, so we remove them from the dataset before visualizing the results.
As we see, tehre are less low (less than 3) rating scores, the majority of movies are rated with a score of 3 or higher. The most common rating is 4.
Now, let’s see what are the most viewed movies.
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
We see that “Pulp Fiction (1994)” is the most viewed movie, exceeding the second-most-viewed “Forrest Gump (1994)” by 14 views.
We identify the top-rated movies by computing the average rating of each of them.