MovieLense Recommendation System in R

Item-based collaborative filtering (IBCF) recommends items on the basis of a similarity matrix between items. The algorithm is efficient and scalable. In this project, we will use the demo MovieLens dataset, which recommenderlab provides as the MovieLense object.

Identify which items are similar in terms of having been purchased by the same people. Recommend to a new user the items that are similar to their purchases.

943 x 1664 rating matrix of class ‘realRatingMatrix’ with 99392 ratings.

Each row of MovieLense corresponds to a user, and each column corresponds to a movie. There are 943 x 1664 = 1,569,152 combinations of a user and a movie, so storing the complete matrix would require more than 1,500,000 cells. However, not every user has watched every movie: there are fewer than 100,000 ratings (99,392, about 6 percent of the cells), so the matrix is sparse.

Let’s explore in detail.

[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"

Let’s take a look at the methods that we can apply on the objects of this class:

 [1] [                      [<-                    binarize              
 [4] calcPredictionAccuracy coerce                 colCounts             
 [7] colMeans               colSds                 colSums               
[10] denormalize            dim                    dimnames              
[13] dimnames<-             dissimilarity          evaluationScheme      
[16] getData.frame          getList                getNormalize          
[19] getRatingMatrix        getRatings             getTopNLists          
[22] image                  normalize              nratings              
[25] Recommender            removeKnownRatings     rowCounts             
[28] rowMeans               rowSds                 rowSums               
[31] sample                 show                   similarity            
see '?methods' for accessing help and source code
1388448 bytes
12740464 bytes
9.17604692433566 bytes

MovieLense occupies much less space than the equivalent standard R matrix. The ratio is about 1:9 (the third figure above is this ratio, not a size in bytes), and the reason is the sparsity of MovieLense. A standard R matrix object stores every cell, including the missing values, so it stores about 16 times more cells.
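The size comparison can be reproduced with a few lines. This is a sketch that assumes the recommenderlab package is installed; the names size_sparse and size_dense are illustrative, and the exact byte counts may differ between package versions.

```r
library(recommenderlab)
data(MovieLense)

# Memory used by the sparse realRatingMatrix
size_sparse <- object.size(MovieLense)

# Memory used by the equivalent dense base R matrix
size_dense <- object.size(as(MovieLense, "matrix"))

size_sparse
size_dense

# Ratio of the two sizes, as a plain number
as.numeric(size_dense) / as.numeric(size_sparse)

# Share of user-movie cells that actually hold a rating
nratings(MovieLense) / (nrow(MovieLense) * ncol(MovieLense))
```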

Computing the similarity matrix

Let’s determine how similar the first four users are to each other, using the cosine similarity:

[1] "dist"

dist is a base R class, so we can use it in different ways.

Let’s convert similarity_users into a matrix to visualize it.

           1          2          3          4
1 0.00000000 0.16893670 0.03827203 0.06634975
2 0.16893670 0.00000000 0.09706862 0.15310468
3 0.03827203 0.09706862 0.00000000 0.33343036
4 0.06634975 0.15310468 0.33343036 0.00000000
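A matrix like the one above can be obtained roughly as follows. This is a sketch assuming recommenderlab is installed; similarity_users matches the name used in the text, while similarity_items, user_sim and item_sim are illustrative names.

```r
library(recommenderlab)
data(MovieLense)

# Cosine similarity between the first four users (rows)
similarity_users <- similarity(MovieLense[1:4, ],
                               method = "cosine",
                               which = "users")
class(similarity_users)

# Convert to a plain matrix for inspection
user_sim <- as.matrix(similarity_users)
user_sim

# Same idea for the first four movies (columns)
similarity_items <- similarity(MovieLense[, 1:4],
                               method = "cosine",
                               which = "items")
item_sim <- as.matrix(similarity_items)
item_sim
```

Either matrix can then be passed to image() to draw a heat map.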

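In the heat map of this matrix, the more red a cell is, the more similar the two users are. Note that the diagonal is red, since it’s comparing each user with itself. Using the same approach, we can compute the similarity between the first four movies: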

                  Toy Story (1995) GoldenEye (1995) Four Rooms (1995) Get Shorty (1995)
Toy Story (1995)         0.0000000        0.4023822         0.3302448         0.4549379
GoldenEye (1995)         0.4023822        0.0000000         0.2730692         0.5025708
Four Rooms (1995)        0.3302448        0.2730692         0.0000000         0.3248664
Get Shorty (1995)        0.4549379        0.5025708         0.3248664         0.0000000

As with the user similarity matrix, we can visualize this matrix as a heat map.

Similarity is the basis of collaborative filtering models. The following recommender models are available for realRatingMatrix objects:

[1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
[3] "IBCF_realRatingMatrix"         "POPULAR_realRatingMatrix"     
[5] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
[7] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
[9] "UBCF_realRatingMatrix"        

Descriptions

$ALS_realRatingMatrix
[1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."

$ALS_implicit_realRatingMatrix
[1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."

$IBCF_realRatingMatrix
[1] "Recommender based on item-based collaborative filtering."

$POPULAR_realRatingMatrix
[1] "Recommender based on item popularity."

$RANDOM_realRatingMatrix
[1] "Produce random recommendations (real ratings)."

$RERECOMMEND_realRatingMatrix
[1] "Re-recommends highly rated items (real ratings)."

$SVD_realRatingMatrix
[1] "Recommender based on SVD approximation with column-mean imputation."

$SVDF_realRatingMatrix
[1] "Recommender based on Funk SVD with gradient descend."

$UBCF_realRatingMatrix
[1] "Recommender based on user-based collaborative filtering."

These are the default parameters of the IBCF model:

$k
[1] 30

$method
[1] "Cosine"

$normalize
[1] "center"

$normalize_sim_matrix
[1] FALSE

$alpha
[1] 0.5

$na_as_zero
[1] FALSE
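The model names, descriptions and parameters listed above can be pulled from the recommenderlab registry. A minimal sketch, assuming recommenderlab is installed; recommender_models is an illustrative variable name, and the set of registered models depends on the package version.

```r
library(recommenderlab)

# All recommender models registered for realRatingMatrix data
recommender_models <- recommenderRegistry$get_entries(
  dataType = "realRatingMatrix"
)
names(recommender_models)

# One-line description of each model
lapply(recommender_models, "[[", "description")

# Default parameters of the item-based model
recommender_models$IBCF_realRatingMatrix$parameters
```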

Data exploration

[1] "realRatingMatrix"
attr(,"package")
[1] "recommenderlab"

MovieLense is a realRatingMatrix object containing a dataset about movie ratings. Each row corresponds to a user, each column to a movie, and each value to a rating.

[1]  943 1664

There are 943 users and 1664 movies. realRatingMatrix is an S4 class, so its components are stored in slots:

[1] "data"      "normalize"
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
[1]  943 1664

MovieLense@data belongs to the dgCMatrix class, which inherits from Matrix. To perform custom data exploration, we might need to access this slot.

Exploring the values of the rating

[1] 5 4 0 3 1 2

The ratings are integers in the range 0-5. Let’s count the occurrences of each of them.

vector_ratings
      0       1       2       3       4       5 
1469760    6059   11307   27002   33947   21077 

According to the documentation, a rating equal to 0 represents a missing value, so we can remove the zeros from vector_ratings. To visualize the frequencies as a bar plot, we can use ggplot2. Let’s convert the ratings into categories using factor and build a quick chart:

Let’s go ahead and visualize. The following image shows the distribution of the ratings. Most of the ratings are above 2, and the most common is 4.
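The counting and plotting steps might look like the sketch below. It assumes recommenderlab is installed; the plot is only built when ggplot2 is available, and ratings_plot is an illustrative name.

```r
library(recommenderlab)
data(MovieLense)

# Flatten the sparse rating matrix into a plain numeric vector
vector_ratings <- as.vector(MovieLense@data)
table(vector_ratings)

# A value of 0 marks a missing rating, so drop the zeros
vector_ratings <- vector_ratings[vector_ratings != 0]

# Treat the ratings as categories
vector_ratings <- factor(vector_ratings)

# Bar chart of the rating frequencies (built only if ggplot2 is installed)
if (requireNamespace("ggplot2", quietly = TRUE)) {
  ratings_plot <- ggplot2::ggplot(
    data.frame(rating = vector_ratings),
    ggplot2::aes(x = rating)
  ) +
    ggplot2::geom_bar() +
    ggplot2::ggtitle("Distribution of the ratings")
  # print(ratings_plot) draws the chart
}
```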

Exploring which movies have been viewed

colCounts: This is the number of non-missing values for each column.

colMeans: This is the average value for each column.

Which are the most viewed movies (TOP 10)?

Sort the movies by number of views:

Let’s visualize the first six rows and build a histogram:

In the preceding chart, you can notice that Star Wars (1977) is the most viewed movie, exceeding the others by about 100 views.
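A sketch of how the view counts can be computed and sorted, assuming recommenderlab is installed; views_per_movie and table_views are illustrative names.

```r
library(recommenderlab)
data(MovieLense)

# Number of ratings (views) per movie
views_per_movie <- colCounts(MovieLense)

# Put the counts in a data frame and sort by number of views
table_views <- data.frame(
  movie = names(views_per_movie),
  views = views_per_movie
)
table_views <- table_views[order(table_views$views, decreasing = TRUE), ]

# Ten most viewed and ten least viewed movies
head(table_views, 10)
tail(table_views, 10)
```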

Which are the least viewed movies (BOTTOM 10)?

Explore Average Ratings

Let’s visualize by creating a chart. The following image shows the distribution of the average movie rating:

The peak of the distribution is around 3, and there are a few movies whose average rating is either 1 or 5. The likely reason is that these movies were rated by only a few people, so we shouldn’t take them into account.

Let’s remove the movies whose number of views is below a defined threshold, for instance, below 100:

The following image shows the distribution of the relevant average ratings:

All the average ratings are between 2.3 and 4.5. As expected, we removed the extremes. The peak of the distribution changes, and now it is around 4.
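The filtering of average ratings can be sketched as follows, assuming recommenderlab is installed. The cutoff is written here as "at least 100 views", which is one reading of the threshold described above; average_ratings_relevant is an illustrative name.

```r
library(recommenderlab)
data(MovieLense)

# Average rating and number of views per movie
average_ratings <- colMeans(MovieLense)
views_per_movie <- colCounts(MovieLense)

# Keep only the movies with at least 100 views
average_ratings_relevant <- average_ratings[views_per_movie >= 100]

# Compare the spread before and after filtering
range(average_ratings)
range(average_ratings_relevant)
```

Passing either vector to hist() gives a quick view of its distribution.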

Let’s build the heatmap using image. The following image displays the heatmap of the rating matrix:

The heatmap above is difficult to read because it covers the whole matrix. Let’s use image() again on a subset. The following image shows the heatmap of the first rows and columns:

To select the top 1 percent of users and movies, let’s use the quantile function:

   99% 
440.96 
   99% 
371.07 

Another heat map:

We keep only the most relevant data:

Users who have rated at least 50 movies.

Movies that have been watched at least 100 times.

560 x 332 rating matrix of class ‘realRatingMatrix’ with 55298 ratings.

The ratings_movies object contains about 60 percent of the users and a fifth of the movies in comparison with MovieLense.
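The selection can be sketched as below, assuming recommenderlab is installed. The cutoffs are written as strict inequalities, which is an assumption; use >= instead if "at least" is meant literally.

```r
library(recommenderlab)
data(MovieLense)

# Keep users with more than 50 ratings and movies with more than 100 views
ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]
ratings_movies

# Share of users and movies that survive the filter
nrow(ratings_movies) / nrow(MovieLense)
ncol(ratings_movies) / ncol(MovieLense)
```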

Let’s visualize the top 2 percent of users and movies in the new matrix:

Let’s build the heatmap:

As we already noticed, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. However, we have visualized the top movies only. In order to have an overview of all the users, let’s take a look at the distribution of the average rating by user:

Let’s visualize the distribution. As suspected, the average rating varies a lot across different users.

Normalize the data.

Let’s take a look at the average rating by users:

[1] 0

Let’s visualize the normalized matrix

The first difference that we can notice is the colors, and this is because the data is continuous. Previously, the rating was an integer between 1 and 5. After the normalization, the rating can be any number between -4 and 4, because each rating has the user’s average rating subtracted from it.

There are still some lines that are more blue and some that are more red. The reason is that we are visualizing only the top movies. We already checked that the average rating is 0 for each user.
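A sketch of the normalization step and the check on the user averages, assuming recommenderlab is installed and ratings_movies is built as a filtered subset of MovieLense (the strict cutoffs are an assumption).

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]

# Center each user's ratings on that user's own average
ratings_movies_norm <- normalize(ratings_movies)

# Number of users whose average normalized rating is noticeably above 0
sum(rowMeans(ratings_movies_norm) > 0.00001)
```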

Binarizing the data

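First, we define a matrix whose cells are 1 if the user has watched the movie and 0 otherwise. To visualize it, let’s select the top 5 percent of users and movies using quantile. The row and column counts are the same as in the original matrix, so we can still apply rowCounts and colCounts on ratings_movies: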

Let’s build the heat map:

Only a few cells contain unwatched movies. This is just because we selected the top users and movies.

Let’s use the same approach to compute and visualize the other binary matrix. The cells having a rating above the threshold will have their value equal to 1, and the other cells will be 0s:

Let’s build the heat map:

As expected, we have more white cells now. Depending on the model, we can leave the ratings matrix as it is or normalize/binarize it.
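Both binary matrices can be sketched with binarize, assuming recommenderlab is installed. The threshold of 3 for a "good" rating and the name ratings_movies_good are assumptions for illustration; note that minRating keeps ratings greater than or equal to the given value.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]

# 1 if the user rated (watched) the movie at all, 0 otherwise
ratings_movies_watched <- binarize(ratings_movies, minRating = 1)

# 1 only if the rating is at least 3, 0 otherwise
ratings_movies_good <- binarize(ratings_movies, minRating = 3)

# The stricter threshold leaves fewer 1s
sum(rowCounts(ratings_movies_watched))
sum(rowCounts(ratings_movies_good))
```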

In this section, we prepared the data to perform recommendations. In the upcoming sections, we will build collaborative filtering models.

Training and Test Sets

First, we randomly define the which_train vector that is TRUE for users in the training set and FALSE for the others. We will set the probability in the training set as 80 percent:

[1]  TRUE  TRUE  TRUE FALSE  TRUE FALSE

Let’s define the training and the test sets:

Sample Code:
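A minimal sketch of the split, assuming recommenderlab is installed and ratings_movies is the filtered matrix. The seed is an assumption added for reproducibility, and recc_data_train and recc_data_test are illustrative names.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]

set.seed(1)  # assumed seed, only so the split is repeatable

# TRUE with probability 0.8 (training), FALSE with probability 0.2 (test)
which_train <- sample(x = c(TRUE, FALSE),
                      size = nrow(ratings_movies),
                      replace = TRUE,
                      prob = c(0.8, 0.2))
head(which_train)

# Split the users into the two sets
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
```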

Recommendation model

The model is built from the following inputs:

Data: This is the training set.

Method: This is the name of the technique.

Parameters: These are some optional parameters of the technique.

The model is called IBCF, which stands for item-based collaborative filtering. The outputs below are its parameters.

$k
[1] 30

$method
[1] "Cosine"

$normalize
[1] "center"

$normalize_sim_matrix
[1] FALSE

$alpha
[1] 0.5

$na_as_zero
[1] FALSE

So let’s build it.

Recommender of type ‘IBCF’ for ‘realRatingMatrix’ 
learned using 111 users.
[1] "Recommender"
attr(,"package")
[1] "recommenderlab"

We’ll extract some of the details (description and parameters).

[1] "IBCF: Reduced similarity matrix"

The model_details$sim component contains the similarity matrix. Let’s check its structure:

[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
[1] 332 332

model_details$sim is a square matrix whose size is equal to the number of items. Let’s build the heat map.

Most of the values are equal to 0. The reason is that each row contains only k elements.

[1] 30
row_sums
 30 
332 
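The model-building and inspection steps might look like this. It is a sketch assuming recommenderlab is installed; the seed and the 80/20 split are assumptions, so the number of training users will not necessarily match the output shown above.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]
set.seed(1)  # assumed seed
which_train <- sample(c(TRUE, FALSE), nrow(ratings_movies),
                      replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]

# Build the item-based model, keeping the k = 30 most similar items per item
recc_model <- Recommender(data = recc_data_train,
                          method = "IBCF",
                          parameter = list(k = 30))
recc_model

# Extract the fitted model and look at its similarity matrix
model_details <- getModel(recc_model)
model_details$description
class(model_details$sim)
dim(model_details$sim)

# Number of positive similarities in each row
row_sums <- rowSums(model_details$sim > 0)
table(row_sums)
```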

So each row has 30 elements greater than 0. However, the matrix is not supposed to be symmetric. In fact, the number of non-null elements for each column depends on how many times the corresponding movie was included in the top k of another movie. Let’s check the distribution of the number of elements by column:

Let’s build the distribution chart:

As expected, there are a few movies that are similar to many others. Let’s see which are the movies with the most elements:

[1] "Usual Suspects, The (1995)"             "Sling Blade (1996)"                    
[3] "Star Wars (1977)"                       "Fargo (1996)"                          
[5] "Monty Python and the Holy Grail (1974)" "Casablanca (1942)"                     

Now we apply the model to the test set and recommend six movies to each user:

Recommendations as ‘topNList’ with n = 6 for 449 users. 

The recc_predicted object contains the recommendations

[1] "topNList"
attr(,"package")
[1] "recommenderlab"
[1] "items"      "ratings"    "itemLabels" "n"         

For instance, these are the recommendations for the first user:

[1] 285 326 327 279 185 201

We need to extract the recommended movies from recc_predicted@itemLabels:

[1] "Speed (1994)"            "Peacemaker, The (1997)"  "Scream 2 (1997)"        
[4] "Shine (1996)"            "Evita (1996)"            "Schindler's List (1993)"

Let’s define a matrix with the recommendations for each user:

[1]   6 449

Let’s visualize the recommendations for the first four users:

     1                         2                                       
[1,] "Speed (1994)"            "While You Were Sleeping (1995)"        
[2,] "Peacemaker, The (1997)"  "Much Ado About Nothing (1993)"         
[3,] "Scream 2 (1997)"         "Cold Comfort Farm (1995)"              
[4,] "Shine (1996)"            "Citizen Kane (1941)"                   
[5,] "Evita (1996)"            "Ghost and the Darkness, The (1996)"    
[6,] "Schindler's List (1993)" "Monty Python and the Holy Grail (1974)"
     3                               5                                
[1,] "Aladdin (1992)"                "American President, The (1995)" 
[2,] "In & Out (1997)"               "Full Monty, The (1997)"         
[3,] "Mission: Impossible (1996)"    "Pulp Fiction (1994)"            
[4,] "Room with a View, A (1986)"    "Welcome to the Dollhouse (1995)"
[5,] "Magnificent Seven, The (1954)" "Peacemaker, The (1997)"         
[6,] "Aliens (1986)"                 "Air Force One (1997)"           
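The prediction steps above can be sketched as follows, assuming recommenderlab is installed. The seed and split are assumptions, so the recommended titles will differ from the ones printed above. Note that sapply returns a matrix only when every user receives the same number of recommendations; otherwise it returns a list.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]
set.seed(1)  # assumed seed
which_train <- sample(c(TRUE, FALSE), nrow(ratings_movies),
                      replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]

recc_model <- Recommender(data = recc_data_train, method = "IBCF",
                          parameter = list(k = 30))

# Recommend up to six movies to each user in the test set
n_recommended <- 6
recc_predicted <- predict(object = recc_model,
                          newdata = recc_data_test,
                          n = n_recommended)
recc_predicted

# Item indices and titles recommended to the first test user
recc_user_1 <- recc_predicted@items[[1]]
recc_predicted@itemLabels[recc_user_1]

# Titles for every test user (a matrix if all users get the same count)
recc_matrix <- sapply(recc_predicted@items,
                      function(x) colnames(ratings_movies)[x])
```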

Now, we can identify the most recommended movies. For this purpose, we will define a vector with all the recommendations, and we will build a frequency plot:

The following chart shows the distribution of the number of items for IBCF:

Let’s see which are the most popular recommended movies:

As you can see from the preceding table, the movie “Mr. Smith Goes to Washington” has been recommended the most times.

IBCF recommends items on the basis of the similarity matrix, and it is efficient and scalable. Next, we build a user-based collaborative filtering (UBCF) model on the same data.

Building the recommendation model

$method
[1] "cosine"

$nn
[1] 25

$sample
[1] FALSE

$normalize
[1] "center"

Model with default parameters:

Recommender of type ‘UBCF’ for ‘realRatingMatrix’ 
learned using 111 users.

Components of the model:

[1] "description" "data"        "method"      "nn"          "sample"      "normalize"  
[7] "verbose"    

The object below contains the rating matrix. UBCF is a lazy-learning technique, which means that it needs to access all the data to perform a prediction.

111 x 332 rating matrix of class ‘realRatingMatrix’ with 10507 ratings.
Normalized using center on rows.

Apply the model to the test set

Let’s find the top six recommendations for each new user

Recommendations as ‘topNList’ with n = 6 for 449 users. 

Let’s define a matrix with the recommendations for the test set users:

Let’s take a look at the first four users:

     1                         2                                       
[1,] "Speed (1994)"            "While You Were Sleeping (1995)"        
[2,] "Peacemaker, The (1997)"  "Much Ado About Nothing (1993)"         
[3,] "Scream 2 (1997)"         "Cold Comfort Farm (1995)"              
[4,] "Shine (1996)"            "Citizen Kane (1941)"                   
[5,] "Evita (1996)"            "Ghost and the Darkness, The (1996)"    
[6,] "Schindler's List (1993)" "Monty Python and the Holy Grail (1974)"
     3                               5                                
[1,] "Aladdin (1992)"                "American President, The (1995)" 
[2,] "In & Out (1997)"               "Full Monty, The (1997)"         
[3,] "Mission: Impossible (1996)"    "Pulp Fiction (1994)"            
[4,] "Room with a View, A (1986)"    "Welcome to the Dollhouse (1995)"
[5,] "Magnificent Seven, The (1954)" "Peacemaker, The (1997)"         
[6,] "Aliens (1986)"                 "Air Force One (1997)"           

Frequency Chart

We will compute how many times each movie got recommended and build the related frequency histogram

number_of_items
 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 30 
 9 22 28 34 29 20 32 27 21 16 12 13 14  5  5  4  8  3  3  5  4  1  1  1  3  1  1  2 
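A sketch of the UBCF workflow and the frequency table, assuming recommenderlab is installed. The seed and split are assumptions, so the counts will not match the table above exactly; recc_list is an illustrative name.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]
set.seed(1)  # assumed seed
which_train <- sample(c(TRUE, FALSE), nrow(ratings_movies),
                      replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]

# User-based model with default parameters
recc_model <- Recommender(data = recc_data_train, method = "UBCF")
model_details <- getModel(recc_model)
names(model_details)

# Top six recommendations for each test user, as a list of titles
recc_predicted <- predict(object = recc_model,
                          newdata = recc_data_test,
                          n = 6)
recc_list <- as(recc_predicted, "list")

# How many times each movie was recommended
number_of_items <- table(unlist(recc_list))

# Distribution of those counts, and the most recommended titles
table(number_of_items)
head(sort(number_of_items, decreasing = TRUE), 4)
```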

Compared with IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others. The maximum is 30, compared with 11 for IBCF.

Let’s take a look at the top titles:

The Godfather is the top movie title.

Collaborative filtering on binary data

Data preparation

Let’s build ratings_movies_watched using the binarize method as follows:

In some cases, the data is binary: each value is 1 if the user purchased (or liked) the item, and 0 otherwise. This case is different from the previous cases, so it should be treated separately. As in the other cases, the techniques are item-based and user-based.

In our case, starting from ratings_movies, we can build a ratings_movies_watched matrix whose values will be 1 if the user viewed the movie, and 0 otherwise. We built it in the Binarizing the data section.

The binarizing method is the same as before with IBCF.

On average, each user watched about 100 movies, and only a few watched more than 200 movies.

Let’s define our training and test sets:

Item-based collaborative filtering on binary data

The model is built as before, except that the input parameter method is set to Jaccard.

Same as before, let’s recommend six items to each of the users in the test set:

Let’s further examine the recommendations for the first four users.

Note: The approach is similar to IBCF using a rating matrix. Since we are not taking the ratings into account, the results will be less accurate.
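The binary workflow can be sketched end to end as follows, assuming recommenderlab is installed. The seed and the 80/20 split are assumptions for illustration.

```r
library(recommenderlab)
data(MovieLense)

ratings_movies <- MovieLense[rowCounts(MovieLense) > 50,
                             colCounts(MovieLense) > 100]

# 1 if the user watched the movie, 0 otherwise
ratings_movies_watched <- binarize(ratings_movies, minRating = 1)

set.seed(1)  # assumed seed
which_train <- sample(c(TRUE, FALSE), nrow(ratings_movies_watched),
                      replace = TRUE, prob = c(0.8, 0.2))
recc_data_train <- ratings_movies_watched[which_train, ]
recc_data_test <- ratings_movies_watched[!which_train, ]

# Item-based model on binary data, using the Jaccard index as similarity
recc_model <- Recommender(data = recc_data_train,
                          method = "IBCF",
                          parameter = list(method = "Jaccard"))

# Recommend up to six movies to each test user
recc_predicted <- predict(object = recc_model,
                          newdata = recc_data_test,
                          n = 6)
recc_list <- as(recc_predicted, "list")

# Recommendations for the first four test users
recc_list[1:4]
```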


