Project 2

MovieLense Recommendation System in R

Collaborative filtering is a technique used by recommender systems for predicting the interests of one user based on the preference information of other users. This project is an implementation of a Movie Recommender System that uses the following techniques:

  • Item-Item Collaborative Filtering (IBCF)
  • User-User Collaborative Filtering (UBCF)

I have chosen to use the Movielense data that is available with Recommederlab

Load Data

## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.

Data Exploration

Each row of MovieLense corresponds to a user, and each column corresponds to a movie. There are more than 1569152 combinations between a user and a movie.

Let’s explore in detail.

## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"

Look at the first few ratings of the first user

##                                     Toy Story (1995) 
##                                                    5 
##                                     GoldenEye (1995) 
##                                                    3 
##                                    Four Rooms (1995) 
##                                                    4 
##                                    Get Shorty (1995) 
##                                                    3 
##                                       Copycat (1995) 
##                                                    3 
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 
##                                                    5

Visualize first 100 Users and 100 Movies matrix

Number of ratings per user

Number of ratings per movie

Mean rating (averaged over users)

## [1] 3.587565

Available movie meta information

##                                                  title year
## 1                                     Toy Story (1995) 1995
## 2                                     GoldenEye (1995) 1995
## 3                                    Four Rooms (1995) 1995
## 4                                    Get Shorty (1995) 1995
## 5                                       Copycat (1995) 1995
## 6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 1995
##                                                            url unknown Action
## 1        http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 2          http://us.imdb.com/M/title-exact?GoldenEye%20(1995)       0      1
## 3       http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)       0      0
## 4       http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)       0      1
## 5            http://us.imdb.com/M/title-exact?Copycat%20(1995)       0      0
## 6 http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)       0      0
##   Adventure Animation Children's Comedy Crime Documentary Drama Fantasy
## 1         0         1          1      1     0           0     0       0
## 2         1         0          0      0     0           0     0       0
## 3         0         0          0      0     0           0     0       0
## 4         0         0          0      1     0           0     1       0
## 5         0         0          0      0     1           0     1       0
## 6         0         0          0      0     0           0     1       0
##   Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
## 1         0      0       0       0       0      0        0   0       0
## 2         0      0       0       0       0      0        1   0       0
## 3         0      0       0       0       0      0        1   0       0
## 4         0      0       0       0       0      0        0   0       0
## 5         0      0       0       0       0      0        1   0       0
## 6         0      0       0       0       0      0        0   0       0

Determine how similar the first five users are with each other. Let’s compute this using the cosine distance

Let’s convert similarity_users into a matrix to visualize it.

##           1         2         3         4         5
## 1 0.0000000 0.9605820 0.8339504 0.9192637 0.9326136
## 2 0.9605820 0.0000000 0.9268716 0.9370341 0.9848027
## 3 0.8339504 0.9268716 0.0000000 0.9130323 1.0000000
## 4 0.9192637 0.9370341 0.9130323 0.0000000 0.9946918
## 5 0.9326136 0.9848027 1.0000000 0.9946918 0.0000000

The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself:

Determine how similar the first five Movie Ratings are with each other. Let’s compute this using the cosine distance

##                   Toy Story (1995) GoldenEye (1995) Four Rooms (1995)
## Toy Story (1995)         0.0000000        0.9487374         0.9132997
## GoldenEye (1995)         0.9487374        0.0000000         0.9088797
## Four Rooms (1995)        0.9132997        0.9088797         0.0000000
## Get Shorty (1995)        0.9429069        0.9394926         0.8991940
## Copycat (1995)           0.9613638        0.9426876         0.9424719
##                   Get Shorty (1995) Copycat (1995)
## Toy Story (1995)          0.9429069      0.9613638
## GoldenEye (1995)          0.9394926      0.9426876
## Four Rooms (1995)         0.8991940      0.9424719
## Get Shorty (1995)         0.0000000      0.8919936
## Copycat (1995)            0.8919936      0.0000000

Similar to the above screenshot, we can visualize the matrix using this image:

Let’s compute this using the pearson distance

Let’s convert similarity_users into a matrix to visualize it.

##           1         2         3         4         5
## 1 0.0000000 0.5437268 0.5137099 0.6666667 0.6332355
## 2 0.5437268 0.0000000 0.5000000 0.5317799 0.5978454
## 3 0.5137099 0.5000000 0.0000000 0.5000000        NA
## 4 0.6666667 0.5317799 0.5000000 0.0000000 1.0000000
## 5 0.6332355 0.5978454        NA 1.0000000 0.0000000

The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself:

Determine how similar the first five Movie Ratings are with each other. Let’s compute this using the pearson distance

##                   Toy Story (1995) GoldenEye (1995) Four Rooms (1995)
## Toy Story (1995)         0.0000000        0.5623614         0.5481838
## GoldenEye (1995)         0.5623614        0.0000000         0.5651428
## Four Rooms (1995)        0.5481838        0.5651428         0.0000000
## Get Shorty (1995)        0.5271857        0.5696564         0.5000000
## Copycat (1995)           0.6197611        0.5610285         0.5508465
##                   Get Shorty (1995) Copycat (1995)
## Toy Story (1995)          0.5271857      0.6197611
## GoldenEye (1995)          0.5696564      0.5610285
## Four Rooms (1995)         0.5000000      0.5508465
## Get Shorty (1995)         0.0000000      0.5000000
## Copycat (1995)            0.5000000      0.0000000

Similar to the above screenshot, we can visualize the matrix using this image:

When users and items tend to have very differing sets of items, pearson would perform worse. As we can see from the heat map, cosine one seems to show more relations between different users and items.

Exploring the values of the rating

## [1] 5 4 0 3 1 2

The ratings are integers in the range 0-5. Let’s count the occurrences of each of them.

## vector_ratings
##       0       1       2       3       4       5 
## 1469760    6059   11307   27002   33947   21077

According to the documentation, a rating equal to 0 represents a missing value, so we can remove them from vector_ratings. We can also build a frequency plot of the ratings. In order to visualize a bar plot with frequencies, we can use ggplot2. Let’s convert them into categories using factor and build a quick chart:

The following image shows the distribution of the ratings. Most of the ratings are above 2, and the most common is 4.

##                                                   movie views
## Star Wars (1977)                       Star Wars (1977)   583
## Contact (1997)                           Contact (1997)   509
## Fargo (1996)                               Fargo (1996)   508
## Return of the Jedi (1983)     Return of the Jedi (1983)   507
## Liar Liar (1997)                       Liar Liar (1997)   485
## English Patient, The (1996) English Patient, The (1996)   481

Let’s visualize the first six rows and build a histogram:

In the above chart, you can notice that Star Wars (1977) is the most viewed

Explore Average Ratings

Let’s visualize by creating a chart. The following image shows the distribution of the average movie rating:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The highest value is around 3, and there are a few movies whose rating is either 1 or 5. Probably, the reason is that these movies received a rating from a few people only, so we shouldn’t take them into account.

Let’s remove the movies whose number of views is below a defined threshold, for instance, below 100:

The following image shows the distribution of the relevant average ratings:

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

All the rankings are between 2.3 and 4.5.

Let’s build the heatmap. The following image displays the heatmap of the rating first few rows and columns of rating matrix:

Users who have rated at least 50 movies Movies that have been watched at least 100 times

## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings.

Training and Test Sets

First, we randomly define the which_train vector that is TRUE for users in the training set and FALSE for the others. We will set the probability in the training set as 70 percent:

## [1] FALSE FALSE  TRUE FALSE FALSE  TRUE

Let’s define the training and the test sets:

Recommendation model

IBCF Based Model, which stands for item-based collaborative filtering. Below outputs are the parameters.

## $k
## [1] 30
## 
## $method
## [1] "Cosine"
## 
## $normalize
## [1] "center"
## 
## $normalize_sim_matrix
## [1] FALSE
## 
## $alpha
## [1] 0.5
## 
## $na_as_zero
## [1] FALSE

Build Model IBCF

## Recommender of type 'IBCF' for 'realRatingMatrix' 
## learned using 409 users.

Test the IBCF Model

## Recommendations as 'topNList' with n = 6 for 151 users.

Build recommendation matrix

## [1]   6 151

Let’s visualize the recommendations for the first five users:

##      1                                     2                              
## [1,] "Fried Green Tomatoes (1991)"         "Usual Suspects, The (1995)"   
## [2,] "Dumbo (1941)"                        "Braveheart (1995)"            
## [3,] "True Lies (1994)"                    "Batman Forever (1995)"        
## [4,] "Ransom (1996)"                       "Much Ado About Nothing (1993)"
## [5,] "Leaving Las Vegas (1995)"            "Cold Comfort Farm (1995)"     
## [6,] "Hunchback of Notre Dame, The (1996)" "Twister (1996)"               
##      5                                 6                          
## [1,] "Waterworld (1995)"               "Dead Poets Society (1989)"
## [2,] "Army of Darkness (1993)"         "Devil's Own, The (1997)"  
## [3,] "My Best Friend's Wedding (1997)" "Pretty Woman (1990)"      
## [4,] "Full Metal Jacket (1987)"        "Broken Arrow (1996)"      
## [5,] "Cape Fear (1991)"                "Primal Fear (1996)"       
## [6,] "Outbreak (1995)"                 "Saint, The (1997)"        
##      8                             
## [1,] "Natural Born Killers (1994)" 
## [2,] "Fargo (1996)"                
## [3,] "Clockwork Orange, A (1971)"  
## [4,] "Young Guns (1988)"           
## [5,] "Air Force One (1997)"        
## [6,] "Devil's Advocate, The (1997)"

Now, we can identify the most recommended movies. For this purpose, we will define a vector with all the recommendations, and we will build a frequency plot:

Let’s see which are the most popular recommended movies:

##                                                    names.number_of_items_top.
## Ace Ventura: Pet Detective (1994)           Ace Ventura: Pet Detective (1994)
## Beavis and Butt-head Do America (1996) Beavis and Butt-head Do America (1996)
## Outbreak (1995)                                               Outbreak (1995)
## Kiss the Girls (1997)                                   Kiss the Girls (1997)
##                                        number_of_items_top
## Ace Ventura: Pet Detective (1994)                       14
## Beavis and Butt-head Do America (1996)                  13
## Outbreak (1995)                                         13
## Kiss the Girls (1997)                                   10

Now, Lets build a UBCF Based Model, which stands for user-based collaborative filtering. Below outputs are the parameters.

## $method
## [1] "cosine"
## 
## $nn
## [1] 25
## 
## $sample
## [1] FALSE
## 
## $normalize
## [1] "center"

Build Model UBCF

## Recommender of type 'UBCF' for 'realRatingMatrix' 
## learned using 409 users.

Apply model model to test set

Let’s find the top six recommendations for each new user

## Recommendations as 'topNList' with n = 6 for 151 users.

Build recommendation matrix

## [1]   6 151

Let’s take a look at the first five users:

##      1                         
## [1,] "L.A. Confidential (1997)"
## [2,] "Trainspotting (1996)"    
## [3,] "Close Shave, A (1995)"   
## [4,] "Titanic (1997)"          
## [5,] "Donnie Brasco (1997)"    
## [6,] "Leaving Las Vegas (1995)"
##      2                                                                            
## [1,] "Blade Runner (1982)"                                                        
## [2,] "Shawshank Redemption, The (1994)"                                           
## [3,] "Silence of the Lambs, The (1991)"                                           
## [4,] "Casablanca (1942)"                                                          
## [5,] "Schindler's List (1993)"                                                    
## [6,] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
##      5                           6                                
## [1,] "Titanic (1997)"            "Empire Strikes Back, The (1980)"
## [2,] "L.A. Confidential (1997)"  "Clockwork Orange, A (1971)"     
## [3,] "Good Will Hunting (1997)"  "Return of the Jedi (1983)"      
## [4,] "Cop Land (1997)"           "Chinatown (1974)"               
## [5,] "As Good As It Gets (1997)" "Aliens (1986)"                  
## [6,] "Conspiracy Theory (1997)"  "Nikita (La Femme Nikita) (1990)"
##      8                         
## [1,] "Titanic (1997)"          
## [2,] "L.A. Confidential (1997)"
## [3,] "Good Will Hunting (1997)"
## [4,] "Full Monty, The (1997)"  
## [5,] "Apt Pupil (1998)"        
## [6,] "Trainspotting (1996)"

Now, we can identify the most recommended movies. For this purpose, we will define a vector with all the recommendations, and we will build a frequency plot:

Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others.

##                          names.number_of_items_top. number_of_items_top
## Good Will Hunting (1997)   Good Will Hunting (1997)                  65
## Titanic (1997)                       Titanic (1997)                  56
## L.A. Confidential (1997)   L.A. Confidential (1997)                  53
## Godfather, The (1972)         Godfather, The (1972)                  37

IBCF Seems to be more accurate in recommending the Movies.

The computational cost for UBCF in the worst case is O(NM) because it requires examining N customers and up to M items for each customer. However, due to the sparsity of the user-item matrix, the actual computational cost would be close to O(N + M) because for most customers, they only rated a small number of movies which results in a computational cost of O(N), and for a handful customers who rated a significant amount of movies, the computational cost is close to O(M).

By comparing the computational cost of these two methods, it seems that IBCF requires more expensive computational cost, but building the item-item similarity matrix are calculated offline and the online prediction needs very little computational cost; However, for UBCF, there is no offline calculation, so all of the computational cost is online, which turns out that the predictions are literally very slow even for middle-size datasets due to the heavy online computational cost.