# Project 2

### MovieLense Recommendation System in R

Collaborative filtering is a technique used by recommender systems for predicting the interests of one user based on the preference information of other users. This project is an implementation of a Movie Recommender System that uses the following techniques:

• Item-Item Collaborative Filtering (IBCF)
• User-User Collaborative Filtering (UBCF)

I have chosen to use the Movielense data that is available with Recommederlab

library(recommenderlab)

data(MovieLense)
MovieLense
## 943 x 1664 rating matrix of class 'realRatingMatrix' with 99392 ratings.

#### Data Exploration

Each row of MovieLense corresponds to a user, and each column corresponds to a movie. There are more than 1569152 combinations between a user and a movie.

Let’s explore in detail.

class(MovieLense)
## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"

#### Look at the first few ratings of the first user

head(as(MovieLense[1,], "list")[[1]])
##                                     Toy Story (1995)
##                                                    5
##                                     GoldenEye (1995)
##                                                    3
##                                    Four Rooms (1995)
##                                                    4
##                                    Get Shorty (1995)
##                                                    3
##                                       Copycat (1995)
##                                                    3
## Shanghai Triad (Yao a yao yao dao waipo qiao) (1995)
##                                                    5

#### Visualize first 100 Users and 100 Movies matrix

image(MovieLense[1:100,1:100])

#### Number of ratings per user

hist(rowCounts(MovieLense))

#### Number of ratings per movie

hist(colCounts(MovieLense))

#### Mean rating (averaged over users)

mean(rowMeans(MovieLense))
## [1] 3.587565

#### Available movie meta information

head(MovieLenseMeta)
##                                                  title year
## 1                                     Toy Story (1995) 1995
## 2                                     GoldenEye (1995) 1995
## 3                                    Four Rooms (1995) 1995
## 4                                    Get Shorty (1995) 1995
## 5                                       Copycat (1995) 1995
## 6 Shanghai Triad (Yao a yao yao dao waipo qiao) (1995) 1995
##                                                            url unknown Action
## 1        http://us.imdb.com/M/title-exact?Toy%20Story%20(1995)       0      0
## 2          http://us.imdb.com/M/title-exact?GoldenEye%20(1995)       0      1
## 3       http://us.imdb.com/M/title-exact?Four%20Rooms%20(1995)       0      0
## 4       http://us.imdb.com/M/title-exact?Get%20Shorty%20(1995)       0      1
## 5            http://us.imdb.com/M/title-exact?Copycat%20(1995)       0      0
## 6 http://us.imdb.com/Title?Yao+a+yao+yao+dao+waipo+qiao+(1995)       0      0
##   Adventure Animation Children's Comedy Crime Documentary Drama Fantasy
## 1         0         1          1      1     0           0     0       0
## 2         1         0          0      0     0           0     0       0
## 3         0         0          0      0     0           0     0       0
## 4         0         0          0      1     0           0     1       0
## 5         0         0          0      0     1           0     1       0
## 6         0         0          0      0     0           0     1       0
##   Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
## 1         0      0       0       0       0      0        0   0       0
## 2         0      0       0       0       0      0        1   0       0
## 3         0      0       0       0       0      0        1   0       0
## 4         0      0       0       0       0      0        0   0       0
## 5         0      0       0       0       0      0        1   0       0
## 6         0      0       0       0       0      0        0   0       0

Determine how similar the first five users are with each other. Let’s compute this using the cosine distance

similarity_users <- similarity(MovieLense[1:5, ], method = "cosine", which = "users")

Let’s convert similarity_users into a matrix to visualize it.

as.matrix(similarity_users)
##           1         2         3         4         5
## 1 0.0000000 0.9605820 0.8339504 0.9192637 0.9326136
## 2 0.9605820 0.0000000 0.9268716 0.9370341 0.9848027
## 3 0.8339504 0.9268716 0.0000000 0.9130323 1.0000000
## 4 0.9192637 0.9370341 0.9130323 0.0000000 0.9946918
## 5 0.9326136 0.9848027 1.0000000 0.9946918 0.0000000

The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself:

image(as.matrix(similarity_users), main = "User similarity")

Determine how similar the first five Movie Ratings are with each other. Let’s compute this using the cosine distance

similarity_items <- similarity(MovieLense[, 1:5], method = "cosine", which = "items")
as.matrix(similarity_items)
##                   Toy Story (1995) GoldenEye (1995) Four Rooms (1995)
## Toy Story (1995)         0.0000000        0.9487374         0.9132997
## GoldenEye (1995)         0.9487374        0.0000000         0.9088797
## Four Rooms (1995)        0.9132997        0.9088797         0.0000000
## Get Shorty (1995)        0.9429069        0.9394926         0.8991940
## Copycat (1995)           0.9613638        0.9426876         0.9424719
##                   Get Shorty (1995) Copycat (1995)
## Toy Story (1995)          0.9429069      0.9613638
## GoldenEye (1995)          0.9394926      0.9426876
## Four Rooms (1995)         0.8991940      0.9424719
## Get Shorty (1995)         0.0000000      0.8919936
## Copycat (1995)            0.8919936      0.0000000

Similar to the above screenshot, we can visualize the matrix using this image:

image(as.matrix(similarity_items), main = "Item similarity")

Let’s compute this using the pearson distance

similarity_users <- similarity(MovieLense[1:5, ], method = "pearson", which = "users")

Let’s convert similarity_users into a matrix to visualize it.

as.matrix(similarity_users)
##           1         2         3         4         5
## 1 0.0000000 0.5437268 0.5137099 0.6666667 0.6332355
## 2 0.5437268 0.0000000 0.5000000 0.5317799 0.5978454
## 3 0.5137099 0.5000000 0.0000000 0.5000000        NA
## 4 0.6666667 0.5317799 0.5000000 0.0000000 1.0000000
## 5 0.6332355 0.5978454        NA 1.0000000 0.0000000

The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself:

image(as.matrix(similarity_users), main = "User similarity")

Determine how similar the first five Movie Ratings are with each other. Let’s compute this using the pearson distance

similarity_items <- similarity(MovieLense[, 1:5], method = "pearson", which = "items")
as.matrix(similarity_items)
##                   Toy Story (1995) GoldenEye (1995) Four Rooms (1995)
## Toy Story (1995)         0.0000000        0.5623614         0.5481838
## GoldenEye (1995)         0.5623614        0.0000000         0.5651428
## Four Rooms (1995)        0.5481838        0.5651428         0.0000000
## Get Shorty (1995)        0.5271857        0.5696564         0.5000000
## Copycat (1995)           0.6197611        0.5610285         0.5508465
##                   Get Shorty (1995) Copycat (1995)
## Toy Story (1995)          0.5271857      0.6197611
## GoldenEye (1995)          0.5696564      0.5610285
## Four Rooms (1995)         0.5000000      0.5508465
## Get Shorty (1995)         0.0000000      0.5000000
## Copycat (1995)            0.5000000      0.0000000

Similar to the above screenshot, we can visualize the matrix using this image:

image(as.matrix(similarity_items), main = "Item similarity")

When users and items tend to have very differing sets of items, pearson would perform worse. As we can see from the heat map, cosine one seems to show more relations between different users and items.

#### Exploring the values of the rating

vector_ratings <- as.vector(MovieLense@data)
unique(vector_ratings)
## [1] 5 4 0 3 1 2

The ratings are integers in the range 0-5. Let’s count the occurrences of each of them.

table_ratings <- table(vector_ratings)
table_ratings
## vector_ratings
##       0       1       2       3       4       5
## 1469760    6059   11307   27002   33947   21077

According to the documentation, a rating equal to 0 represents a missing value, so we can remove them from vector_ratings. We can also build a frequency plot of the ratings. In order to visualize a bar plot with frequencies, we can use ggplot2. Let’s convert them into categories using factor and build a quick chart:

vector_ratings <- vector_ratings[vector_ratings != 0]
vector_ratings <- factor(vector_ratings)

The following image shows the distribution of the ratings. Most of the ratings are above 2, and the most common is 4.

library(ggplot2)
qplot(vector_ratings) + ggtitle("Distribution of the ratings")

views_per_movie <- colCounts(MovieLense)

table_views <- data.frame(
movie = names(views_per_movie),
views = views_per_movie)

table_views <- table_views[order(table_views$views, decreasing = TRUE), ] head(table_views) ## movie views ## Star Wars (1977) Star Wars (1977) 583 ## Contact (1997) Contact (1997) 509 ## Fargo (1996) Fargo (1996) 508 ## Return of the Jedi (1983) Return of the Jedi (1983) 507 ## Liar Liar (1997) Liar Liar (1997) 485 ## English Patient, The (1996) English Patient, The (1996) 481 Let’s visualize the first six rows and build a histogram: ggplot(table_views[1:6, ], aes(x = movie, y = views)) + geom_bar(stat="identity") + theme(axis.text.x = element_text(angle = 45, hjust = 1)) + ggtitle("Number of views of the top movies") In the above chart, you can notice that Star Wars (1977) is the most viewed #### Explore Average Ratings average_ratings <- colMeans(MovieLense) Let’s visualize by creating a chart. The following image shows the distribution of the average movie rating: qplot(average_ratings) + stat_bin(binwidth = 0.1) + ggtitle("Distribution of the average movie rating") ## stat_bin() using bins = 30. Pick better value with binwidth. The highest value is around 3, and there are a few movies whose rating is either 1 or 5. Probably, the reason is that these movies received a rating from a few people only, so we shouldn’t take them into account. Let’s remove the movies whose number of views is below a defined threshold, for instance, below 100: The following image shows the distribution of the relevant average ratings: average_ratings <- average_ratings[views_per_movie > 100] qplot(average_ratings) + stat_bin(binwidth = 0.1) + ggtitle(paste("Distribution of the relevant average ratings")) ## stat_bin() using bins = 30. Pick better value with binwidth. All the rankings are between 2.3 and 4.5. Let’s build the heatmap. The following image displays the heatmap of the rating first few rows and columns of rating matrix: image(MovieLense[1:10, 1:15], main = "Heatmap of the first rows and columns") Users who have rated at least 50 movies Movies that have been watched at least 100 times ratings_movies <- MovieLense[rowCounts(MovieLense) > 50, colCounts(MovieLense) > 100] ratings_movies ## 560 x 332 rating matrix of class 'realRatingMatrix' with 55298 ratings. #### Training and Test Sets First, we randomly define the which_train vector that is TRUE for users in the training set and FALSE for the others. We will set the probability in the training set as 70 percent: is_train <- sample(x = c(TRUE, FALSE), size = nrow(ratings_movies), replace = TRUE, prob = c(0.7, 0.3)) head(is_train) ## [1] FALSE FALSE TRUE FALSE FALSE TRUE Let’s define the training and the test sets: recc_data_train <- ratings_movies[is_train, ] recc_data_test <- ratings_movies[!is_train, ] #### Recommendation model IBCF Based Model, which stands for item-based collaborative filtering. Below outputs are the parameters. recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k ## [1] 30 ## ##$method
## [1] "Cosine"
##
## $normalize ## [1] "center" ## ##$normalize_sim_matrix
## [1] FALSE
##
## $alpha ## [1] 0.5 ## ##$na_as_zero
## [1] FALSE

### Build Model IBCF

ibcf_model <- Recommender(data = recc_data_train, method = "IBCF", parameter = list(k = 30))
ibcf_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 409 users.
model_details <- getModel(ibcf_model)
n_items_top <- 20
image(model_details$sim[1:n_items_top, 1:n_items_top], main = "Heatmap of the first rows and columns") Test the IBCF Model n_recommended <- 6 recc_predicted <- predict(object = ibcf_model, newdata = recc_data_test, n = n_recommended) recc_predicted ## Recommendations as 'topNList' with n = 6 for 151 users. Build recommendation matrix recc_matrix <- sapply(recc_predicted@items, function(x) { colnames(ratings_movies)[x] }) dim(recc_matrix) ## [1] 6 151 Let’s visualize the recommendations for the first five users: recc_matrix[, 1:5] ## 1 2 ## [1,] "Fried Green Tomatoes (1991)" "Usual Suspects, The (1995)" ## [2,] "Dumbo (1941)" "Braveheart (1995)" ## [3,] "True Lies (1994)" "Batman Forever (1995)" ## [4,] "Ransom (1996)" "Much Ado About Nothing (1993)" ## [5,] "Leaving Las Vegas (1995)" "Cold Comfort Farm (1995)" ## [6,] "Hunchback of Notre Dame, The (1996)" "Twister (1996)" ## 5 6 ## [1,] "Waterworld (1995)" "Dead Poets Society (1989)" ## [2,] "Army of Darkness (1993)" "Devil's Own, The (1997)" ## [3,] "My Best Friend's Wedding (1997)" "Pretty Woman (1990)" ## [4,] "Full Metal Jacket (1987)" "Broken Arrow (1996)" ## [5,] "Cape Fear (1991)" "Primal Fear (1996)" ## [6,] "Outbreak (1995)" "Saint, The (1997)" ## 8 ## [1,] "Natural Born Killers (1994)" ## [2,] "Fargo (1996)" ## [3,] "Clockwork Orange, A (1971)" ## [4,] "Young Guns (1988)" ## [5,] "Air Force One (1997)" ## [6,] "Devil's Advocate, The (1997)" Now, we can identify the most recommended movies. For this purpose, we will define a vector with all the recommendations, and we will build a frequency plot: number_of_items <- factor(table(recc_matrix)) chart_title <- "Distribution of the number of items for IBCF" qplot(number_of_items) + ggtitle(chart_title) Let’s see which are the most popular recommended movies: number_of_items_sorted <- sort(number_of_items, decreasing = TRUE) number_of_items_top <- head(number_of_items_sorted, n = 4) table_top <- data.frame(names(number_of_items_top), number_of_items_top) table_top ## names.number_of_items_top. ## Ace Ventura: Pet Detective (1994) Ace Ventura: Pet Detective (1994) ## Beavis and Butt-head Do America (1996) Beavis and Butt-head Do America (1996) ## Outbreak (1995) Outbreak (1995) ## Kiss the Girls (1997) Kiss the Girls (1997) ## number_of_items_top ## Ace Ventura: Pet Detective (1994) 14 ## Beavis and Butt-head Do America (1996) 13 ## Outbreak (1995) 13 ## Kiss the Girls (1997) 10 Now, Lets build a UBCF Based Model, which stands for user-based collaborative filtering. Below outputs are the parameters. recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method ## [1] "cosine" ## ##$nn
## [1] 25
##
## $sample ## [1] FALSE ## ##$normalize
## [1] "center"

### Build Model UBCF

ubcf_model <- Recommender(data = recc_data_train, method = "UBCF")
ubcf_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 409 users.

Apply model model to test set

Let’s find the top six recommendations for each new user

n_recommended <- 6
recc_predicted <- predict(object = ubcf_model, newdata = recc_data_test, n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 6 for 151 users.

Build recommendation matrix

recc_matrix <- sapply(recc_predicted@items, function(x){  colnames(ratings_movies)[x] })
dim(recc_matrix)
## [1]   6 151

Let’s take a look at the first five users:

recc_matrix[, 1:5]
##      1
## [1,] "L.A. Confidential (1997)"
## [2,] "Trainspotting (1996)"
## [3,] "Close Shave, A (1995)"
## [4,] "Titanic (1997)"
## [5,] "Donnie Brasco (1997)"
## [6,] "Leaving Las Vegas (1995)"
##      2
## [2,] "Shawshank Redemption, The (1994)"
## [3,] "Silence of the Lambs, The (1991)"
## [4,] "Casablanca (1942)"
## [5,] "Schindler's List (1993)"
## [6,] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1963)"
##      5                           6
## [1,] "Titanic (1997)"            "Empire Strikes Back, The (1980)"
## [2,] "L.A. Confidential (1997)"  "Clockwork Orange, A (1971)"
## [3,] "Good Will Hunting (1997)"  "Return of the Jedi (1983)"
## [4,] "Cop Land (1997)"           "Chinatown (1974)"
## [5,] "As Good As It Gets (1997)" "Aliens (1986)"
## [6,] "Conspiracy Theory (1997)"  "Nikita (La Femme Nikita) (1990)"
##      8
## [1,] "Titanic (1997)"
## [2,] "L.A. Confidential (1997)"
## [3,] "Good Will Hunting (1997)"
## [4,] "Full Monty, The (1997)"
## [5,] "Apt Pupil (1998)"
## [6,] "Trainspotting (1996)"

Now, we can identify the most recommended movies. For this purpose, we will define a vector with all the recommendations, and we will build a frequency plot:

number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) +
ggtitle(chart_title)

Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others.

number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(names(number_of_items_top), number_of_items_top)
table_top
##                          names.number_of_items_top. number_of_items_top
## Good Will Hunting (1997)   Good Will Hunting (1997)                  65
## Titanic (1997)                       Titanic (1997)                  56
## L.A. Confidential (1997)   L.A. Confidential (1997)                  53
## Godfather, The (1972)         Godfather, The (1972)                  37

IBCF Seems to be more accurate in recommending the Movies.

The computational cost for UBCF in the worst case is O(NM) because it requires examining N customers and up to M items for each customer. However, due to the sparsity of the user-item matrix, the actual computational cost would be close to O(N + M) because for most customers, they only rated a small number of movies which results in a computational cost of O(N), and for a handful customers who rated a significant amount of movies, the computational cost is close to O(M).

By comparing the computational cost of these two methods, it seems that IBCF requires more expensive computational cost, but building the item-item similarity matrix are calculated offline and the online prediction needs very little computational cost; However, for UBCF, there is no offline calculation, so all of the computational cost is online, which turns out that the predictions are literally very slow even for middle-size datasets due to the heavy online computational cost.