In this project, I develop a collaborative filtering recommender (CFR) system for recommending movies.
The collaborative filtering approach considers only user preferences and does not take into account the features or contents of the items (books or movies) being recommended. In order to recommend movies, I will therefore use a large set of user preferences from a publicly available movie rating dataset.
Once the CFR is tuned, I will develop a Shiny app that implements the trained model.
The dataset comes from MovieLens and is publicly available at http://grouplens.org/datasets/movielens/latest. It contains 105339 ratings and 6138 tag applications across 10329 movies, created by 668 users. The data are contained in four files: links.csv, movies.csv, ratings.csv and tags.csv. I only use movies.csv and ratings.csv to build the recommendation system.
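The original report does not show the package loading step; based on the functions used below (tstrsplit, dcast, qplot, and the realRatingMatrix class), something like the following set of packages is assumed:
library(recommenderlab) # realRatingMatrix, Recommender, evaluationScheme, similarity, ...
library(ggplot2)        # qplot, ggplot
library(data.table)     # tstrsplit
library(reshape2)       # dcast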
movies <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/final/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/sortega7878/DATA612/master/final/ratings.csv")
A summary of movies is given below:
summary(movies)
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
head(movies)
## movieId title
## 1 1 Toy Story (1995)
## 2 2 Jumanji (1995)
## 3 3 Grumpier Old Men (1995)
## 4 4 Waiting to Exhale (1995)
## 5 5 Father of the Bride Part II (1995)
## 6 6 Heat (1995)
## genres
## 1 Adventure|Animation|Children|Comedy|Fantasy
## 2 Adventure|Children|Fantasy
## 3 Comedy|Romance
## 4 Comedy|Drama|Romance
## 5 Comedy
## 6 Action|Crime|Thriller
A summary of ratings:
summary(ratings)
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
head(ratings)
## userId movieId rating timestamp
## 1 1 16 4.0 1217897793
## 2 1 24 1.5 1217895807
## 3 1 32 4.0 1217896246
## 4 1 47 4.0 1217896556
## 5 1 50 4.0 1217896523
## 6 1 110 4.0 1217896150
Both userId and movieId are stored as integers and should be changed to factors. The movie genres are not easily usable in their pipe-delimited format.
Some pre-processing of the available data is therefore required before creating the recommendation system.
First, I will reorganize the movie genre information into a long format so that users can later search for the movies they like within specific genres.
I use a one-hot encoding to create a matrix of corresponding genres for each movie.
genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]',
type.convert=TRUE),
stringsAsFactors=FALSE)
colnames(genres2) <- c(1:10)
genre_list <- c("Action", "Adventure", "Animation", "Children",
"Comedy", "Crime","Documentary", "Drama", "Fantasy",
"Film-Noir", "Horror", "Musical", "Mystery","Romance",
"Sci-Fi", "Thriller", "War", "Western") # we have 18 genres in total
genre_matrix <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
genre_matrix[1,] <- genre_list #set first row to genre list
colnames(genre_matrix) <- genre_list #set column names to genre list
#iterate through matrix
for (i in 1:nrow(genres2)) {
for (c in 1:ncol(genres2)) {
genmat_col = which(genre_matrix[1,] == genres2[i,c])
genre_matrix[i+1,genmat_col] <- 1
}
}
#convert into dataframe
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(genre_matrix2)) {
genre_matrix2[,c] <- as.integer(genre_matrix2[,c]) #convert from characters to integers
}
head(genre_matrix2)
## Action Adventure Animation Children Comedy Crime Documentary Drama
## 1 0 1 1 1 1 0 0 0
## 2 0 1 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0 0
## Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
## Western
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Now, I create a search matrix which allows an easy search of a movie by any of its genres. Here the data really begin to grow in size and sparsity.
search_matrix <- cbind(movies[,1:2], genre_matrix2)
head(search_matrix)
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
We can see that each movie can belong to one or more genres.
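For example, a quick genre lookup might look like this (an illustrative sketch, not part of the original analysis):
# Illustrative: titles of the first few Comedy movies in the search matrix
head(search_matrix[search_matrix$Comedy == 1, "title"])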
In order to use the ratings data for building a recommendation engine with recommenderlab, I convert the rating matrix into a sparse matrix of type realRatingMatrix.
#Create ratings matrix. Rows = userId, Columns = movieId
ratingmat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
ratingmat <- as.matrix(ratingmat[,-1]) #remove userIds
#Convert rating matrix into a recommenderlab sparse matrix
ratingmat <- as(ratingmat, "realRatingMatrix")
ratingmat
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
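As a rough sparsity check (a sketch; the exact figure is not reported in the original), the share of empty user-movie cells can be computed as:
# Illustrative: fraction of user-movie cells with no rating
1 - nratings(ratingmat) / (nrow(ratingmat) * ncol(ratingmat))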
The recommenderlab package contains some options for the recommendation algorithm:
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix" "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [5] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [7] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [9] "UBCF_realRatingMatrix"
lapply(recommender_models, "[[", "description")
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
I will use the IBCF and UBCF models and compare their performance. Let's first check the default parameters of these two models.
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
Collaborative filtering algorithms are based on measuring the similarity between users or between items. For this purpose, recommenderlab contains the similarity function. The supported methods to compute similarities are cosine, pearson, and jaccard.
Next, I determine how similar the first four users are to each other by creating and visualizing a similarity matrix based on the cosine method:
similarity_users <- similarity(ratingmat[1:4, ],
method = "cosine",
which = "users")
as.matrix(similarity_users)
## 1 2 3 4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000
image(as.matrix(similarity_users), main = "User similarity")
In the given matrix, each row and each column corresponds to a user, and each off-diagonal cell gives the cosine similarity between two users; all four users appear quite similar to one another (values above 0.93). The diagonal is shown as 0 only because as.matrix() does not fill in the self-similarities, which is why it stands out in the heatmap.
Using the same approach, I compute similarity between the first four movies.
similarity_items <- similarity(ratingmat[, 1:4], method =
"cosine", which = "items")
as.matrix(similarity_items)
## 1 2 3 4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000
image(as.matrix(similarity_items), main = "Movies similarity")
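For comparison, the same similarities could also be computed with the Pearson method, which recommenderlab supports as well (a sketch, not run in the original):
# Illustrative: user similarity using Pearson correlation instead of cosine
similarity_users_pearson <- similarity(ratingmat[1:4, ],
                                       method = "pearson",
                                       which = "users")
as.matrix(similarity_users_pearson)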
Now, I explore values of ratings.
vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
table_ratings <- table(vector_ratings) # what is the count of each rating value
table_ratings
## vector_ratings
## 0 0.5 1 1.5 2 2.5 3 3.5 4
## 6791761 1198 3258 1567 7943 5484 21729 12237 28880
## 4.5 5
## 8187 14856
There are 11 unique values: the half-point scale from 0.5 to 5, plus 0.
According to the documentation, a rating equal to 0 represents a missing value, so I remove them from the dataset before visualizing the results.
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)
qplot(vector_ratings) +
ggtitle("Distribution of the ratings")
As we can see, low scores (below 3) are relatively rare; the majority of ratings are 3 or higher, and the most common rating is 4.
Now, let’s see what are the most viewed movies.
views_per_movie <- colCounts(ratingmat) # count views for each movie
table_views <- data.frame(movie = names(views_per_movie),
views = views_per_movie) # create dataframe of views
table_views <- table_views[order(table_views$views,
decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
for (i in 1:10325){
table_views[i,3] <- as.character(subset(movies,
movies$movieId == table_views[i,1])$title)
}
table_views[1:6,]
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:6, ], aes(x = title, y = views)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of views of the top movies")
We see that “Pulp Fiction (1994)” is the most viewed movie, exceeding the second-most-viewed “Forrest Gump (1994)” by 14 views.
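As a side note, the row-by-row subset() loop used above to attach titles is slow for a catalog of this size; a vectorized lookup (a sketch that produces the same column) would be more idiomatic:
# Illustrative alternative: vectorized title lookup instead of the for loop
table_views$title <- movies$title[match(as.character(table_views$movie),
                                        as.character(movies$movieId))]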
Now I identify the top-rated movies by computing the average rating of each of them.
average_ratings <- colMeans(ratingmat)
qplot(average_ratings) +
stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average movie rating")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
average_ratings_relevant <- average_ratings[views_per_movie > 50]
qplot(average_ratings_relevant) +
stat_bin(binwidth = 0.1) +
ggtitle(paste("Distribution of the relevant average ratings"))
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The first image above shows the distribution of the average movie rating. The highest value is around 3, and there are a few movies whose rating is either 1 or 5. Probably, the reason is that these movies received a rating from a few people only, so we shouldn’t take them into account.
I remove the movies whose number of views is below a defined threshold of 50, creating a subset of only relevant movies. The second image above shows the distribution of these relevant average ratings. All the ratings now lie between 2.16 and 4.45; as expected, the extremes were removed, and the peak of the distribution shifts to around 4.
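As an illustration (not shown in the original run), the titles behind the right tail of the second distribution could be listed directly:
# Illustrative: top-5 average ratings among movies with more than 50 views
top_avg <- sort(average_ratings_relevant, decreasing = TRUE)[1:5]
movies$title[match(names(top_avg), as.character(movies$movieId))]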
I visualize the whole matrix of ratings by building a heat map whose colors represent the ratings. Each row of the matrix corresponds to a user, each column to a movie, and each cell to its rating.
image(ratingmat, main = "Heatmap of the rating matrix") # hard to read-too many dimensions
image(ratingmat[1:20, 1:25], main = "Heatmap of the first 20 rows and 25 columns")
Since there are too many users and items, the first chart is hard to read. The second chart is built zooming in on the first rows and columns.
Some users have seen more movies than others. So, instead of displaying some random users and items, I select only the most relevant ones: the users who have seen many movies and the movies that have been seen by many users. To identify and select them, I follow these steps:
min_n_movies <- quantile(rowCounts(ratingmat), 0.99)
min_n_users <- quantile(colCounts(ratingmat), 0.99)
print("Minimum number of movies per user:")
## [1] "Minimum number of movies per user:"
min_n_movies
## 99%
## 1198.17
print("Minimum number of users per movie:")
## [1] "Minimum number of users per movie:"
min_n_users
## 99%
## 115
image(ratingmat[rowCounts(ratingmat) > min_n_movies,
colCounts(ratingmat) > min_n_users],
main = "Heatmap of the top users and movies")
Let's focus on the users who have watched the most movies. Most of them have seen all the top movies, which is not surprising. Some columns of the heatmap are darker than others, meaning that these columns represent the highest-rated movies. Conversely, darker rows represent users giving higher ratings. Because of this, it might be useful to normalize the data, which I will do in the following steps.
The data preparation process consists of three steps: selecting the relevant data, normalizing the data, and binarizing the data.
In order to select the most relevant data, I define the minimum number of users per rated movie as 50 and the minimum views number per movie as 50:
ratings_movies <- ratingmat[rowCounts(ratingmat) > 50,
colCounts(ratingmat) > 50]
ratings_movies
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
#ratingmat
This selection of the most relevant data contains 420 users and 447 movies, compared to the 668 users and 10325 movies in the full dataset.
Using the same approach as previously, I visualize the top 2 percent of users and movies in the new matrix of the most relevant data:
min_movies <- quantile(rowCounts(ratings_movies), 0.98)
min_users <- quantile(colCounts(ratings_movies), 0.98)
image(ratings_movies[rowCounts(ratings_movies) > min_movies,
colCounts(ratings_movies) > min_users],
main = "Heatmap of the top users and movies")
average_ratings_per_user <- rowMeans(ratings_movies)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average rating per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
In the heatmap, some rows are darker than the others. This might mean that some users give higher ratings to all the movies. The distribution of the average rating per user across all the users varies a lot, as the second chart above shows.
Having users who give high (or low) ratings to all their movies might bias the results. In order to remove this effect, I normalize the data in such a way that the average rating of each user is 0. As a quick check, I calculate the average rating by users, and it is equal to 0, as expected:
ratings_movies_norm <- normalize(ratings_movies)
sum(rowMeans(ratings_movies_norm) > 0.00001)
## [1] 0
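The check above only looks for positive deviations from zero; a symmetric version of the check (a small sketch) would be:
# Illustrative: also catch negative deviations from a zero row mean
sum(abs(rowMeans(ratings_movies_norm)) > 0.00001)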
Now, I visualize the normalized matrix for the top users and movies. The colors now span negative and positive values because the data are continuous and centered on 0:
image(ratings_movies_norm[rowCounts(ratings_movies_norm) > min_movies,
colCounts(ratings_movies_norm) > min_users],
main = "Heatmap of the top users and movies")
There are still some lines that seem to be more blue or more red. The reason is that I am visualizing only the top movies. I have already checked that the average rating is 0 for each user.
Some recommendation models work on binary data, so it might be useful to binarize the data, that is, define a table containing only 0s and 1s. The 0s will be either treated as missing values or as bad ratings.
In our case, I can either define a matrix with 1 if the user rated the movie and 0 otherwise, or define a matrix with 1 if the rating is at least 3 and 0 otherwise.
Depending on the context, one choice may be more appropriate than the other.
As a next step, I define two matrices following these two approaches and visualize the top 5 percent of users and movies of each binarized matrix.
ratings_movies_watched <- binarize(ratings_movies, minRating = 1)
min_movies_binary <- quantile(rowCounts(ratings_movies), 0.95)
min_users_binary <- quantile(colCounts(ratings_movies), 0.95)
image(ratings_movies_watched[rowCounts(ratings_movies) > min_movies_binary,
colCounts(ratings_movies) > min_users_binary],
main = "Heatmap of the top users and movies")
ratings_movies_good <- binarize(ratings_movies, minRating = 3)
image(ratings_movies_good[rowCounts(ratings_movies) > min_movies_binary,
colCounts(ratings_movies) > min_users_binary],
main = "Heatmap of the top users and movies")
There are more white cells in the second heatmap: a movie a user watched but rated below 3 is white there, whereas in the first heatmap only unwatched movies are white.
Collaborative filtering is a branch of recommendation systems that takes into account information about many users.
The starting point is a rating matrix in which rows correspond to users and columns correspond to items. The core IBCF algorithm is based on these steps: for every pair of items, measure how similar they are in terms of having received similar ratings from similar users; for each item, identify its k most similar items; and, for each user, identify the items most similar to the ones the user has rated.
I build the model using 80% of the whole dataset as a training set and the remaining 20% as a test set.
which_train <- sample(x = c(TRUE, FALSE),
size = nrow(ratings_movies),
replace = TRUE,
prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
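Note that this split is random; for a reproducible split one would fix the random seed before calling sample() (not done in the original run), e.g.:
# Illustrative: fixing the seed (any value works) before the sample() call above makes the split reproducible
set.seed(123)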
Let's have a look at the default parameters of the IBCF model. Here, k is the number of most similar items to keep for each item: after computing the similarities, the algorithm identifies, for each item, its k most similar items and stores only those. method is the similarity function, which is Cosine by default; Pearson is also available. I create the model using the default parameters method = Cosine and k = 30.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
recc_model <- Recommender(data = recc_data_train,
method = "IBCF",
parameter = list(k = 30))
recc_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 334 users.
class(recc_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
Exploring the recommender model:
model_details <- getModel(recc_model)
class(model_details$sim) # this contains a similarity matrix
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
dim(model_details$sim)
## [1] 447 447
n_items_top <- 20
image(model_details$sim[1:n_items_top, 1:n_items_top],
main = "Heatmap of the first rows and columns")
row_sums <- rowSums(model_details$sim > 0)
table(row_sums)
## row_sums
## 30
## 447
col_sums <- colSums(model_details$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of the column count")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The model stores its similarity matrix as a sparse dgCMatrix. Its dimensions are 447 x 447, equal to the number of items. The heatmap of the first 20 items shows that many values are equal to 0; the reason is that each row contains only the k (30) elements that are greater than 0. The number of non-null elements in each column depends on how many times the corresponding movie was included in the top k of another movie, so the matrix is not necessarily symmetric, and indeed it is not symmetric in our model.
The chart of the distribution of the number of elements by column shows there are a few movies that are similar to many others.
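As an illustration (a sketch, not part of the original output), the similarity matrix can also be queried directly, for instance to list the items most similar to the first movie in the model:
# Illustrative: the 5 items most similar to the first item in the IBCF model
sim_first <- model_details$sim[1, ]
head(sort(sim_first, decreasing = TRUE), 5)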
Now, it is possible to recommend movies to the users in the test set. I set n_recommended to 10, the number of movies to recommend to each user.
For each user, the algorithm extracts the movies that user has rated. For each of these movies, it looks up the similar items in the similarity matrix. It then ranks each candidate item by using the user's ratings of the related movies as weights and taking the weighted sum of the similarities.
Finally, the algorithm identifies the top 10 recommendations:
n_recommended <- 10 # the number of items to recommend to each user
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s explore the results of the recommendations for the first user:
recc_user_1 <- recc_predicted@items[[1]] # recommendation for the first user
movies_user_1 <- recc_predicted@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
movies_user_2[i] <- as.character(subset(movies,
movies$movieId == movies_user_1[i])$title)
}
movies_user_2
## [1] "Toy Story (1995)"
## [2] "Grumpier Old Men (1995)"
## [3] "Get Shorty (1995)"
## [4] "Seven (a.k.a. Se7en) (1995)"
## [5] "Crimson Tide (1995)"
## [6] "French Kiss (1995)"
## [7] "Little Women (1994)"
## [8] "Natural Born Killers (1994)"
## [9] "Clear and Present Danger (1994)"
## [10] "Four Weddings and a Funeral (1994)"
It’s also possible to define a matrix with the recommendations for each user. I visualize the recommendations for the first four users:
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) }) # matrix with the recommendations for each user
#dim(recc_matrix)
recc_matrix[,1:4]
## [,1] [,2] [,3] [,4]
## [1,] 1 54286 47 150
## [2,] 3 2542 70 163
## [3,] 21 2424 145 552
## [4,] 47 8665 165 923
## [5,] 161 3471 288 1036
## [6,] 236 1275 329 1089
## [7,] 261 3703 745 1221
## [8,] 288 784 919 1263
## [9,] 349 3421 1047 1544
## [10,] 357 594 1204 1721
Here, the columns represent the first 4 users, and the rows contain the movieId values of the 10 recommended movies.
Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
## Movie title No of items
## 19 Ace Ventura: When Nature Calls (1995) 13
## 5 Father of the Bride Part II (1995) 12
## 3 Grumpier Old Men (1995) 11
## 7 Sabrina (1995) 11
Most of the movies have been recommended only a few times, and a few movies have been recommended more than 5 times.
IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it’s built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data.
In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
Now, I will use the user-based approach. According to this approach, given a new user, the most similar users are identified first; then, the top-rated items among those rated by the similar users are recommended.
For each new user, the steps are: measure how similar each existing user is to the new user; identify the most similar users, either as the nn nearest neighbors or as all users above a similarity threshold; rate the movies rated by these similar users, for example by averaging (or weight-averaging) their ratings; and pick the top-rated movies as recommendations.
Again, let's first check the default parameters of the UBCF model. Here, nn is the number of similar users (nearest neighbors) to consider, and method is the similarity function, which is cosine by default. I build a recommender model leaving the parameters at their defaults and using the training set.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
recc_model <- Recommender(data = recc_data_train, method = "UBCF")
recc_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 334 users.
model_details <- getModel(recc_model)
#names(model_details)
model_details$data
## 334 x 447 rating matrix of class 'realRatingMatrix' with 30549 ratings.
## Normalized using center on rows.
In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.
n_recommended <- 10
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
## Recommendations as 'topNList' with n = 10 for 86 users.
Let’s take a look at the first four users:
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) })
#dim(recc_matrix)
recc_matrix[, 1:4]
## [,1] [,2] [,3] [,4]
## [1,] 356 318 296 356
## [2,] 589 50 356 457
## [3,] 62 1183 318 587
## [4,] 590 141 260 11
## [5,] 50 2268 50 316
## [6,] 377 265 593 1230
## [7,] 5010 1704 47 1641
## [8,] 3578 497 34 6
## [9,] 5418 17 457 480
## [10,] 2571 1193 1193 141
The above matrix contains the movieId of each recommended movie (rows) for the first four users (columns) in our test set.
I also calculate how many times each movie got recommended and build the related frequency histogram:
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)
Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others. The maximum is more than 30, compared to 10-ish for IBCF.
Let’s take a look at the top titles:
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)), number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
## Movie title No of items
## 318 Shawshank Redemption, The (1994) 33
## 858 Godfather, The (1972) 31
## 2571 Matrix, The (1999) 28
## 50 Usual Suspects, The (1995) 25
Comparing the results of UBCF and IBCF gives some useful insight into the two algorithms. UBCF needs to access the initial data at prediction time: since it keeps the entire rating database in memory, it does not work well with a big rating matrix, and computing the similarities to every user requires a lot of computing power and time.
However, UBCF tends to be slightly more accurate than IBCF (as the evaluation in the next section also suggests), so it's a good option if the dataset is not too big.
There are a few options to choose from when building a recommendation engine. In order to compare their performance and choose the most appropriate model, I follow these steps: prepare the data for evaluation, evaluate the performance of several models, choose the best-performing model, and optimize its parameters.
We need training and test data to evaluate the models. There are several ways to create them: 1) splitting the data into training and test sets, 2) bootstrapping, and 3) k-fold cross-validation.
Splitting the data into training and test sets is often done using an 80/20 proportion.
percentage_training <- 0.8
For each user in the test set, we need to define how many items to use to generate recommendations. For this, I first check the minimum number of items rated by users to be sure there will be no users with no items to test.
min(rowCounts(ratings_movies))
## [1] 8
items_to_keep <- 5 #number of items to generate recommendations
rating_threshold <- 3 # threshold with the minimum rating that is considered good
n_eval <- 1 #number of times to run evaluation
eval_sets <- evaluationScheme(data = ratings_movies,
method = "split",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
eval_sets
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
getData(eval_sets, "train") # training set
## 336 x 447 rating matrix of class 'realRatingMatrix' with 30858 ratings.
getData(eval_sets, "known") # set with the items used to build the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 420 ratings.
getData(eval_sets, "unknown") # set with the items used to test the recommendations
## 84 x 447 rating matrix of class 'realRatingMatrix' with 7063 ratings.
qplot(rowCounts(getData(eval_sets, "unknown"))) +
geom_histogram(binwidth = 10) +
ggtitle("unknown items by the users")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above image displays the number of unknown items per user, which varies a lot.
Bootstrapping is another approach to splitting the data: the same user can be sampled more than once, so if the training set is the same size as before, there will be more users in the test set.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "bootstrap",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) +
ggtitle("Number of repetitions in the training set")
The above chart shows that most of the users have been sampled fewer than four times.
The k-fold cross-validation approach is the most accurate one, although it’s computationally heavier.
Using this approach, we split the data into some chunks, take a chunk out as the test set, and evaluate the accuracy. Then, we can do the same with each other chunk and compute the average accuracy.
n_fold <- 4
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets
## [1] 315 315 315 315
Using the 4-fold approach, we get four training sets of the same size, 315 users each.
I use the k-fold approach for evaluation.
First, I re-define the evaluation sets, build the IBCF model, and create a matrix with predicted ratings.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data = getData(eval_sets, "train"),
method = model_to_evaluate,
parameter = model_parameters)
items_to_recommend <- 10
eval_prediction <- predict(object = eval_recommender,
newdata = getData(eval_sets, "known"),
n = items_to_recommend,
type = "ratings")
qplot(rowCounts(eval_prediction)) +
geom_histogram(binwidth = 10) +
ggtitle("Distribution of movies per user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The above image displays the distribution of movies per user in the matrix of predicted ratings.
Now, I compute the accuracy measures for each user. Most of the RMSEs (Root mean square errors) are in the range of 0.5 to 1.8:
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = TRUE)
head(eval_accuracy)
## RMSE MSE MAE
## [1,] 1.0000000 1.000000 0.500000
## [2,] 1.0000000 1.000000 1.000000
## [3,] 0.7071068 0.500000 0.400000
## [4,] 0.3872983 0.150000 0.300000
## [5,] 1.5813559 2.500687 1.265576
## [6,] 0.5994789 0.359375 0.406250
qplot(eval_accuracy[, "RMSE"]) +
geom_histogram(binwidth = 0.1) +
ggtitle("Distribution of the RMSE by user")
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 5 rows containing non-finite values (stat_bin).
## Warning: Removed 5 rows containing non-finite values (stat_bin).
In order to have a performance index for the whole model, I specify byUser as FALSE and compute the average indices:
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
data = getData(eval_sets, "unknown"),
byUser = FALSE)
eval_accuracy
## RMSE MSE MAE
## 1.3017464 1.6945436 0.9429728
The measures of accuracy are useful to compare the performance of different models on the same data.
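For a direct comparison, the same accuracy computation could be repeated with a UBCF recommender trained on the same evaluation scheme (a sketch; the resulting numbers will vary by run and are not part of the original output):
# Illustrative: overall RMSE/MSE/MAE for UBCF on the same evaluation sets
ubcf_recommender <- Recommender(data = getData(eval_sets, "train"),
                                method = "UBCF")
ubcf_prediction <- predict(object = ubcf_recommender,
                           newdata = getData(eval_sets, "known"),
                           n = items_to_recommend,
                           type = "ratings")
calcPredictionAccuracy(x = ubcf_prediction,
                       data = getData(eval_sets, "unknown"),
                       byUser = FALSE)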
Another way to measure accuracy is to compare the recommendations with the items that actually received a positive rating. For this, I use the prebuilt evaluate() function of the recommenderlab library. The function evaluates the recommender's performance for different numbers n of items recommended to each user; I use the sequence n = seq(10, 100, 10). The first rows of the resulting performance matrix are presented below:
results <- evaluate(x = eval_sets,
method = model_to_evaluate,
n = seq(10, 100, 10))
## IBCF run fold/sample [model time/prediction time]
## 1 [0.64sec/0.03sec]
## 2 [0.56sec/0.04sec]
## 3 [0.67sec/0.03sec]
## 4 [1.18sec/0.04sec]
head(getConfusionMatrix(results)[[1]])
## TP FP FN TN precision recall TPR
## 10 1.628571 8.142857 71.91429 360.3143 0.1679612 0.02247695 0.02247695
## 20 3.352381 16.133333 70.19048 352.3238 0.1742718 0.04660752 0.04660752
## 30 4.752381 24.219048 68.79048 344.2381 0.1671920 0.06631919 0.06631919
## 40 6.057143 31.933333 67.48571 336.5238 0.1629849 0.08352235 0.08352235
## 50 7.285714 39.219048 66.25714 329.2381 0.1598182 0.09892825 0.09892825
## 60 8.657143 45.685714 64.88571 322.7714 0.1620503 0.11759670 0.11759670
## FPR
## 10 0.02215015
## 20 0.04382082
## 30 0.06584012
## 40 0.08687656
## 50 0.10690669
## 60 0.12465163
In order to look at all the folds at the same time, I sum up the TP, FP, FN and TN columns of the confusion matrices:
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
head(indices_summed)
## TP FP FN TN
## 10 6.380952 32.88571 295.8667 1432.867
## 20 12.971429 65.27619 289.2762 1400.476
## 30 18.866667 97.49524 283.3810 1368.257
## 40 24.895238 128.52381 277.3524 1337.229
## 50 30.295238 158.63810 271.9524 1307.114
## 60 35.542857 186.17143 266.7048 1279.581
Finally, I plot the ROC and the precision/recall curves:
plot(results, annotate = TRUE, main = "ROC curve")
plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")
As the number of recommended movies grows, recall increases (a larger share of the movies the user liked is recovered), while precision slowly decreases.
In order to compare different models, I define them in the following list:
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF",
param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF",
param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF",
param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF",
param = list(method = "pearson")),
random = list(name = "RANDOM", param=NULL)
)
Then, I define a set of values for the number of recommended movies (n_recommendations <- c(1, 5, seq(10, 100, 10))), and run and evaluate the models:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.75sec/0.03sec]
## 2 [0.68sec/0.04sec]
## 3 [0.63sec/0.03sec]
## 4 [0.63sec/0.05sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.8sec/0.05sec]
## 2 [0.8sec/0.05sec]
## 3 [0.81sec/0.03sec]
## 4 [0.71sec/0.06sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.22sec]
## 2 [0.02sec/0.2sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.21sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.22sec]
## 2 [0.01sec/0.21sec]
## 3 [0sec/0.2sec]
## 4 [0sec/0.21sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.06sec]
## 2 [0sec/0.08sec]
## 3 [0.02sec/0.08sec]
## 4 [0sec/0.07sec]
sapply(list_results, class) == "evaluationResults"
## IBCF_cos IBCF_cor UBCF_cos UBCF_cor random
## TRUE TRUE TRUE TRUE TRUE
As an example, the following table presents the first rows of the averaged performance matrix for IBCF with cosine distance:
avg_matrices <- lapply(list_results, avg)
head(avg_matrices$IBCF_cos[, 5:8])
## precision recall TPR FPR
## 1 0.1767177 0.002480753 0.002480753 0.002253165
## 5 0.1704724 0.010838140 0.010838140 0.011158374
## 10 0.1628164 0.020968060 0.020968060 0.022549614
## 20 0.1663171 0.041658531 0.041658531 0.044589703
## 30 0.1635622 0.060454829 0.060454829 0.066570514
## 40 0.1642683 0.080014074 0.080014074 0.087718032
I compare the models by building a chart displaying their ROC curves and Precision/recall curves.
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
A good performance index is the area under the ROC curve (AUC). Even without computing it explicitly, the chart shows that UBCF with cosine distance has the largest AUC, so it is the best-performing technique.
UBCF with cosine distance is also the top model on the precision/recall chart. Depending on the main purpose of the system, an appropriate number of items to recommend should then be chosen.
IBCF takes account of the k-closest items. I will explore more values, ranging between 5 and 40, in order to tune this parameter:
vector_k <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF",
param = list(method = "cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
Now I build and evaluate the same IBCF/cosine models with different values of the k-closest items:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.77sec/0.03sec]
## 2 [0.67sec/0.03sec]
## 3 [0.71sec/0.03sec]
## 4 [0.59sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.69sec/0.04sec]
## 2 [0.64sec/0.03sec]
## 3 [0.67sec/0.04sec]
## 4 [0.53sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.63sec/0.05sec]
## 2 [0.61sec/0.03sec]
## 3 [0.72sec/0.03sec]
## 4 [0.67sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.66sec/0.03sec]
## 2 [0.56sec/0.04sec]
## 3 [0.78sec/0.05sec]
## 4 [0.67sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.69sec/0.05sec]
## 2 [0.68sec/0.04sec]
## 3 [0.64sec/0.05sec]
## 4 [0.58sec/0.03sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
Based on the ROC curve’s plot, the k having the biggest AUC is 10. Another good candidate is 5, but it can never have a high TPR. This means that, even if we set a very high n value, the algorithm won’t be able to recommend a big percentage of items that the user liked. The IBCF with k = 5 recommends only a few items similar to the purchases. Therefore, it can’t be used to recommend many items.
Based on the precision/recall plot, k should be set to 10 to achieve the highest recall. If we are more interested in the precision, we set k to 5.
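If IBCF were the chosen model, the tuned value of k could then be used to refit it on the full set of relevant ratings (an illustrative sketch, not part of the original analysis):
# Illustrative: refit IBCF with the tuned neighborhood size
ibcf_tuned <- Recommender(data = ratings_movies,
                          method = "IBCF",
                          parameter = list(method = "cosine", k = 10))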
To see the implementation of the model visit the following URL:
[Recommovie app](https://sortega78.shinyapps.io/final/)
In this project, I have developed and evaluated a collaborative filtering recommender (CFR) system for recommending movies. The online app was created to demonstrate the user-based collaborative filtering recommendation model.
Let’s discuss the strengths and weaknesses of the User-based Collaborative Filtering approach in general.
Strengths: User-based Collaborative Filtering gives recommendations that can complement the item the user was interacting with. These may be stronger recommendations than an item-based recommender can provide, since users might not be looking for direct substitutes for a movie they have just viewed or previously watched.
Weaknesses: User-based Collaborative Filtering is a type of memory-based collaborative filtering that uses all user data in the database to create recommendations. Computing the pairwise correlation of every user in the dataset is not scalable; with millions of users, this computation would be very time-consuming. Possible ways around this are to apply some form of dimensionality reduction, such as Principal Component Analysis, or to use a model-based algorithm instead. Also, user-based collaborative filtering relies on past user choices to make future recommendations. The implication is that it assumes a user's tastes and preferences remain more or less constant over time, which might not be true, and this also makes it difficult to pre-compute user similarities offline.
The point of being so prescriptive in this final project was to demonstrate that, despite all the ready-made routines, there are a number of parameters to take care of and evaluate; even when you think your model is optimal, tuning remains a constant need.