The purpose of this project was to to build a recommender system and produce quality recommendations by extracting insights from a large dataset.In this project, I developed a collaborative filtering recommender system for recommending movies.In order to recommend movies I will use a large set of users preferences towards the movies from a publicly available movie rating dataset.
The data was collected through the MovieLens web site (movielens.umn.edu). This dataset contains 105339 ratings and 6138 tag applications across 10329 movies. These data were created by 668 users.
movies <- read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/ErindaB/Data-612/master/ratings.csv")
summary(movies)
## movieId title genres
## Min. : 1 Length:10329 Length:10329
## 1st Qu.: 3240 Class :character Class :character
## Median : 7088 Mode :character Mode :character
## Mean : 31924
## 3rd Qu.: 59900
## Max. :149532
kable(head(movies))
| movieId | title | genres |
|---|---|---|
| 1 | Toy Story (1995) | Adventure|Animation|Children|Comedy|Fantasy |
| 2 | Jumanji (1995) | Adventure|Children|Fantasy |
| 3 | Grumpier Old Men (1995) | Comedy|Romance |
| 4 | Waiting to Exhale (1995) | Comedy|Drama|Romance |
| 5 | Father of the Bride Part II (1995) | Comedy |
| 6 | Heat (1995) | Action|Crime|Thriller |
summary(ratings)
## userId movieId rating timestamp
## Min. : 1.0 Min. : 1 Min. :0.500 Min. :8.286e+08
## 1st Qu.:192.0 1st Qu.: 1073 1st Qu.:3.000 1st Qu.:9.711e+08
## Median :383.0 Median : 2497 Median :3.500 Median :1.115e+09
## Mean :364.9 Mean : 13381 Mean :3.517 Mean :1.130e+09
## 3rd Qu.:557.0 3rd Qu.: 5991 3rd Qu.:4.000 3rd Qu.:1.275e+09
## Max. :668.0 Max. :149532 Max. :5.000 Max. :1.452e+09
kable(head(ratings))
| userId | movieId | rating | timestamp |
|---|---|---|---|
| 1 | 16 | 4.0 | 1217897793 |
| 1 | 24 | 1.5 | 1217895807 |
| 1 | 32 | 4.0 | 1217896246 |
| 1 | 47 | 4.0 | 1217896556 |
| 1 | 50 | 4.0 | 1217896523 |
| 1 | 110 | 4.0 | 1217896150 |
Let’s select a list of genre and re-organize the movie genres in order to allow future users to search for the movies they like within specific genres.
genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]',
type.convert=TRUE),
stringsAsFactors=FALSE)
colnames(genres2) <- c(1:10)
genre_list <- c("Action", "Adventure", "Animation", "Children",
"Comedy", "Crime","Documentary", "Drama", "Fantasy",
"Film-Noir", "Horror", "Musical", "Mystery","Romance",
"Sci-Fi", "Thriller", "War", "Western")
genre_matrix <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
genre_matrix[1,] <- genre_list #set first row to genre list
colnames(genre_matrix) <- genre_list #set column names to genre list
#iterate through matrix
for (i in 1:nrow(genres2)) {
for (c in 1:ncol(genres2)) {
genmat_col = which(genre_matrix[1,] == genres2[i,c])
genre_matrix[i+1,genmat_col] <- 1
}
}
#convert into dataframe
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(genre_matrix2)) {
genre_matrix2[,c] <- as.integer(genre_matrix2[,c]) #convert from characters to integers
}
head(genre_matrix2)
## Action Adventure Animation Children Comedy Crime Documentary Drama
## 1 0 1 1 1 1 0 0 0
## 2 0 1 0 1 0 0 0 0
## 3 0 0 0 0 1 0 0 0
## 4 0 0 0 0 1 0 0 1
## 5 0 0 0 0 1 0 0 0
## 6 1 0 0 0 0 1 0 0
## Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War
## 1 1 0 0 0 0 0 0 0 0
## 2 1 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 1 0 0 0
## 4 0 0 0 0 0 1 0 0 0
## 5 0 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 1 0
## Western
## 1 0
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
Let’s create a search matrix which allows an easy search of a movie by any of its genre
search_matrix <- cbind(movies[,1:2], genre_matrix2)
head(search_matrix)
## movieId title Action Adventure Animation
## 1 1 Toy Story (1995) 0 1 1
## 2 2 Jumanji (1995) 0 1 0
## 3 3 Grumpier Old Men (1995) 0 0 0
## 4 4 Waiting to Exhale (1995) 0 0 0
## 5 5 Father of the Bride Part II (1995) 0 0 0
## 6 6 Heat (1995) 1 0 0
## Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical
## 1 1 1 0 0 0 1 0 0 0
## 2 1 0 0 0 0 1 0 0 0
## 3 0 1 0 0 0 0 0 0 0
## 4 0 1 0 0 1 0 0 0 0
## 5 0 1 0 0 0 0 0 0 0
## 6 0 0 1 0 0 0 0 0 0
## Mystery Romance Sci-Fi Thriller War Western
## 1 0 0 0 0 0 0
## 2 0 0 0 0 0 0
## 3 0 1 0 0 0 0
## 4 0 1 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 1 0 0
In order to use the ratings data for building a recommendation engine with recommenderlab, I convert rating matrix into a sparse matrix.
#Create ratings matrix. Rows = userId, Columns = movieId
Sparsemat <- dcast(ratings, userId~movieId, value.var = "rating", na.rm=FALSE)
Sparsemat <- as.matrix(Sparsemat[,-1]) #remove userIds
#Convert rating matrix into a recommenderlab sparse matrix
Sparsemat <- as(Sparsemat, "realRatingMatrix")
Sparsemat
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
#Let’s explore which functions exist in recommenderlab that can be useful later.
recommender_models <- recommenderRegistry$get_entries(dataType= "realRatingMatrix")
names(recommender_models)
## [1] "HYBRID_realRatingMatrix" "ALS_realRatingMatrix"
## [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"
## [5] "LIBMF_realRatingMatrix" "POPULAR_realRatingMatrix"
## [7] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
## [9] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
## [11] "UBCF_realRatingMatrix"
Collaborative filtering algorithms are based on measuring the similarity between users or between items. For this purpose, I created the similarity matrix that uses the cosine distance:
similar_users <- similarity(Sparsemat[1:4, ],
method = "cosine",
which = "users")
as.matrix(similar_users)
## 1 2 3 4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000
image(as.matrix(similar_users), main = "User similarity")
Each row and each column corresponds to a user, and each cell corresponds to the similarity between two users. The more red the cell is, the more similar two users are. Note that the diagonal is red, since it’s comparing each user with itself.
# Similarity between the first four movies.
similarity_items <- similarity(Sparsemat[, 1:4], method =
"cosine", which = "items")
as.matrix(similarity_items)
## 1 2 3 4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000
image(as.matrix(similarity_items), main = "Movies similarity")
views_per_movie <- colCounts(Sparsemat)
table_views <- data.frame(movie = names(views_per_movie),
views = views_per_movie) # create dataframe of views
table_views <- table_views[order(table_views$views,
decreasing = TRUE), ] # sort by number of views
table_views$title <- NA
for (i in 1:10325){
table_views[i,3] <- as.character(subset(movies,
movies$movieId == table_views[i,1])$title)
}
table_views[1:6,]
## movie views title
## 296 296 325 Pulp Fiction (1994)
## 356 356 311 Forrest Gump (1994)
## 318 318 308 Shawshank Redemption, The (1994)
## 480 480 294 Jurassic Park (1993)
## 593 593 290 Silence of the Lambs, The (1991)
## 260 260 273 Star Wars: Episode IV - A New Hope (1977)
ggplot(table_views[1:10, ], aes(x = title, y = views)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Number of views of the top movies")
As we can see that “Pulp Fiction (1994)” is the most viewed movie.
I decided to select a minimum number of users per rated movie as 50 and the minimum views number per movie as 50, so I can get the most relevant data.
movies_rat <- Sparsemat[rowCounts(Sparsemat) > 50,
colCounts(Sparsemat) > 50]
movies_rat
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
This new selection contains 420 users and 447 movies, compared to previous 668 users and 10325 movies in the total dataset.
To avoid biases I need to normalize the data in such a way that the average rating of each user is 0.
movies_rat_norm <- normalize(movies_rat)
sum(rowMeans(movies_rat_norm) > 0.00001)
## [1] 0
The visualization of the normalized matrix for the top movies is colored because the data is continuous:
min_movies <- quantile(rowCounts(movies_rat), 0.98)
min_users <- quantile(colCounts(movies_rat), 0.98)
image(movies_rat_norm[rowCounts(movies_rat_norm) > min_movies,
colCounts(movies_rat_norm) > min_users],
main = "Heatmap of the top users and movies")
I build the model using 80% of the whole dataset as a training set, and 20% - as a test set.
which_train <- sample(x = c(TRUE, FALSE),
size = nrow(movies_rat),
replace = TRUE,
prob = c(0.8, 0.2))
movie_train <- movies_rat[which_train, ]
movie_test <- movies_rat[!which_train, ]
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
IBCF_model <- Recommender(data = movie_train,
method = "IBCF",
parameter = list(k = 30))
IBCF_model
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 326 users.
class(IBCF_model)
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
n_recommended <- 10 # the number of items to recommend to each user
ibcf_preds <- predict(object = IBCF_model,
newdata = movie_test,
n = n_recommended)
ibcf_preds
## Recommendations as 'topNList' with n = 10 for 94 users.
Let’s explore the results of the recommendations for the first user:
recc_user_1 <- ibcf_preds@items[[1]] # recommendation for the first user
movies_user_1 <- ibcf_preds@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
movies_user_2[i] <- as.character(subset(movies,
movies$movieId == movies_user_1[i])$title)
}
movies_user_2
## [1] "Wizard of Oz, The (1939)"
## [2] "Blade (1998)"
## [3] "Mrs. Doubtfire (1993)"
## [4] "Crying Game, The (1992)"
## [5] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
## [6] "Witness (1985)"
## [7] "Home Alone (1990)"
## [8] "Broken Arrow (1996)"
## [9] "Game, The (1997)"
## [10] "Congo (1995)"
Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:
recc_matrix <- sapply(ibcf_preds@items,
function(x){ as.integer(colnames(movies_rat)[x]) })
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
## Movie title No of items
## 261 Little Women (1994) 14
## 5 Father of the Bride Part II (1995) 13
## 3 Grumpier Old Men (1995) 12
## 168 First Knight (1995) 12
IBCF recommends items on the basis of the similarity matrix. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data. In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $weighted
## [1] TRUE
##
## $normalize
## [1] "center"
##
## $min_matching_items
## [1] 0
##
## $min_predictive_items
## [1] 0
UBCF_model <- Recommender(data = movie_train, method = "UBCF")
UBCF_model
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 326 users.
model_details <- getModel(UBCF_model)
#names(model_details)
model_details$data
## 326 x 447 rating matrix of class 'realRatingMatrix' with 28870 ratings.
## Normalized using center on rows.
In the same way as the IBCF, I now determine the top ten recommendations for each new user in the test set.
n_recommended <- 10
ubcf_preds <- predict(object = UBCF_model,
newdata = movie_test,
n = n_recommended)
ubcf_preds
## Recommendations as 'topNList' with n = 10 for 94 users.
recc_matrix <- sapply(ubcf_preds@items,
function(x){ as.integer(colnames(movies_rat)[x]) })
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)
Compared with the IBCF, the distribution has a longer tail. This means that there are some movies that are recommended much more often than the others. The maximum is more than 30, compared to 10-ish for IBCF.
There are a few options to choose from when deciding to create a recommendation engine. In order to compare their performances and choose the most appropriate model, I follow these steps:
In order to define number of items to use for each user to generate recommentations, need to check the min number of items rated by users.
min(rowCounts(movies_rat))
## [1] 8
keep <- 5
threshold <- 3
n_eval <- 1 #number of times to run evaluation
evaluation <- evaluationScheme(data = movies_rat,
method = "split",
train = 0.8, #Splitting the data into training and test sets is often done using a 80/20 proportion
given = keep,
goodRating = threshold,
k = n_eval)
evaluation
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
ev_train = getData(evaluation, "train")
ev_known = getData(evaluation, "known")
ev_unknown = getData(evaluation, "unknown")
models_ev <- list(
IBCF_cosine = list(name = "IBCF",
param = list(method = "cosine")),#IBCF using the Cosine as the distance function
IBCF_pearson = list(name = "IBCF",
param = list(method = "pearson")),#IBCF using the Pearson correlation as the distance function
UBCF_cosine = list(name = "UBCF",
param = list(method = "cosine")),#UBCF using the Cosine as the distance function
UBCF_pearson = list(name = "UBCF",
param = list(method = "pearson")),#UBCF using the Pearson correlation as the distance function
Random = list(name = "RANDOM", param=NULL)#Random recommendations to have a base line
)
Then, I define a different set of numbers for recommended movies (n_recommendations <- c(1, 5, seq(10, 100, 10))), run and evaluate the models:
n_recommendations <- c(1, 5, seq(10, 100, 10))
mod_results <- evaluate(x = evaluation,
method = models_ev,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.32sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.37sec/0.02sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.12sec]
## UBCF run fold/sample [model time/prediction time]
## 1
## Timing stopped at: 0.02 0 0.03
## Error in neighbors[, x] : incorrect number of dimensions
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
## Warning in .local(x, method, ...):
## Recommender 'UBCF_pearson' has failed and has been removed from the results!
sapply(mod_results, class) == "evaluationResults"
## IBCF_cosine IBCF_pearson UBCF_cosine Random
## TRUE TRUE TRUE TRUE
The following table presents as an example the first rows of the performance evaluation matrix for the IBCF with Cosine distance:
avg_matrices <- lapply(mod_results, avg)
head(avg_matrices$IBCF_cos[, 5:8])
## precision recall TPR FPR
## 1 0.2530120 0.003823288 0.003823288 0.002058619
## 5 0.2024096 0.014229671 0.014229671 0.011065612
## 10 0.1783133 0.024691577 0.024691577 0.022706521
## 20 0.1649598 0.043287610 0.043287610 0.045825795
## 30 0.1694606 0.066064346 0.066064346 0.067333567
## 40 0.1700283 0.088997435 0.088997435 0.088916393
plot(mod_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(mod_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
A good performance index is the area under the curve (AUC), that is, the area under the ROC curve. Even without computing it, the chart shows that the highest is UBCF with cosine distance, so it’s the best-performing technique.
The UBCF with cosine distance is still the top model. Depending on what is the main purpose of the system, an appropriate number of items to recommend should be defined.
IBCF takes account of the k-closest items. I will explore more values, ranging between 5 and 40, in order to tune this parameter:
vector_k <- c(5, 10, 20, 30, 40)
models_to_evaluate <- lapply(vector_k, function(k){
list(name = "IBCF",
param = list(method = "cosine", k = k))
})
names(models_to_evaluate) <- paste0("IBCF_k_", vector_k)
Now I build and evaluate the same IBCF/cosine models with different values of the k-closest items:
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = evaluation,
method = models_to_evaluate,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.44sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.29sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.31sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.32sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.28sec/0.01sec]
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")
Based on the ROC curve’s plot, the k having the biggest AUC is 10. Another good candidate is 5, but it can never have a high TPR. This means that, even if we set a very high n value, the algorithm won’t be able to recommend a big percentage of items that the user liked. The IBCF with k = 5 recommends only a few items similar to the purchases. Therefore, it can’t be used to recommend many items.
Based on the precision/recall plot, k should be set to 10 to achieve the highest recall. If we are more interested in the precision, we set k to 5.
User-based Collaborative Filtering is a type of Memory-based Collaborative Filtering that uses all user data in the database to create recommendations.If there were millions of users, this computation would be very time consuming. UBCF needs to access the initial data. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix. Also, building the similarity matrix requires a lot of computing power and time.
However, UBCF’s accuracy is proven to be slightly more accurate than IBCF, so it’s a good option if the dataset is not too big. Result showed that User-based collaborative filtering with pearson is the best model out of the rest.
To see the implementation of the model visit the following URL: