For this project, the goal was to develop and deploy a Collaborative Filtering Recommender System (CFR) for Movie Recommendations.
A Collaborative Filtering approach relies only on users’ preferences; it does not factor in the attributes or features of the items being recommended.
In addition, the MovieLens Dataset was used to generate values for this Recommendation System.
After the Trained CFR Model was successfully implemented, this system was deployed in a Shiny R Application.
The data for this project is the MovieLens Dataset, which can be found here.
It contains 105339 ratings and 6138 tag applications across 10329 movies, rated by 668 users.
The zip file downloaded from the above link contained four files: links.csv, movies.csv, ratings.csv, and tags.csv.
This system uses the files movies.csv and ratings.csv.
movies <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/final_project/movies.csv",stringsAsFactors=FALSE)
ratings <- read.csv("https://raw.githubusercontent.com/josephsimone/Data-612/master/final_project/ratings.csv")
Description of the movies file:
Summary of the ratings file:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.500 3.000 3.500 3.517 4.000 5.000
For this to function properly, both userId and movieId have to be converted from their integer data types to factors.
In addition, the genres of the movies will need to be reformatted.
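The conversion code is not shown in the original; a minimal sketch, assuming the column names from the MovieLens CSVs loaded above:
# sketch: convert the ID columns from integers to factors
ratings$userId <- as.factor(ratings$userId)
ratings$movieId <- as.factor(ratings$movieId)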
First, the movie’s genres have to be converted into a one-hot-encoded format.
This will serve as the backbone for letting users search, in the long format, for the movies they have watched within specific genres.
g <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
library(data.table)
g2 <- as.data.frame(tstrsplit(g[,1], '[|]',
type.convert=TRUE),
stringsAsFactors=FALSE)
colnames(g2) <- c(1:10)
gl <- c("Action", "Adventure", "Animation", "Children",
"Comedy", "Crime","Documentary", "Drama", "Fantasy",
"Film-Noir", "Horror", "Musical", "Mystery","Romance",
"Sci-Fi", "Thriller", "War", "Western") # we have 18 genres in total
gm <- matrix(0,10330,18) #empty matrix, 10330=no of movies+1, 18=no of genres
gm[1,] <- gl #set first row to genre list
colnames(gm) <- gl #set column names to genre list
#iterate through matrix
for (i in 1:nrow(g2)) {
for (c in 1:ncol(g2)) {
genmat_col = which(gm[1,] == g2[i,c])
gm[i+1,genmat_col] <- 1
}
}
#convert into dataframe
gm2 <- as.data.frame(gm[-1,], stringsAsFactors=FALSE) #remove first row, which was the genre list
for (c in 1:ncol(gm2)) {
gm2[,c] <- as.integer(gm2[,c])
} #convert from characters to integers
head(gm2)
The creation of a search matrix acts as a database of movies by their genre(s).
From here we can see the data begin to grow in size and sparsity.
Now each movie will correspond to one or more genres.
This project utilizes the recommenderlab R package.
In order to build a recommendation engine within recommenderlab, the newly created ratings matrix must be converted into a sparse matrix of class realRatingMatrix.
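The conversion code is not shown in the original; a minimal sketch, assuming the ratings data frame loaded earlier (recommenderlab can coerce a user/item/rating data frame directly):
library(recommenderlab)
# sketch: coerce the (user, item, rating) columns to a realRatingMatrix
rm <- as(ratings[, c("userId", "movieId", "rating")], "realRatingMatrix")
rm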
## 668 x 10325 rating matrix of class 'realRatingMatrix' with 105339 ratings.
The recommenderlab package contains preconstructed models for recommendation algorithms:
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
## [1] "ALS_realRatingMatrix"          "ALS_implicit_realRatingMatrix"
## [3] "IBCF_realRatingMatrix" "LIBMF_realRatingMatrix"
## [5] "POPULAR_realRatingMatrix" "RANDOM_realRatingMatrix"
## [7] "RERECOMMEND_realRatingMatrix" "SVD_realRatingMatrix"
## [9] "SVDF_realRatingMatrix" "UBCF_realRatingMatrix"
## $ALS_realRatingMatrix
## [1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
##
## $ALS_implicit_realRatingMatrix
## [1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
##
## $IBCF_realRatingMatrix
## [1] "Recommender based on item-based collaborative filtering."
##
## $LIBMF_realRatingMatrix
## [1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
##
## $POPULAR_realRatingMatrix
## [1] "Recommender based on item popularity."
##
## $RANDOM_realRatingMatrix
## [1] "Produce random recommendations (real ratings)."
##
## $RERECOMMEND_realRatingMatrix
## [1] "Re-recommends highly rated items (real ratings)."
##
## $SVD_realRatingMatrix
## [1] "Recommender based on SVD approximation with column-mean imputation."
##
## $SVDF_realRatingMatrix
## [1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
##
## $UBCF_realRatingMatrix
## [1] "Recommender based on user-based collaborative filtering."
This project will utilize both the IBCF and UBCF models and compare their performance.
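Their default parameters can be pulled from the registry loaded above (the call below mirrors the one used later in the IBCF section); the first listing shows the IBCF defaults:
recommender_models$IBCF_realRatingMatrix$parameters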
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
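And the UBCF defaults:
recommender_models$UBCF_realRatingMatrix$parameters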
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
Collaborative Filtering is based on measuring the similarity between users or between items.
Within recommenderlab, the supported methods to compute similarities are cosine, pearson, and jaccard.
Next, we determine how similar the first four users are to each other.
similarity_users <- similarity(rm[1:4, ],
method = "cosine",
which = "users")
as.matrix(similarity_users)
##           1         2         3         4
## 1 0.0000000 0.9760860 0.9641723 0.9914398
## 2 0.9760860 0.0000000 0.9925732 0.9374253
## 3 0.9641723 0.9925732 0.0000000 0.9888968
## 4 0.9914398 0.9374253 0.9888968 0.0000000
Using the same approach, we compute the similarity of the first four movies.
similarity_items <- similarity(rm[, 1:4], method =
"cosine", which = "items")
as.matrix(similarity_items)
##           1         2         3         4
## 1 0.0000000 0.9669732 0.9559341 0.9101276
## 2 0.9669732 0.0000000 0.9658757 0.9412416
## 3 0.9559341 0.9658757 0.0000000 0.9864877
## 4 0.9101276 0.9412416 0.9864877 0.0000000
Now, let’s explore the rating values in the second data file.
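The extraction code is not shown in the original; a sketch, assuming the realRatingMatrix rm created earlier (coercing its data slot to a vector yields every cell, with 0 marking missing entries):
# sketch: pull all rating values out of the sparse matrix
rating_values <- as.vector(rm@data)
unique(rating_values)
table(rating_values)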
## [1] 0.0 5.0 4.0 3.0 4.5 1.5 2.0 3.5 1.0 2.5 0.5
## rating_values
##       0     0.5       1     1.5       2     2.5       3     3.5       4     4.5       5
## 6791761    1198    3258    1567    7943    5484   21729   12237   28880    8187   14856
There are 11 unique rating values.
A rating equal to 0 represents a missing value; therefore, these are removed from the dataset before visualizing the results.
rating_values <- rating_values[rating_values != 0]
rating_values <- factor(rating_values)
qplot(rating_values) +
ggtitle("Ratings Distrubtions")The most common rating is \(4\).
The majority of movies are rated with a score of 3 or higher.
views_per_movie <- colCounts(rm)
count_views <- data.frame(movie = names(views_per_movie),
views = views_per_movie)
count_views <- count_views[order(count_views$views,
decreasing = TRUE), ]
count_views$title <- NA
for (i in 1:10325){
count_views[i,3] <- as.character(subset(movies,
movies$movieId == count_views[i,1])$title)
}
ggplot(count_views[1:6, ], aes(x = title, y = views)) +
geom_bar(stat="identity") +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ggtitle("Views of Top Movies")“Pulp Fiction (1994)” is the most watched Movie, with “Forrest Gump (1994)” being the second.
To find the Top-Rated Movies, the average rating for each was calculated.
average_ratings <- colMeans(rm)
qplot(average_ratings) +
stat_bin(binwidth = 0.1) +
ggtitle("Distribution of the average movie rating")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
average_ratings_relevant <- average_ratings[views_per_movie > 50]
qplot(average_ratings_relevant) +
stat_bin(binwidth = 0.1) +
ggtitle(paste("Distribution of Relevant Average Ratings"))## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
The first graph represents the distribution of the average movie rating. The highest value is ~3, with a few movies whose rating is either 1 or 5.
This is most likely because those movies received ratings from only a few users, so we should exclude them.
The movies whose number of views is below the defined threshold of 50 were removed. This creates a narrower subset of the most relevant movies.
The second graph represents the distribution of the relevant average ratings. The rankings are between 2.16 and 4.45, and the highest value is now ~4.
In order to select relevant data, we define the minimum number of users per rated movie and the minimum views per movie as 50.
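The filtering code is not shown in the original; a minimal sketch (the name movie_ratings matches later usage):
# sketch: keep users with > 50 ratings and movies with > 50 views
movie_ratings <- rm[rowCounts(rm) > 50,
                    colCounts(rm) > 50]
movie_ratings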
## 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
The previous rating matrix had 668 users and 10325 movies; the newly created most-relevant rating matrix contains 420 users and 447 movies.
average_ratings_per_user <- rowMeans(movie_ratings)
qplot(average_ratings_per_user) + stat_bin(binwidth = 0.1) +
ggtitle("Distribution of Average Ratings, per User")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
A user pool that consistently rates at high or low values can introduce bias.
To circumvent this problem, normalization of the data was needed.
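A sketch of that step, using recommenderlab’s normalize(); the check below should print 0 once every user’s mean rating is centered (movie_ratings_norm matches later usage):
movie_ratings_norm <- normalize(movie_ratings)
# sketch: count users whose mean rating is still above ~0
sum(rowMeans(movie_ratings_norm) > 0.00001)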
## [1] 0
Now, I visualize the normalized matrix for the top movies. It is colored now because the data is continuous:
min_movies <- quantile(rowCounts(movie_ratings), 0.95)
min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_norm[rowCounts(movie_ratings) > min_movies,
colCounts(movie_ratings) > min_users],
main = "Heatmap - Top Users & Movies")There are still some lines that seem to be more blue or more red.
This is due to the above chart is visualizing only the top movies.
However, the average rating is still 0 for each user.
In order for the recommendation models to work well with the data we already have, conversion to binary data will be useful. This is done by defining a matrix of 0’s and 1’s.
The 0’s will be treated either as missing values or as bad ratings.
Set-up
Depending on the context, one option might be better suited to the needs of the models.
Next, we define two matrices following the two different “set-up” options and visualize a 5% portion of each newly created binary matrix.
Option 1: Define a matrix equal to 1, if the movie has been watched
movie_ratings_watched <- binarize(movie_ratings, minRating = 1)
boolean_min_movies <- quantile(rowCounts(movie_ratings), 0.95)
boolean_min_users <- quantile(colCounts(movie_ratings), 0.95)
image(movie_ratings_watched[rowCounts(movie_ratings) > boolean_min_movies,
colCounts(movie_ratings) > boolean_min_users],
main = "Heatmap - Top Users & Movies")Option 2: Define a matrix equal to 1 if the cell has a rating above the threshold
movie_ratings_good <- binarize(movie_ratings, minRating = 3)
image(movie_ratings_good[rowCounts(movie_ratings) > boolean_min_movies,
colCounts(movie_ratings) > boolean_min_users],
main = "Heatmap - Top Users & Movies")In the second heatmap, there are more white cells which means that there are more movies with no or bad ratings than movies that were not viewed.
For this type of model, we first need a rating matrix whose rows correspond to users and whose columns correspond to items.
This approach is based on finding, for each item, its k most similar items.
The model was built using \(80\%\) of the total dataset as a training set and \(20\%\) as a test set.
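The splitting code is not shown in the original; a sketch of a random 80/20 split (the names movie_train and movie_test match later usage):
# sketch: randomly assign each user to the training or test set
which_train <- sample(c(TRUE, FALSE), size = nrow(movie_ratings),
                      replace = TRUE, prob = c(0.8, 0.2))
movie_train <- movie_ratings[which_train, ]
movie_test <- movie_ratings[!which_train, ]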
Let’s have a look at the default parameters of the IBCF model. Here, k is the number of items for which similarities are computed in the first step; for each item, the algorithm identifies its k most similar items and stores that set. method is the similarity function, Cosine by default (pearson is also available). I create the model using the default parameters of method = Cosine and k = 30.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
## $k
## [1] 30
##
## $method
## [1] "Cosine"
##
## $normalize
## [1] "center"
##
## $normalize_sim_matrix
## [1] FALSE
##
## $alpha
## [1] 0.5
##
## $na_as_zero
## [1] FALSE
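The model-building call itself is not shown in the original; a sketch using the defaults discussed above (the name movie_model matches later usage):
# sketch: fit the IBCF model on the training set
movie_model <- Recommender(data = movie_train, method = "IBCF",
                           parameter = list(k = 30))
movie_model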
## Recommender of type 'IBCF' for 'realRatingMatrix'
## learned using 340 users.
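The class and dimension output below was presumably produced by inspecting the fitted model; a sketch (results_model matches later usage):
results_model <- getModel(movie_model)  # sketch: extract the fitted model
class(movie_model)
class(results_model$sim)
dim(results_model$sim)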
## [1] "Recommender"
## attr(,"package")
## [1] "recommenderlab"
## [1] "dgCMatrix"
## attr(,"package")
## [1] "Matrix"
## [1] 447 447
n_items_top <- 20
image(results_model$sim[1:n_items_top, 1:n_items_top],
main = "Heatmap - First Rows and Columns")## row_sums
## 30
## 447
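The row_sums table above was presumably produced by counting the non-zero similarities per row; a sketch:
# sketch: every row should contain exactly k = 30 non-zero entries
row_sums <- rowSums(results_model$sim > 0)
table(row_sums)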
col_sums <- colSums(results_model$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of Column Count")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This newly created dgCMatrix similarity matrix has dimensions 447 x 447, equal to the number of items.
The heatmap of the first 20 items shows that many values are equal to 0.
This is because each row contains only \(k = 30\) elements greater than 0.
The number of non-null elements in each column depends on how often the corresponding movie appears in the top k of another movie.
Therefore, this matrix is not symmetric, which is also the case in our model.
The chart of the distribution of elements by column shows that only a few movies are similar to many others.
n_recommended <- 10
model_predictions <- predict(object = movie_model,
newdata = movie_test,
n = n_recommended)
## Recommendations as 'topNList' with n = 10 for 80 users.
Let’s explore the results of the recommendations for the first user:
user_recommendation_1 <- model_predictions@items[[1]]
movies_user_1 <- model_predictions@itemLabels[user_recommendation_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
movies_user_2[i] <- as.character(subset(movies,
movies$movieId == movies_user_1[i])$title)
}## [1] "Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964)"
## [2] "Wedding Singer, The (1998)"
## [3] "Clueless (1995)"
## [4] "Austin Powers: The Spy Who Shagged Me (1999)"
## [5] "Star Trek: Generations (1994)"
## [6] "Bridget Jones's Diary (2001)"
## [7] "Austin Powers: International Man of Mystery (1997)"
## [8] "Robin Hood: Men in Tights (1993)"
## [9] "Leaving Las Vegas (1995)"
## [10] "GoldenEye (1995)"
matrix_recommendation <- sapply(model_predictions@items,
function(x){ as.integer(colnames(movie_ratings)[x]) })
matrix_recommendation[,1:4]
##       [,1] [,2]  [,3] [,4]
## [1,] 750 25 8665 1
## [2,] 1777 32 2424 2
## [3,] 39 104 48780 3
## [4,] 2683 110 5816 5
## [5,] 329 161 3703 7
## [6,] 4246 165 2542 10
## [7,] 1517 223 60069 11
## [8,] 520 231 3114 16
## [9,] 25 317 784 17
## [10,] 10 318 3421 19
Here, the columns represent the first 4 users, and the rows are the movieId values of the 10 recommended movies.
Now, let’s identify the most recommended movies. The following image shows the distribution of the number of items for IBCF:
num_items <- factor(table(matrix_recommendation))
chart_title <- "Distribution of Items for IBCF"
qplot(num_items) + ggtitle(chart_title)
num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)),
num_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
Most of the movies have been recommended only a few times, and a few movies have been recommended more than 5 times.
IBCF recommends items on the basis of the similarity matrix. It’s an eager-learning model, that is, once it’s built, it doesn’t need to access the initial data. For each item, the model stores the k-most similar, so the amount of information is small once the model is built. This is an advantage in the presence of lots of data.
In addition, this algorithm is efficient and scalable, so it works well with big rating matrices.
Now, I will use the user-based approach. According to this approach, given a new user, its similar users are first identified. Then, the top-rated items rated by similar users are recommended.
For each new user, these are the steps:
Measure how similar each user is to the new one. As with IBCF, popular similarity measures are correlation and cosine.
Identify the most similar users, either by taking the top k users or by taking the users whose similarity is above a defined threshold.
Rate the movies rated by the most similar users. The rating is the average rating among similar users, computed either as a plain average or as a weighted average using the similarities as weights.
Pick the top-rated movies.
Again, let’s first check the default parameters of the UBCF model. Here, nn is the number of similar users, and method is the similarity function, cosine by default. I build a recommender model leaving the parameters at their defaults and using the training set.
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
## $method
## [1] "cosine"
##
## $nn
## [1] 25
##
## $sample
## [1] FALSE
##
## $normalize
## [1] "center"
## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 340 users.
## 340 x 447 rating matrix of class 'realRatingMatrix' with 30415 ratings.
## Normalized using center on rows.
In the same way as for the IBCF, I now determine the top ten recommendations for each new user in the test set.
n_recommended <- 10
model_predictions <- predict(object = movie_model,
newdata = movie_test,
n = n_recommended)
model_predictions
## Recommendations as 'topNList' with n = 10 for 80 users.
First Four Users
matrix_recommendation <- sapply(model_predictions@items,
function(x){ as.integer(colnames(movie_ratings)[x]) })
##      [,1] [,2] [,3] [,4]
## [1,] 1276 296 318 593
## [2,] 1250 356 50 318
## [3,] 1252 260 11 356
## [4,] 1288 527 357 47
## [5,] 6 318 265 110
## [6,] 778 34 1193 11
## [7,] 51662 778 1183 595
## [8,] 2617 1210 497 480
## [9,] 2692 50 17 527
## [10,] 1247 150 141 2571
The matrix above contains the movieId of each recommended movie (rows) for the first four users (columns) in the test dataset.
num_items <- factor(table(matrix_recommendation))
chart_title <- "Distribution of Items for UBCF"
qplot(num_items) + ggtitle(chart_title)
Compared with IBCF, the distribution shows that some movies are recommended much more often than others.
The maximum is more than 30, compared to \(10\) for IBCF.
num_items_sorted <- sort(num_items, decreasing = TRUE)
num_items_top <- head(num_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(num_items_top)), num_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie Titles", "# Items")
head(table_top)
Comparing the outcomes of the UBCF with those of the IBCF helps in finding useful insight into the different methods.
The UBCF needs to access the initial data. Since it needs to keep the entire database in memory, it doesn’t work well in the presence of a big rating matrix.
In addition, building the similarity matrix requires a lot of computing power and time.
However, UBCF is proven to be slightly more accurate than IBCF (also discussed in the next section), so it’s a good option if the dataset is not too big.
In order to compare the models’ performances and choose the best-suited model, there are three approaches:
Training and Test Sets
Bootstrapping Data
K-Fold Approach
Splitting the data into Training and Test Sets at an 80/20 proportion.
First, for each user in the test set, we need to define the number of movies to use to generate recommendations.
To achieve this, we check the minimum number of movies rated by any user, to be sure there will be no users with no movies to test.
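The parameter definitions are not shown in the original; a sketch, with values inferred from the printed evaluation scheme below:
min(rowCounts(movie_ratings))  # minimum number of movies rated by any user
train_percent <- 0.8  # assumed from the 80/20 split above
keep <- 5             # items given to the recommender per test user (per the scheme output)
best_rating <- 3      # threshold for a "good" rating (per the scheme output)
n_eval <- 1           # number of evaluation runs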
## [1] 8
eval_sets <- evaluationScheme(data = movie_ratings,
method = "split",
train = train_percent,
given = keep,
goodRating = best_rating, k = n_eval)
## Evaluation scheme with 5 items given
## Method: 'split' with 1 run(s).
## Training set proportion: 0.800
## Good ratings: >=3.000000
## Data set: 420 x 447 rating matrix of class 'realRatingMatrix' with 38341 ratings.
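The three matrices below were presumably printed with getData(); a sketch:
getData(eval_sets, "train")    # training set
getData(eval_sets, "known")    # items given to the recommender
getData(eval_sets, "unknown")  # items held out for testing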
## 336 x 447 rating matrix of class 'realRatingMatrix' with 30369 ratings.
## 84 x 447 rating matrix of class 'realRatingMatrix' with 420 ratings.
## 84 x 447 rating matrix of class 'realRatingMatrix' with 7552 ratings.
qplot(rowCounts(getData(eval_sets, "unknown"))) +
geom_histogram(binwidth = 10) +
ggtitle("unknown moviess by the users")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
This represents the unknown moviess by the users, varying quite a lot.
The same user can be sampled more than once during bootstrapping.
When the training set is kept the same size as before, there will be more users in the test set.
eval_sets <- evaluationScheme(data = movie_ratings,
method = "bootstrap",
train = train_percent,
given = keep,
goodRating = best_rating,
k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
qplot(n_repetitions) +
ggtitle("Repetitions in the Training Set")Respresents that most of the users that have sampled lower than four times.
K-Fold Cross-Validation yields the most accurate results; however, it is the most resource-intensive.
n_total <- 4
eval_sets <- evaluationScheme(data = movie_ratings,
method = "cross-validation",
k = n_total,
given = keep,
goodRating = best_rating)
size_sets <- sapply(eval_sets@runsTrain, length)
size_sets
## [1] 315 315 315 315
Using the K-Fold approach results in four sets of the same size, 315.
The K-Fold Approach is used for evaluation of the results:
First, re-define the evaluation sets.
Build the IBCF model.
Create a matrix with predicted ratings.
eval_sets <- evaluationScheme(data = movie_ratings,
method = "cross-validation",
k = n_total,
given = keep,
goodRating = best_rating)
evaluation_for_model <- "IBCF"
model_parameters <- NULL
evaluate_model <- Recommender(data = getData(eval_sets, "train"),
method = evaluation_for_model, parameter = model_parameters)
ri <- 10
evaluate_prediction <- predict(object = evaluate_model,
newdata = getData(eval_sets, "known"),
n = ri,
type = "ratings")qplot(rowCounts(evaluate_prediction)) +
geom_histogram(binwidth = 10) +
ggtitle("Distribution of Movies, per user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Represents the distribution of movies per user in the matrix of predicted ratings.
Most of the RMSEs (Root mean square errors) are in the range of 0.5 to 1.8:
evaluate_accuracy <- calcPredictionAccuracy(x = evaluate_prediction,
data = getData(eval_sets, "unknown"),
byUser = TRUE)
head(evaluate_accuracy)
##           RMSE       MSE       MAE
## [1,] 1.0399991 1.0815981 0.7794960
## [2,] 2.0816660 4.3333333 2.0000000
## [3,] 1.0787390 1.1636779 0.7935005
## [4,] 1.1072313 1.2259611 0.8281737
## [5,] 0.6859943 0.4705882 0.4705882
## [6,] 0.6653993 0.4427563 0.5689712
qplot(evaluate_accuracy[, "RMSE"]) +
geom_histogram(binwidth = 0.1) +
ggtitle("Distribution of RMSE, by user")## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 3 rows containing non-finite values (stat_bin).
## Warning: Removed 3 rows containing non-finite values (stat_bin).
evaluate_accuracy <- calcPredictionAccuracy(x = evaluate_prediction,
data = getData(eval_sets, "unknown"),
byUser = FALSE)
evaluate_accuracy
##     RMSE      MSE      MAE
## 1.415940 2.004886 1.038826
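The evaluation run and confusion matrix below were presumably produced by a call along these lines (the object name results matches the getConfusionMatrix() call further down):
# sketch: evaluate the IBCF model over increasing list lengths
results <- evaluate(x = eval_sets,
                    method = evaluation_for_model,
                    n = seq(10, 100, 10))
head(getConfusionMatrix(results)[[1]])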
## IBCF run fold/sample [model time/prediction time]
## 1 [0.25sec/0.03sec]
## 2 [0.26sec/0.04sec]
## 3 [0.3sec/0.03sec]
## 4 [0.34sec/0.05sec]
## TP FP FN TN precision recall TPR
## 10 1.676190 8.038095 78.42857 353.8571 0.1725490 0.01897097 0.01897097
## 20 3.571429 15.857143 76.53333 346.0381 0.1838235 0.04110615 0.04110615
## 30 5.209524 23.742857 74.89524 338.1524 0.1798257 0.06062756 0.06062756
## 40 6.733333 31.428571 73.37143 330.4667 0.1763070 0.07755275 0.07755275
## 50 8.266667 38.704762 71.83810 323.1905 0.1764674 0.09480993 0.09480993
## 60 9.695238 45.542857 70.40952 316.3524 0.1766989 0.11216576 0.11216576
## FPR
## 10 0.02244192
## 20 0.04419473
## 30 0.06604743
## 40 0.08737263
## 50 0.10741809
## 60 0.12653896
sun_col <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, sun_col]
head(indices_summed)## TP FP FN TN
## 10 6.752381 32.12381 295.3810 1433.743
## 20 13.190476 64.28571 288.9429 1401.581
## 30 19.552381 95.61905 282.5810 1370.248
## 40 25.380952 126.08571 276.7524 1339.781
## 50 30.780952 155.29524 271.3524 1310.571
## 60 35.676190 183.01905 266.4571 1282.848
From the above results, if a small percentage of rated movies is recommended, the precision decreases.
At the same time, the higher the percentage of rated movies recommended, the higher the recall.
In order to compare the different models, we create a baseline list of models to evaluate:
base_line <- list(
IBCF_cosine = list(name = "IBCF",
param = list(method = "cosine")),
IBCF_pearson = list(name = "IBCF",
param = list(method = "pearson")),
UBCF_cosine = list(name = "UBCF",
param = list(method = "cosine")),
UBCF_pearson = list(name = "UBCF",
param = list(method = "pearson")),
Random = list(name = "RANDOM", param=NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets,
method = base_line,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.31sec/0.03sec]
## 2 [0.25sec/0.04sec]
## 3 [0.2sec/0.03sec]
## 4 [0.22sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.29sec/0.01sec]
## 2 [0.33sec/0.03sec]
## 3 [0.33sec/0.08sec]
## 4 [0.33sec/0.03sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.17sec]
## 2 [0sec/0.15sec]
## 3 [0sec/0.13sec]
## 4 [0sec/0.14sec]
## UBCF run fold/sample [model time/prediction time]
## 1 [0sec/0.16sec]
## 2 [0sec/0.18sec]
## 3 [0sec/0.15sec]
## 4 [0sec/0.14sec]
## RANDOM run fold/sample [model time/prediction time]
## 1 [0sec/0.03sec]
## 2 [0sec/0.03sec]
## 3 [0sec/0.04sec]
## 4 [0sec/0.03sec]
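The checks and the averaged confusion matrix below were presumably produced along these lines; which model’s matrix is printed is an assumption (UBCF_cosine is used here):
# sketch: confirm each run produced evaluation results, then average them
sapply(results_list, class) == "evaluationResults"
avg_matrices <- lapply(results_list, avg)
head(avg_matrices$UBCF_cosine[, 5:8])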
## IBCF_cosine IBCF_pearson UBCF_cosine UBCF_pearson Random
## TRUE TRUE TRUE TRUE TRUE
## precision recall TPR FPR
## 1 0.2052959 0.002714384 0.002714384 0.002080115
## 5 0.1794248 0.012236246 0.012236246 0.010943537
## 10 0.1738647 0.022505419 0.022505419 0.022026967
## 20 0.1706899 0.042833868 0.042833868 0.044138414
## 30 0.1705764 0.063677949 0.063677949 0.065537575
## 40 0.1685407 0.081774511 0.081774511 0.086342530
ROC curves & Precision/Recall Curves.
The UBCF with cosine distance performs the best out of all the models.
IBCF takes into consideration the k nearest neighbors.
We tune this parameter over values ranging between 5 and 40.
k <- c(5, 10, 20, 30, 40)
base_line <- lapply(k, function(k){
list(name = "IBCF",
param = list(method = "cosine", k = k))
})
names(base_line) <- paste0("IBCF_k_", k)
Construct IBCF/Cosine models with different values of the k-nearest-neighbor parameter.
n_recommendations <- c(1, 5, seq(10, 100, 10))
results_list <- evaluate(x = eval_sets,
method = base_line,
n = n_recommendations)
## IBCF run fold/sample [model time/prediction time]
## 1 [0.22sec/0.01sec]
## 2 [0.25sec/0.03sec]
## 3 [0.22sec/0.02sec]
## 4 [0.22sec/0.03sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.23sec/0.02sec]
## 2 [0.22sec/0.02sec]
## 3 [0.22sec/0.02sec]
## 4 [0.24sec/0.01sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.2sec/0.04sec]
## 2 [0.23sec/0.05sec]
## 3 [0.24sec/0.03sec]
## 4 [0.22sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.22sec/0.02sec]
## 2 [0.23sec/0.04sec]
## 3 [0.21sec/0.04sec]
## 4 [0.25sec/0.02sec]
## IBCF run fold/sample [model time/prediction time]
## 1 [0.22sec/0.03sec]
## 2 [0.23sec/0.02sec]
## 3 [0.22sec/0.03sec]
## 4 [0.24sec/0.01sec]
The plot of the ROC Curves shows that \(k = 10\) has the biggest AUC.
Another candidate is \(k = 5\), but it can never achieve a high TPR.
This is because the IBCF with k = 5 recommends only a few movies similar to the user’s rated movies.
Therefore, it cannot be the model used for recommendations.
Based on the Precision/Recall Plot, \(k\) should equal \(10\) to achieve the highest recall.
If we are instead more interested in precision, we set \(k\) to \(5\).
User-Based Collaborative Filtering
Strengths
User-Based Collaborative Filtering provides recommendations that are complementary to the items the user is observing.
This provides stronger recommendations than an Item-Based Recommender.
Weaknesses
User-Based Collaborative Filtering is a memory-intensive approach that uses all user data in the database to create recommendations.
However, comparing the pairwise correlation of every user in the dataset is not scalable: the more users, the longer it takes to compute the recommendations.
For this project, a Collaborative Filtering Recommender System was developed, evaluated, and implemented for the recommendation of movies.
The deployment of a Shiny R Application demonstrates the User-Based Collaborative Filtering Approach for the Recommendation Model.
Implementation