Instruction
Introduction
Dataset
Data Exploration
Data Wrangling
Model Exploration
Data Distribution
Collaborative Filtering Model
Item-based Collaborative Filtering
Building the IBCF model
User-based Collaborative Filtering Model
Building the UBCF model
Evaluation of model
Accuracy Measures
Probability thresholds
Comparing models
Ideal model
Spark Implementation
Spark Accuracy Measures
Spark Prediction
Evaluation Metrics for the Recommender System
Summary
Hands on with AWS

This assignment is designed to give you hands-on experience deploying a cloud solution. Take one of your previous assignments and deploy it in AWS. You may also choose to do this as part of your final project and submit both this assignment and the final project at the same time. To get full credit for this assignment, your cloud solution must include the following pieces:

• Utilize some sort of long-term storage. This can be either a database like RDS or file storage like S3.
• Utilize some sort of compute service. This can be a virtual machine like EC2 or a compute solution like AWS Lambda.
• Utilize at least one VPC to ensure that only you or your application can access your data.

If you choose to, you may pay for AWS services, but keep in mind it can get expensive. That said, this assignment has been designed so that it can be completed using only the capacities provided by the AWS Free Tier. All the services in the Free Tier can be found here: https://aws.amazon.com/free/?all-free-tier

Submission: Submit a video link where you walk through all three aspects mentioned above. Discuss why you chose the specific solution, and what its advantages and disadvantages are for the compute and storage parts.
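As one illustration of the storage piece, the sketch below uploads and reads back the ratings file with the cloudyr aws.s3 package. This is an assumption for illustration (the actual deployment may have used the AWS console or CLI), and the bucket name is hypothetical; credentials are expected in the usual AWS environment variables:

library(aws.s3) # assumes AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY are set

bucket <- "recom-system-data-final-project" # hypothetical bucket name

# Long-term storage: push the ratings file to S3
put_object(file = "ratings.csv", object = "ratings.csv", bucket = bucket)

# Read it back without writing to local disk
ratings_s3 <- s3read_using(FUN = read.csv, object = "ratings.csv", bucket = bucket)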
This project focuses on analyzing the MovieLens dataset using Collaborative Filtering techniques to predict user-item ratings. Specifically, it implements the Alternating Least Squares (ALS) algorithm in Apache Spark to generate recommendations and evaluate prediction accuracy.
Collaborative Filtering is a popular method for making automated predictions about a user’s interests by leveraging the preferences and behaviors of many users. With Spark’s ALS model, both users and items are represented by a set of latent factors, which are learned from the data and used to estimate missing ratings.
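Concretely, ALS represents each user u by a factor vector x_u and each item i by a factor vector y_i, and predicts a missing rating as their inner product. A standard form of the regularized objective it minimizes (notation ours, not from the original report) is:

$$\min_{X,Y} \sum_{(u,i)\,\text{observed}} \left( r_{ui} - x_u^{\top} y_i \right)^2 + \lambda \left( \sum_u \lVert x_u \rVert^2 + \sum_i \lVert y_i \rVert^2 \right)$$

Fixing Y turns the problem into an ordinary least-squares solve for each x_u, and vice versa; ALS alternates these two steps until convergence.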
We used the MovieLens 1M dataset, which contains about one million ratings (1,000,209) from 6,040 users on roughly 3,900 movies. Each rating is an integer between 1 and 5, and the dataset includes demographic information about users (age, gender, occupation, zip code) as well as movie metadata (title, genres). This well-structured dataset is widely used in recommender-system research and is well suited to building collaborative filtering and hybrid models.
movies <- read.csv("http://recom-system-data-final-project.s3-website.us-east-2.amazonaws.com/movies.csv",
stringsAsFactors = FALSE)
ratings <- read.csv("http://recom-system-data-final-project.s3-website.us-east-2.amazonaws.com/ratings.csv",
stringsAsFactors = FALSE)
summary(movies)
movieId title genres
Min. : 1.0 Length:3883 Length:3883
1st Qu.: 982.5 Class :character Class :character
Median :2010.0 Mode :character Mode :character
Mean :1986.0
3rd Qu.:2980.5
Max. :3952.0
summary(ratings)
userId movieId rating timestamp
Min. : 1 Min. : 1 Min. :1.000 Min. :9.567e+08
1st Qu.:1506 1st Qu.:1030 1st Qu.:3.000 1st Qu.:9.653e+08
Median :3070 Median :1835 Median :4.000 Median :9.730e+08
Mean :3025 Mean :1866 Mean :3.582 Mean :9.722e+08
3rd Qu.:4476 3rd Qu.:2770 3rd Qu.:4.000 3rd Qu.:9.752e+08
Max. :6040 Max. :3952 Max. :5.000 Max. :1.046e+09
nrow(ratings) # total number of ratings
[1] 1000209
# distinct users and movies in the ratings set (reconstructed call)
ratings %>% summarise(n_users = n_distinct(userId), n_movies = n_distinct(movieId))
n_users n_movies
1 6040 3706
# par(mfrow=c(2,2)) # no effect on ggplot2 output; kept for reference
ratings %>%
  count(movieId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "red") +
  scale_x_log10() +
  ggtitle("Movies") +
  labs(subtitle = "number of ratings by movieId",
       x = "number of ratings (log scale)",
       y = "number of movies",
       caption = "source data: ratings set") +
  theme(panel.border = element_rect(colour = "black", fill = NA))
This histogram shows the distribution of the number of ratings per movie on a log scale. Most movies have relatively few ratings, while a small number receive a large volume of ratings. This long-tail pattern is typical in recommendation datasets and highlights the popularity imbalance across items.
ratings %>%
  count(userId) %>%
  ggplot(aes(n)) +
  geom_histogram(bins = 30, color = "gold") +
  scale_x_log10() +
  ggtitle("Users") +
  labs(subtitle = "number of ratings by userId",
       x = "number of ratings (log scale)",
       y = "number of users") +
  theme(panel.border = element_rect(colour = "black", fill = NA))
This histogram illustrates how many ratings each user has given. Most users rate a relatively small number of movies, while a few are highly active. The log scale reveals a wide variation in user engagement, which is important to consider when building personalized recommendation models.
kable(head(ratings),caption = "Sample Ratings Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea7872")
userId | movieId | rating | timestamp |
---|---|---|---|
1 | 1193 | 5 | 978300760 |
1 | 661 | 3 | 978302109 |
1 | 914 | 3 | 978301968 |
1 | 3408 | 4 | 978300275 |
1 | 2355 | 5 | 978824291 |
1 | 1197 | 3 | 978302268 |
kable(head(movies),caption = "Sample Movies Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea7872")
movieId | title | genres |
---|---|---|
1 | Toy Story (1995) | Animation|Children’s|Comedy |
2 | Jumanji (1995) | Adventure|Children’s|Fantasy |
3 | Grumpier Old Men (1995) | Comedy|Romance |
4 | Waiting to Exhale (1995) | Comedy|Drama |
5 | Father of the Bride Part II (1995) | Comedy |
6 | Heat (1995) | Action|Crime|Thriller |
We processed the dataset to ensure it is clean and ready for analysis. This involves merging the rating data with user and movie metadata, converting data types as needed, and handling missing or inconsistent entries.
library(data.table) # needed here for tstrsplit()
genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]',
type.convert=TRUE),
stringsAsFactors=FALSE)
colnames(genres2) <- paste0("V", 1:ncol(genres2)) # Dynamic colnames to avoid length mismatch
# Note: MovieLens 1M labels the genre "Children's"; the "Children" entry below
# will not match that string, so its column stays zero unless the label is adjusted.
genre_list <- c("Action", "Adventure", "Animation", "Children",
"Comedy", "Crime","Documentary", "Drama", "Fantasy",
"Film-Noir", "Horror", "Musical", "Mystery","Romance",
"Sci-Fi", "Thriller", "War", "Western") # 18 genres
# genre_matrix <- matrix(0, 10330, 18) # 10329 movies + 1 header row
genre_matrix <- matrix(0, nrow(movies) + 1, length(genre_list)) # dynamically match movie count
genre_matrix[1,] <- genre_list
colnames(genre_matrix) <- genre_list
# Fill genre matrix
for (i in 1:nrow(genres2)) {
for (c in 1:ncol(genres2)) {
genmat_col <- which(genre_matrix[1,] == genres2[i,c])
if (length(genmat_col) > 0) {
genre_matrix[i+1, genmat_col] <- 1
}
}
}
# Convert to data frame
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE) # Remove header row
for (c in 1:ncol(genre_matrix2)) {
genre_matrix2[,c] <- as.integer(genre_matrix2[,c])
}
# Combine with movie info
search_matrix <- cbind(movies[,1:2], genre_matrix2)
# Table display
kable(head(search_matrix), caption = "Sample Movies Dataset") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
scroll_box(width = "100%", height = "300px")
movieId | title | Action | Adventure | Animation | Children | Comedy | Crime | Documentary | Drama | Fantasy | Film-Noir | Horror | Musical | Mystery | Romance | Sci-Fi | Thriller | War | Western |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | Toy Story (1995) | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
2 | Jumanji (1995) | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
3 | Grumpier Old Men (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 |
4 | Waiting to Exhale (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
5 | Father of the Bride Part II (1995) | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
6 | Heat (1995) | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
We have explored various recommendation algorithms to identify the most effective approach. Starting with baseline models, such as user-based (UBCF) and item-based collaborative filtering (IBCF), we then explored advanced techniques, including matrix factorization, to improve prediction accuracy and recommendation quality.
library(data.table)
# Convert ratings to a data.table
setDT(ratings)
# Create the ratings matrix: rows = userId, columns = movieId
ratingmat <- dcast(ratings, userId ~ movieId, value.var = "rating")
ratingmat <- as.matrix(ratingmat[, -1]) # remove the userId column
# Convert the rating matrix into a recommenderlab sparse matrix
library(recommenderlab)
ratingmat <- as(ratingmat, "realRatingMatrix")
ratingmat
6040 x 3706 rating matrix of class 'realRatingMatrix' with 1000209 ratings.
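A quick density check, using only the counts printed above, shows how sparse this matrix is:

1000209 / (6040 * 3706) # fraction of user-movie pairs with a rating
[1] 0.0446836

Only about 4.5% of all possible user-movie pairs are rated, which is why the sparse realRatingMatrix representation is used.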
Identifying the available recommendation algorithms and models:
recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
[1] "HYBRID_realRatingMatrix" "ALS_realRatingMatrix"
[3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"
[5] "LIBMF_realRatingMatrix" "POPULAR_realRatingMatrix"
[7] "RANDOM_realRatingMatrix" "RERECOMMEND_realRatingMatrix"
[9] "SVD_realRatingMatrix" "SVDF_realRatingMatrix"
[11] "UBCF_realRatingMatrix"
$HYBRID_realRatingMatrix
[1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."
$ALS_realRatingMatrix
[1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."
$ALS_implicit_realRatingMatrix
[1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."
$IBCF_realRatingMatrix
[1] "Recommender based on item-based collaborative filtering."
$LIBMF_realRatingMatrix
[1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."
$POPULAR_realRatingMatrix
[1] "Recommender based on item popularity."
$RANDOM_realRatingMatrix
[1] "Produce random recommendations (real ratings)."
$RERECOMMEND_realRatingMatrix
[1] "Re-recommends highly rated items (real ratings)."
$SVD_realRatingMatrix
[1] "Recommender based on SVD approximation with column-mean imputation."
$SVDF_realRatingMatrix
[1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."
$UBCF_realRatingMatrix
[1] "Recommender based on user-based collaborative filtering."
Default parameters for the IBCF method:
$k
[1] 30
$method
[1] "cosine"
$normalize
[1] "center"
$normalize_sim_matrix
[1] FALSE
$alpha
[1] 0.5
$na_as_zero
[1] FALSE
Default parameters for the UBCF method:
$method
[1] "cosine"
$nn
[1] 25
$sample
[1] FALSE
$weighted
[1] TRUE
$normalize
[1] "center"
$min_matching_items
[1] 0
$min_predictive_items
[1] 0
similarity_users <- similarity(ratingmat[1:4, ],
method = "cosine",
which = "users")
par(mfrow=c(1,2))
as.matrix(similarity_users)
1 2 3 4
1 NA 0.9924242 0.9757859 0.9969045
2 0.9924242 NA 0.9821403 0.9798724
3 0.9757859 0.9821403 NA 0.9886995
4 0.9969045 0.9798724 0.9886995 NA
similarity_items <- similarity(ratingmat[, 1:4], method =
"cosine", which = "items")
as.matrix(similarity_items)
1 2 3 4
1 NA 0.9744467 0.9668976 0.9733027
2 0.9744467 NA 0.9612445 0.9507502
3 0.9668976 0.9612445 NA 0.9569447
4 0.9733027 0.9507502 0.9569447 NA
Examining the distribution of key variables in the dataset helps reveal user behavior patterns and item popularity. These insights guide model selection and optimization.
vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
[1] 5 0 4 3 2 1
table_ratings <- table(vector_ratings) # what is the count of each rating value
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)
par(mfrow=c(1,2))
qplot(vector_ratings) +
ggtitle("Distribution of the vector ratings")+
theme(plot.title = element_text(hjust = 0.5))
The plot shows the distribution of rating values, with ratings from 1 to 5. Most ratings cluster around 3, 4, and 5, indicating users generally give positive feedback.
The heatmap visualizes the rating patterns of the first 20 users across 25 movies. The sparse distribution of colored cells highlights the typical sparsity in recommendation datasets, where most user-item pairs have no rating.
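The chunk that produced this heatmap is not shown in the text; a minimal sketch that reproduces it with recommenderlab's image() method for realRatingMatrix objects (dimensions taken from the caption above) would be:

image(ratingmat[1:20, 1:25],
      main = "Heatmap of the first 20 users and 25 movies")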
We keep only users who have rated more than 50 movies and only movies that have received more than 50 ratings:
ratings_movies <- ratingmat[rowCounts(ratingmat) > 50,
colCounts(ratingmat) > 50]
ratings_movies_norm <- normalize(ratings_movies)
We will explore both item-based and user-based collaborative filtering models.
Item-based Collaborative Filtering
Item-based collaborative filtering, also known as item-to-item filtering, is a recommendation approach that focuses on the similarity between items. It leverages user ratings to find items that are closely related, and recommends items similar to those a user has already liked or rated highly.
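For two items i and j, cosine similarity (the default IBCF method, as the parameter listing above shows) compares their rating columns over the users who rated both; in standard notation (ours, not from the report):

$$\text{sim}(i,j) = \frac{\sum_{u} r_{ui}\, r_{uj}}{\sqrt{\sum_{u} r_{ui}^{2}}\; \sqrt{\sum_{u} r_{uj}^{2}}}$$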
The training and test datasets are defined as follows:
which_train <- sample(x = c(TRUE, FALSE),
size = nrow(ratings_movies),
replace = TRUE,
prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
$k
[1] 30
$method
[1] "cosine"
$normalize
[1] "center"
$normalize_sim_matrix
[1] FALSE
$alpha
[1] 0.5
$na_as_zero
[1] FALSE
recc_model <- Recommender(data = recc_data_train,
method = "IBCF",
parameter = list(k = 30))
recc_model
Recommender of type 'IBCF' for 'realRatingMatrix'
learned using 3396 users.
[1] "Recommender"
attr(,"package")
[1] "recommenderlab"
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
[1] 2499 2499
n_items_top <- 20
image(model_details$sim[1:n_items_top, 1:n_items_top],
main = "Heatmap of the first rows and columns")
This heatmap displays the similarity matrix for the first 20 items in the collaborative filtering model. It visualizes how closely related these items are to each other, which is the basis for the item-based recommendations.
row_sums <- rowSums(model_details$sim > 0) # positive similarities per item
table(row_sums)
row_sums
  30
2499
Every one of the 2,499 items has exactly 30 positive similarities, matching the chosen k = 30.
col_sums <- colSums(model_details$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of the column count")
This plot shows the distribution of the number of positive similarities each item has with other items. It helps understand how many neighbors each item has in the similarity matrix, which impacts recommendation diversity.
n_recommended <- 10 # the number of items to recommend to each user
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
Recommendations as 'topNList' with n = 10 for 851 users.
recc_user_1 <- recc_predicted@items[[1]] # recommendations for the first user
movies_user_1 <- recc_predicted@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1 # pre-allocate; filled with titles below
for (i in 1:10){
  movies_user_2[i] <- as.character(subset(movies,
                                          movies$movieId == movies_user_1[i])$title)
}
movies_user_2
[1] "Apollo 13 (1995)" "Die Hard: With a Vengeance (1995)"
[3] "True Lies (1994)" "Blade Runner (1982)"
[5] "Close Shave, A (1995)" "Godfather, The (1972)"
[7] "Rear Window (1954)" "Monty Python's Life of Brian (1979)"
[9] "Reservoir Dogs (1992)" "Platoon (1986)"
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) }) # matrix with the recommendations for each user
#dim(recc_matrix)
recc_matrix[,1:4]
0 1 2 3
[1,] 150 11 107 11
[2,] 165 13 122 14
[3,] 380 17 333 50
[4,] 541 79 490 95
[5,] 745 92 708 161
[6,] 858 135 799 290
[7,] 904 169 848 296
[8,] 1080 261 900 302
[9,] 1089 328 932 326
[10,] 1090 345 982 334
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)
This chart shows the distribution of the number of items recommended by the Item-Based Collaborative Filtering (IBCF) model. It provides insight into how frequently different recommendations occur across users.
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
Movie title No of items
32 Twelve Monkeys (1995) 116
111 Taxi Driver (1976) 101
745 Close Shave, A (1995) 96
16 Casino (1995) 89
Item-Based Collaborative Filtering recommends items by finding similarities between items rather than users. Since items usually have more ratings than individual users, their rating patterns are more stable, reducing how often the model needs updating. When a user rates an item, similar items are quickly recommended based on the precomputed item similarity matrix.
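As a toy illustration of that last step, the sketch below scores an unrated item as a similarity-weighted average of the user's existing ratings. The 3x3 similarity matrix is hypothetical, and this is not recommenderlab's internal code, just the idea it implements:

# Hypothetical item-item similarity matrix
sim <- matrix(c(0,   0.8, 0.3,
                0.8, 0,   0.5,
                0.3, 0.5, 0),
              nrow = 3,
              dimnames = list(paste0("item", 1:3), paste0("item", 1:3)))
user_ratings <- c(item1 = 5, item2 = NA, item3 = 2) # item2 is unrated
rated <- !is.na(user_ratings)
# Weighted average of the rated items, weighted by their similarity to item2
score_item2 <- sum(sim["item2", rated] * user_ratings[rated]) /
  sum(sim["item2", rated])
score_item2 # (0.8*5 + 0.5*2) / (0.8 + 0.5) ≈ 3.85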
User-based Collaborative Filtering Model
User-based collaborative filtering operates on the principle that users with similar preferences tend to like similar items. It finds a set of users most similar to the target user and predicts ratings by taking a weighted average of their ratings on the items of interest.
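With the default center normalization shown below, the predicted rating for user u on item i is the user's mean rating plus a similarity-weighted average of the mean-centered ratings of the nn = 25 nearest neighbors N(u); in standard notation (ours, not from the report):

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} \text{sim}(u,v)\,\left( r_{vi} - \bar{r}_v \right)}{\sum_{v \in N(u)} \left| \text{sim}(u,v) \right|}$$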
recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
$method
[1] "cosine"
$nn
[1] 25
$sample
[1] FALSE
$weighted
[1] TRUE
$normalize
[1] "center"
$min_matching_items
[1] 0
$min_predictive_items
[1] 0
recc_model <- Recommender(data = recc_data_train, method = "UBCF") # restored; the model-building call was omitted in the text
recc_model
Recommender of type 'UBCF' for 'realRatingMatrix'
learned using 3396 users.
model_details <- getModel(recc_model)
model_details$data
3396 x 2499 rating matrix of class 'realRatingMatrix' with 732069 ratings.
Normalized using center on rows.
n_recommended <- 10
recc_predicted <- predict(object = recc_model,
newdata = recc_data_test,
n = n_recommended)
recc_predicted
Recommendations as 'topNList' with n = 10 for 851 users.
recc_matrix <- sapply(recc_predicted@items,
function(x){ as.integer(colnames(ratings_movies)[x]) })
#dim(recc_matrix)
recc_matrix[, 1:4]
0 1 2 3
[1,] 125 12 1261 12
[2,] 2750 328 1732 328
[3,] 3911 330 308 330
[4,] 3316 842 381 842
[5,] 3409 1973 1719 2974
[6,] 6 2107 913 3420
[7,] 1711 2793 2070 1614
[8,] 2028 2974 2575 1320
[9,] 1287 3264 368 2278
[10,] 1301 663 852 2107
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)
The chart shows the distribution of the number of items recommended by the User-Based Collaborative Filtering (UBCF) model. It highlights how many recommendations users typically receive, revealing the variability in recommendation counts across the user base.
number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)), number_of_items_top)
for (i in 1:4){
table_top[i,1] <- as.character(subset(movies,
movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
Movie title No of items
3613 Things Change (1988) 83
3491 My Chauffeur (1986) 77
12 Dracula: Dead and Loving It (1995) 73
1658 Life Less Ordinary, A (1997) 73
percentage_training <- 0.8 # restored; used below but never defined in the text
min(rowCounts(ratings_movies)) # "given" must stay below the least active user's rating count
[1] 31
items_to_keep <- 5 # number of items given to the recommender for each test user
rating_threshold <- 3 # minimum rating considered a good rating
n_eval <- 1 # number of times to run the evaluation
eval_sets <- evaluationScheme(data = ratings_movies,
method = "split",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
eval_sets
Evaluation scheme with 5 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=3.000000
Data set: 4247 x 2499 rating matrix of class 'realRatingMatrix' with 918946 ratings.
getData(eval_sets, "train")
3397 x 2499 rating matrix of class 'realRatingMatrix' with 734871 ratings.
getData(eval_sets, "known")
850 x 2499 rating matrix of class 'realRatingMatrix' with 4142 ratings.
getData(eval_sets, "unknown")
850 x 2499 rating matrix of class 'realRatingMatrix' with 179933 ratings.
# Bootstrapping the data
# Bootstrapping is another approach to splitting the data: the same user can be sampled more than once, so if the training set has the same size as before, more distinct users end up in the test set.
eval_sets <- evaluationScheme(data = ratings_movies,
method = "bootstrap",
train = percentage_training,
given = items_to_keep,
goodRating = rating_threshold,
k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))
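To visualize how often each user was resampled into the bootstrap training set, a short sketch in the same qplot style used elsewhere in this report:

qplot(n_repetitions) +
  ggtitle("Number of repetitions of each user in the training set")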
# Using cross-validation to validate models
# The k-fold cross-validation approach is the most accurate one: we split the data into k chunks, hold one chunk out as the test set, and evaluate the accuracy; we then repeat with each remaining chunk and average the accuracies.
n_fold <- 4
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
## Evaluating the ratings
# The k-fold approach has been picked for evaluation (this re-creates the same scheme defined above)
eval_sets <- evaluationScheme(data = ratings_movies,
method = "cross-validation",
k = n_fold,
given = items_to_keep,
goodRating = rating_threshold)
model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data = getData(eval_sets, "train"),
method = model_to_evaluate,
parameter = model_parameters)
items_to_recommend <- 10
eval_prediction <- predict(object = eval_recommender,
newdata = getData(eval_sets, "known"),
n = items_to_recommend,
type = "ratings")
# Per-user accuracy (one row per test user)
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
                                        data = getData(eval_sets, "unknown"),
                                        byUser = TRUE)
# Aggregate accuracy over all test users (overwrites the per-user results)
eval_accuracy <- calcPredictionAccuracy(x = eval_prediction,
                                        data = getData(eval_sets, "unknown"),
                                        byUser = FALSE)
eval_accuracy
RMSE MSE MAE
1.1405128 1.3007695 0.7906891
The evaluation metrics show the model’s prediction accuracy on the test data. The Root Mean Squared Error (RMSE) is approximately 1.14, indicating the average deviation between predicted and actual ratings. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) values further quantify prediction errors, with lower values reflecting better model performance.
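For reference, with actual ratings r_ui and predicted ratings r̂_ui over the n held-out user-item pairs, these metrics are defined as:

$$\text{MSE} = \frac{1}{n}\sum_{(u,i)} \left( r_{ui} - \hat{r}_{ui} \right)^2, \qquad \text{RMSE} = \sqrt{\text{MSE}}, \qquad \text{MAE} = \frac{1}{n}\sum_{(u,i)} \left| r_{ui} - \hat{r}_{ui} \right|$$

(As a consistency check, 1.1405^2 ≈ 1.3008, matching the RMSE and MSE printed above.)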
Next, we compare the top-N recommendations against the held-out items that carry a positive rating. The evaluate() call that produced the output below was omitted from the text; reconstructed from the four-fold timing log and the n column of the results, it was of the form:
results <- evaluate(x = eval_sets,
                    method = model_to_evaluate,
                    n = seq(10, 100, 10))
IBCF run fold/sample [model time/prediction time]
1 [73.17sec/0.28sec]
2 [72.53sec/0.31sec]
3 [72.32sec/0.26sec]
4 [71.64sec/0.32sec]
head(getConfusionMatrix(results)[[1]]) # confusion matrix for the first fold (call restored)
TP FP FN TN N precision recall
[1,] 2.497180 4.672932 176.9417 2309.848 2493.96 0.3291064 0.01513258
[2,] 4.103383 8.538534 175.3355 2305.982 2493.96 0.3052514 0.02332487
[3,] 5.390038 11.782895 174.0489 2302.738 2493.96 0.2949786 0.02956454
[4,] 6.417293 14.604323 173.0216 2299.916 2493.96 0.2884792 0.03404508
[5,] 7.247180 17.043233 172.1917 2297.477 2493.96 0.2837337 0.03740871
[6,] 7.872180 19.256579 171.5667 2295.264 2493.96 0.2789557 0.03982976
TPR FPR n
[1,] 0.01513258 0.002032013 10
[2,] 0.02332487 0.003734972 20
[3,] 0.02956454 0.005177913 30
[4,] 0.03404508 0.006441565 40
[5,] 0.03740871 0.007542171 50
[6,] 0.03982976 0.008549408 60
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum] # sum across the four folds
head(indices_summed)
TP FP FN TN
[1,] 10.15508 18.82801 698.3393 9248.678
[2,] 16.80357 34.36184 691.6908 9233.144
[3,] 21.92575 47.29417 686.5686 9220.211
[4,] 25.84305 58.20489 682.6513 9209.301
[5,] 29.02914 67.56673 679.4652 9199.939
[6,] 31.52068 75.94173 676.9737 9191.564
The ROC and Precision-Recall curves demonstrate the trade-off between precision and recall as the number of recommended items increases. Higher recall comes at the cost of lower precision, highlighting the balance needed to optimize recommendation quality.
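The columns in the tables above follow the usual confusion-matrix definitions, where TP are items both recommended and positively rated, FP recommended but not positively rated, FN positively rated but not recommended, and TN neither:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \text{TPR} = \frac{TP}{TP + FN}, \qquad \text{FPR} = \frac{FP}{FP + TN}$$

Note that recommenderlab averages these ratios per user before reporting them, so recomputing them from the averaged TP/FP/FN/TN counts above will not reproduce the printed columns exactly.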
To identify the most effective recommendation approach, we compared different collaborative filtering models. This includes both item-based and user-based methods, each evaluated using Cosine similarity and Pearson correlation metrics. Additionally, a random recommendation model was included as a baseline for benchmarking.
The following models are considered for comparison:

• Item-based collaborative filtering, using cosine as the distance function
• Item-based collaborative filtering, using the Pearson correlation as the distance function
• User-based collaborative filtering, using cosine as the distance function
• User-based collaborative filtering, using the Pearson correlation as the distance function
• Random recommendations, to serve as a baseline
models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF",
param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF",
param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF",
param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF",
param = list(method = "pearson")),
random = list(name = "RANDOM", param=NULL)
)
n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets,
method = models_to_evaluate,
n = n_recommendations)
IBCF run fold/sample [model time/prediction time]
1 [72.33sec/0.31sec]
2 [72.07sec/0.53sec]
3 [72.57sec/0.24sec]
4 [72.09sec/0.25sec]
IBCF run fold/sample [model time/prediction time]
1 [51.91sec/0.26sec]
2 [51.6sec/0.23sec]
3 [52.54sec/0.24sec]
4 [52.6sec/0.26sec]
UBCF run fold/sample [model time/prediction time]
1 [0.04sec/49.37sec]
2 [0.05sec/49.41sec]
3 [0.05sec/48.31sec]
4 [0.05sec/46.34sec]
UBCF run fold/sample [model time/prediction time]
1 [0.05sec/17.29sec]
2 [0.04sec/18.27sec]
3 [0.04sec/16.94sec]
4 [0.04sec/17.37sec]
RANDOM run fold/sample [model time/prediction time]
1 [0.04sec/0.9sec]
2 [0.05sec/0.97sec]
3 [0.03sec/0.92sec]
4 [0.06sec/0.94sec]
sapply(list_results, class) == "evaluationResults" # reconstructed check that every model was evaluated
IBCF_cos IBCF_cor UBCF_cos UBCF_cor random
TRUE TRUE TRUE TRUE TRUE
Averaged evaluation results for one of the models (the call that produced this table was omitted from the text):
N precision recall TPR
[1,] 2494 0.3787891 0.002106721 0.002106721
[2,] 2494 0.3504439 0.008925973 0.008925973
[3,] 2494 0.3344106 0.015716293 0.015716293
[4,] 2494 0.3109248 0.024629497 0.024629497
[5,] 2494 0.2998529 0.031020490 0.031020490
[6,] 2494 0.2925109 0.035371292 0.035371292
Comparing the charts gives us the opportunity to pick the ideal model.
The graphs show that UBCF with cosine distance is the best-performing model.
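The charts referred to here are not embedded in this text. The standard recommenderlab calls that produce them from list_results would be:

# ROC curves (TPR vs. FPR) for all five models
plot(list_results, annotate = 1, legend = "topleft")
title("ROC curve")
# Precision-recall curves
plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")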
We implemented an Alternating Least Squares (ALS) model using Spark ML to efficiently handle large-scale data and predict user ratings. This scalable approach leverages distributed computing for faster model training and evaluation.
library(sparklyr) # needed for spark_config() / spark_connect()
# connect to spark
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"
config$`sparklyr.shell.executor-memory` <- "8G"
# The memory limit could not be configured on my low-resource laptop.
sc <- spark_connect(master = "local", config = config)
#Assign the dataset to Spark
spark_df <- ratings
#Building Train and Test dataset
train <- sample(x = c(TRUE, FALSE), size = nrow(spark_df),
replace = TRUE, prob = c(0.8, 0.2))
train_df <- spark_df[train, ]
test_df <- spark_df[!train, ]
#Copy function to Spark
spark_train <- sdf_copy_to(sc, train_df, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, test_df, "test_ratings", overwrite = TRUE)
sparkALS <- ml_als(spark_train, max_iter = 5, nonnegative = TRUE,
rating_col = "rating", user_col = "userId", item_col = "movieId")
sparkPred <- sparkALS$.jobj %>%
invoke("transform", spark_dataframe(spark_test)) %>% collect()
sparkPred <- sparkPred[!is.na(sparkPred$prediction), ]
#Calculate error
mse_spark <- mean((sparkPred$rating - sparkPred$prediction)^2)
rmse_spark <- sqrt(mse_spark)
mae_spark <- mean(abs(sparkPred$rating - sparkPred$prediction))
# disconnect from spark
spark_disconnect(sc)
spark_pred_datatable <- rename(sparkPred, c("spark_prediction"="prediction")) # plyr-style rename: new name = old name
spark_pred_datatable <- subset(spark_pred_datatable, select=c(userId,movieId,rating,spark_prediction))
spark_error <- data.frame(RMSE = rmse_spark, MSE = mse_spark, MAE = mae_spark)
spark_error
RMSE MSE MAE
1 0.9020105 0.813623 0.7306402
The Spark ALS model achieved strong prediction accuracy with an RMSE of 0.90, MSE of 0.81, and MAE of 0.73, outperforming previous models and demonstrating its effectiveness for rating prediction.
library(DT)
datatable(spark_pred_datatable,
rownames = FALSE,
filter = "top",
options = list(pageLength = 5, scrollX = TRUE))
Evaluation Metrics for the Recommender System
In this project, we used Root Mean Square Error (RMSE) as the primary evaluation metric to assess the accuracy of predicted ratings for our recommender models (UBCF, IBCF, and Spark ALS). RMSE measures the average magnitude of prediction errors, making it a natural choice for rating prediction systems.
We selected RMSE because our main objective was to predict user ratings as accurately as possible, rather than to rank items. Among the models tested, the ALS matrix-factorization recommender achieved the lowest RMSE, demonstrating superior performance in estimating user preferences.
While RMSE effectively evaluates prediction accuracy, it does not fully capture how well recommendations align with user interests in top-N recommendation lists. Future work could incorporate top-N metrics such as Precision@K or Hit Rate to better evaluate recommendation relevance and user experience.
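For completeness, Precision@K measures the fraction of the K recommended items that the user actually found relevant (here, items rated at or above the good-rating threshold):

$$\text{Precision@}K = \frac{\left| \{\text{relevant items}\} \cap \{\text{top-}K\ \text{recommendations}\} \right|}{K}$$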
This project analyzed Collaborative Filtering models using both Item-Based Collaborative Filtering (IBCF) and User-Based Collaborative Filtering (UBCF) on the MovieLens dataset. The results indicate that UBCF with cosine similarity delivered the best performance among the traditional approaches. Additionally, the Spark ALS model produced rating predictions that closely align with the actual ratings in the dataset, demonstrating its effectiveness for large-scale recommendation tasks.