DATA 612 - Final Project

Bikash Bhowmik, Rupendra Shrestha - 19 Jul 2025

Instruction

Hands on with AWS. This assignment is designed to give you hands-on experience deploying a cloud solution. Take one of your previous assignments and deploy it in AWS. You may also choose to do this as part of your final project and submit both this assignment and the final project at the same time. To get full credit for this assignment, your cloud solution must include the following pieces:

• Utilize some sort of long-term storage. This can be either a database like RDS or file storage like S3.
• Utilize some sort of compute service. This can be a virtual machine like EC2 or a compute solution like AWS Lambda.
• Utilize at least one VPC to ensure that you or your application are the only ones that can access your data.

If you choose to, you may pay for AWS services, but keep in mind it can get expensive. With that said, this assignment has been designed so that it can be completed using only the capacity provided by the AWS Free Tier. All the services in the Free Tier can be found here: https://aws.amazon.com/free/?all-free-tier

Submission: Submit a video link where you walk through all three aspects mentioned above. Discuss why you chose the specific solution. What are its advantages and disadvantages for the compute and storage parts?

Introduction

This project focuses on analyzing the MovieLens dataset using Collaborative Filtering techniques to predict user-item ratings. Specifically, it implements the Alternating Least Squares (ALS) algorithm in Apache Spark to generate recommendations and evaluate prediction accuracy.

Collaborative Filtering is a popular method for making automated predictions about a user’s interests by leveraging the preferences and behaviors of many users. With Spark’s ALS model, both users and items are represented by a set of latent factors, which are learned from the data and used to estimate missing ratings.
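
Concretely, ALS learns a factor vector $p_u$ for each user and $q_i$ for each item and predicts a missing rating as their inner product, $\hat{r}_{ui} = p_u^{\top} q_i$. In standard notation (ours, not output from this project), the regularized objective it alternately minimizes is

$$\min_{P, Q} \sum_{(u,i)\ \text{observed}} \left( r_{ui} - p_u^{\top} q_i \right)^2 + \lambda \left( \lVert p_u \rVert^2 + \lVert q_i \rVert^2 \right)$$

Holding the item factors fixed makes the problem an ordinary least-squares fit for the user factors, and vice versa, which is what makes each alternating step cheap and easy to distribute in Spark.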

Dataset

We have used the MovieLens 1M dataset, which contains 1 million ratings from 6,000 users on 4,000 movies. Each rating is an integer between 1 and 5, and the dataset includes demographic information about users (age, gender, occupation, zip code) as well as movie metadata (title, genres). This well-structured dataset is widely used in recommender system research and is suitable for building collaborative filtering and hybrid models.

movies <- read.csv("http://recom-system-data-final-project.s3-website.us-east-2.amazonaws.com/movies.csv", 
                   stringsAsFactors = FALSE)

ratings <- read.csv("http://recom-system-data-final-project.s3-website.us-east-2.amazonaws.com/ratings.csv", 
                    stringsAsFactors = FALSE)
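
Should the S3 bucket above become unavailable, the same table can be rebuilt from the original MovieLens 1M files, which are "::"-delimited; a minimal sketch, assuming the archive has been unzipped locally to ml-1m/:

# ratings.dat uses "::" as its separator; read.csv() cannot take a
# multi-character sep, so substitute a single-character one first
raw <- readLines("ml-1m/ratings.dat")
ratings_alt <- read.csv(text = gsub("::", ",", raw, fixed = TRUE),
                        header = FALSE,
                        col.names = c("userId", "movieId", "rating", "timestamp"))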

Data Exploration

summary(movies)
    movieId          title              genres         
 Min.   :   1.0   Length:3883        Length:3883       
 1st Qu.: 982.5   Class :character   Class :character  
 Median :2010.0   Mode  :character   Mode  :character  
 Mean   :1986.0                                        
 3rd Qu.:2980.5                                        
 Max.   :3952.0                                        
summary(ratings)
     userId        movieId         rating        timestamp        
 Min.   :   1   Min.   :   1   Min.   :1.000   Min.   :9.567e+08  
 1st Qu.:1506   1st Qu.:1030   1st Qu.:3.000   1st Qu.:9.653e+08  
 Median :3070   Median :1835   Median :4.000   Median :9.730e+08  
 Mean   :3025   Mean   :1866   Mean   :3.582   Mean   :9.722e+08  
 3rd Qu.:4476   3rd Qu.:2770   3rd Qu.:4.000   3rd Qu.:9.752e+08  
 Max.   :6040   Max.   :3952   Max.   :5.000   Max.   :1.046e+09  
nrow(ratings)
[1] 1000209
ratings %>% summarize(n_users = n_distinct(userId),n_movies = n_distinct(movieId))
  n_users n_movies
1    6040     3706
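
With 1,000,209 ratings spread over 6,040 users and 3,706 movies, the matrix is very sparse; a quick check:

# fraction of user-movie pairs that actually carry a rating
nrow(ratings) / (n_distinct(ratings$userId) * n_distinct(ratings$movieId))
# ~0.045, i.e. roughly 95.5% of the matrix is empty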
par(mfrow=c(2,2))  # note: par() has no effect on the ggplot2 plots below; each renders on its own
ratings %>% 
  count(movieId) %>% 
  ggplot(aes(n)) + 
  geom_histogram( bins=30, color = "red") +
  scale_x_log10() + 
  ggtitle("Movies") +
  labs(subtitle  ="number of ratings by movieId", 
       x="movieId" , 
       y="number of ratings", 
       caption ="source data : ratings set") +
  theme(panel.border = element_rect(colour="black", fill=NA)) 

This histogram shows the distribution of the number of ratings per movie on a log scale. Most movies have relatively few ratings, while a small number receive a large volume of ratings. This long-tail pattern is typical in recommendation datasets and highlights the popularity imbalance across items.

ratings %>% 
  count(userId) %>% 
  ggplot(aes(n)) + 
  geom_histogram( bins=30, color = "gold") +
  scale_x_log10() + 
  ggtitle("Users") +
  labs(subtitle ="number of ratings by UserId", 
       x="userId" , 
       y="number of ratings") +
  theme(panel.border = element_rect(colour="black", fill=NA)) 

This histogram illustrates how many ratings each user has given. Most users rate a relatively small number of movies, while a few are highly active. The log scale reveals a wide variation in user engagement, which is important to consider when building personalized recommendation models.

kable(head(ratings),caption = "Sample Ratings Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea7872") 
Sample Ratings Dataset
userId movieId rating timestamp
1 1193 5 978300760
1 661 3 978302109
1 914 3 978301968
1 3408 4 978300275
1 2355 5 978824291
1 1197 3 978302268
kable(head(movies),caption = "Sample Movies Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea7872")
Sample Movies Dataset
movieId title genres
1 Toy Story (1995) Animation|Children’s|Comedy
2 Jumanji (1995) Adventure|Children’s|Fantasy
3 Grumpier Old Men (1995) Comedy|Romance
4 Waiting to Exhale (1995) Comedy|Drama
5 Father of the Bride Part II (1995) Comedy
6 Heat (1995) Action|Crime|Thriller

Data Wrangling

We have processed the dataset to ensure it is clean and ready for analysis. This involves merging the rating data with user and movie metadata, converting data types as needed, and handling missing or inconsistent entries.

genres <- as.data.frame(movies$genres, stringsAsFactors=FALSE)
genres2 <- as.data.frame(tstrsplit(genres[,1], '[|]', 
                                   type.convert=TRUE), 
                         stringsAsFactors=FALSE)
colnames(genres2) <- paste0("V", 1:ncol(genres2))  # Dynamic colnames to avoid length mismatch

genre_list <- c("Action", "Adventure", "Animation", "Children", 
                "Comedy", "Crime", "Documentary", "Drama", "Fantasy",
                "Film-Noir", "Horror", "Musical", "Mystery", "Romance",
                "Sci-Fi", "Thriller", "War", "Western")  # 18 genres
# Caveat: the raw data spells the genre "Children's", which never matches
# "Children" exactly, so that column stays 0 (visible in the sample table below)

# genre_matrix <- matrix(0, 10330, 18)  # 10329 movies + 1 header row
genre_matrix <- matrix(0, nrow(movies) + 1, length(genre_list))  # dynamically match movie count
genre_matrix[1,] <- genre_list
colnames(genre_matrix) <- genre_list


# Fill genre matrix
for (i in 1:nrow(genres2)) {
  for (c in 1:ncol(genres2)) {
    genmat_col <- which(genre_matrix[1,] == genres2[i,c])
    if (length(genmat_col) > 0) {
      genre_matrix[i+1, genmat_col] <- 1
    }
  }
}

# Convert to data frame
genre_matrix2 <- as.data.frame(genre_matrix[-1,], stringsAsFactors=FALSE)  # Remove header row
for (c in 1:ncol(genre_matrix2)) {
  genre_matrix2[,c] <- as.integer(genre_matrix2[,c])
}

# Combine with movie info
search_matrix <- cbind(movies[,1:2], genre_matrix2)

# Table display
kable(head(search_matrix), caption = "Sample Movies Dataset") %>%
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive")) %>%
  row_spec(0, bold = T, color = "white", background = "#ea7872") %>%
  scroll_box(width = "100%", height = "300px")
Sample Movies Dataset
movieId title Action Adventure Animation Children Comedy Crime Documentary Drama Fantasy Film-Noir Horror Musical Mystery Romance Sci-Fi Thriller War Western
1 Toy Story (1995) 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
2 Jumanji (1995) 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0
3 Grumpier Old Men (1995) 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0
4 Waiting to Exhale (1995) 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0
5 Father of the Bride Part II (1995) 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0
6 Heat (1995) 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0
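
The nested loop above works, but the same one-hot encoding can be built in a few vectorized lines; a sketch using the same `movies` and `genre_list` objects (note that substring matching also catches "Children's", unlike the exact comparison above):

# one column per genre: 1 if the genre string contains that genre, else 0
genre_matrix_alt <- sapply(genre_list, function(g) {
  as.integer(grepl(g, movies$genres, fixed = TRUE))
})
search_matrix_alt <- cbind(movies[, 1:2], as.data.frame(genre_matrix_alt))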

Model Exploration

We have explored various recommendation algorithms to identify the most effective approach. Starting with baseline models, such as user-based (UBCF) and item-based collaborative filtering (IBCF), we then explored advanced techniques, including matrix factorization, to improve prediction accuracy and recommendation quality.

library(data.table)

# Convert ratings to data.table so dcast() dispatches to the data.table method
setDT(ratings)

# Create the ratings matrix: rows = userId, columns = movieId
# (user-movie pairs without a rating are filled with NA)
ratingmat <- dcast(ratings, userId ~ movieId, value.var = "rating")
ratingmat <- as.matrix(ratingmat[,-1]) # remove the userId column
# Convert the rating matrix into a recommenderlab sparse matrix
ratingmat <- as(ratingmat, "realRatingMatrix")
ratingmat
6040 x 3706 rating matrix of class 'realRatingMatrix' with 1000209 ratings.

Identifying the available algorithms and recommendation models

recommender_models <- recommenderRegistry$get_entries(dataType = "realRatingMatrix")
names(recommender_models)
 [1] "HYBRID_realRatingMatrix"       "ALS_realRatingMatrix"         
 [3] "ALS_implicit_realRatingMatrix" "IBCF_realRatingMatrix"        
 [5] "LIBMF_realRatingMatrix"        "POPULAR_realRatingMatrix"     
 [7] "RANDOM_realRatingMatrix"       "RERECOMMEND_realRatingMatrix" 
 [9] "SVD_realRatingMatrix"          "SVDF_realRatingMatrix"        
[11] "UBCF_realRatingMatrix"        
lapply(recommender_models, "[[", "description")
$HYBRID_realRatingMatrix
[1] "Hybrid recommender that aggegates several recommendation strategies using weighted averages."

$ALS_realRatingMatrix
[1] "Recommender for explicit ratings based on latent factors, calculated by alternating least squares algorithm."

$ALS_implicit_realRatingMatrix
[1] "Recommender for implicit data based on latent factors, calculated by alternating least squares algorithm."

$IBCF_realRatingMatrix
[1] "Recommender based on item-based collaborative filtering."

$LIBMF_realRatingMatrix
[1] "Matrix factorization with LIBMF via package recosystem (https://cran.r-project.org/web/packages/recosystem/vignettes/introduction.html)."

$POPULAR_realRatingMatrix
[1] "Recommender based on item popularity."

$RANDOM_realRatingMatrix
[1] "Produce random recommendations (real ratings)."

$RERECOMMEND_realRatingMatrix
[1] "Re-recommends highly rated items (real ratings)."

$SVD_realRatingMatrix
[1] "Recommender based on SVD approximation with column-mean imputation."

$SVDF_realRatingMatrix
[1] "Recommender based on Funk SVD with gradient descend (https://sifter.org/~simon/journal/20061211.html)."

$UBCF_realRatingMatrix
[1] "Recommender based on user-based collaborative filtering."
recommender_models$IBCF_realRatingMatrix$parameters
$k
[1] 30

$method
[1] "cosine"

$normalize
[1] "center"

$normalize_sim_matrix
[1] FALSE

$alpha
[1] 0.5

$na_as_zero
[1] FALSE
recommender_models$UBCF_realRatingMatrix$parameters
$method
[1] "cosine"

$nn
[1] 25

$sample
[1] FALSE

$weighted
[1] TRUE

$normalize
[1] "center"

$min_matching_items
[1] 0

$min_predictive_items
[1] 0
similarity_users <- similarity(ratingmat[1:4, ], 
                               method = "cosine", 
                               which = "users")
par(mfrow=c(1,2))
as.matrix(similarity_users)
          1         2         3         4
1        NA 0.9924242 0.9757859 0.9969045
2 0.9924242        NA 0.9821403 0.9798724
3 0.9757859 0.9821403        NA 0.9886995
4 0.9969045 0.9798724 0.9886995        NA
image(as.matrix(similarity_users), main = "User similarity")

similarity_items <- similarity(ratingmat[, 1:4], method =
                                 "cosine", which = "items")
as.matrix(similarity_items)
          1         2         3         4
1        NA 0.9744467 0.9668976 0.9733027
2 0.9744467        NA 0.9612445 0.9507502
3 0.9668976 0.9612445        NA 0.9569447
4 0.9733027 0.9507502 0.9569447        NA
image(as.matrix(similarity_items), main = "Movies similarity")

Data Distribution

We examine the distribution of key variables in the dataset to help reveal user behavior patterns and item popularity. This analysis provides valuable insights that guide model selection and optimization.

vector_ratings <- as.vector(ratingmat@data)
unique(vector_ratings) # what are unique values of ratings
[1] 5 0 4 3 2 1
table_ratings <- table(vector_ratings) # what is the count of each rating value
vector_ratings <- vector_ratings[vector_ratings != 0] # rating == 0 are NA values
vector_ratings <- factor(vector_ratings)
par(mfrow=c(1,2))
qplot(vector_ratings) + 
  ggtitle("Distribution of the vector ratings")+
  theme(plot.title = element_text(hjust = 0.5))

The plot shows the distribution of rating values, with ratings from 1 to 5. Most ratings cluster around 3, 4, and 5, indicating users generally give positive feedback.

image(ratingmat[1:20, 1:25], main = "Heatmap of the first 20 rows and 25 columns")

The heatmap visualizes the rating patterns of the first 20 users across 25 movies. The sparse distribution of colored cells highlights the typical sparsity in recommendation datasets, where most user-item pairs have no rating.

We restrict the rating matrix to users who have rated more than 50 movies and to movies that have received more than 50 ratings:

ratings_movies <- ratingmat[rowCounts(ratingmat) > 50,
                             colCounts(ratingmat) > 50]
ratings_movies_norm <- normalize(ratings_movies)

Collaborative Filtering Model

We will explore the item-based and user-based collaborative filtering models.

Item-based Collaborative Filtering

Item-based collaborative filtering, also known as item-to-item filtering, is a recommendation approach that focuses on the similarity between items. It leverages user ratings to find items that are closely related, and recommends items similar to those a user has already liked or rated highly.
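
In rating terms, IBCF predicts a user's score for an item as a similarity-weighted average over the items that user has already rated (our notation, not the package's):

$$\hat{r}_{ui} = \frac{\sum_{j \in N(i)} s_{ij} \, r_{uj}}{\sum_{j \in N(i)} \lvert s_{ij} \rvert}$$

where $N(i)$ is the set of the $k$ most similar items to $i$ that the user has rated, and $s_{ij}$ is the item-item similarity (cosine here, with $k = 30$ below).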

The training and test datasets are defined as follows:

# An ~80/20 random split; call set.seed() first for a reproducible split
which_train <- sample(x = c(TRUE, FALSE), 
                      size = nrow(ratings_movies),
                      replace = TRUE, 
                      prob = c(0.8, 0.2))
recc_data_train <- ratings_movies[which_train, ]
recc_data_test <- ratings_movies[!which_train, ]

Building the IBCF model

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$IBCF_realRatingMatrix$parameters
$k
[1] 30

$method
[1] "cosine"

$normalize
[1] "center"

$normalize_sim_matrix
[1] FALSE

$alpha
[1] 0.5

$na_as_zero
[1] FALSE
recc_model <- Recommender(data = recc_data_train, 
                          method = "IBCF",
                          parameter = list(k = 30))
recc_model
Recommender of type 'IBCF' for 'realRatingMatrix' 
learned using 3396 users.
class(recc_model)
[1] "Recommender"
attr(,"package")
[1] "recommenderlab"
model_details <- getModel(recc_model)
class(model_details$sim) # this contains a similarity matrix
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"
dim(model_details$sim)
[1] 2499 2499
n_items_top <- 20
image(model_details$sim[1:n_items_top, 1:n_items_top],
      main = "Heatmap of the first rows and columns")

This heatmap displays the similarity matrix for the top 20 items based on the collaborative filtering model. It visualizes how closely related these items are to each other, which helps in making item-based recommendations.

row_sums <- rowSums(model_details$sim > 0)
table(row_sums)
row_sums
  30 
2499 
col_sums <- colSums(model_details$sim > 0)
qplot(col_sums) + stat_bin(binwidth = 1) + ggtitle("Distribution of the column count")

This plot shows the distribution of the number of positive similarities each item has with other items. It helps understand how many neighbors each item has in the similarity matrix, which impacts recommendation diversity.

n_recommended <- 10 # the number of items to recommend to each user
recc_predicted <- predict(object = recc_model, 
                          newdata = recc_data_test, 
                          n = n_recommended)
recc_predicted
Recommendations as 'topNList' with n = 10 for 851 users. 
recc_user_1 <- recc_predicted@items[[1]] # recommendation for the first user
movies_user_1 <- recc_predicted@itemLabels[recc_user_1]
movies_user_2 <- movies_user_1
for (i in 1:10){
  movies_user_2[i] <- as.character(subset(movies, 
                                         movies$movieId == movies_user_1[i])$title)
}
movies_user_2
 [1] "Apollo 13 (1995)"                    "Die Hard: With a Vengeance (1995)"  
 [3] "True Lies (1994)"                    "Blade Runner (1982)"                
 [5] "Close Shave, A (1995)"               "Godfather, The (1972)"              
 [7] "Rear Window (1954)"                  "Monty Python's Life of Brian (1979)"
 [9] "Reservoir Dogs (1992)"               "Platoon (1986)"                     
recc_matrix <- sapply(recc_predicted@items, 
                      function(x){ as.integer(colnames(ratings_movies)[x]) }) # matrix with the recommendations for each user
#dim(recc_matrix)
recc_matrix[,1:4]
         0   1   2   3
 [1,]  150  11 107  11
 [2,]  165  13 122  14
 [3,]  380  17 333  50
 [4,]  541  79 490  95
 [5,]  745  92 708 161
 [6,]  858 135 799 290
 [7,]  904 169 848 296
 [8,] 1080 261 900 302
 [9,] 1089 328 932 326
[10,] 1090 345 982 334
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for IBCF"
qplot(number_of_items) + ggtitle(chart_title)

This chart shows the distribution of the number of items recommended by the Item-Based Collaborative Filtering (IBCF) model. It provides insight into how frequently different recommendations occur across users.

number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)),
                       number_of_items_top)
for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
              Movie title No of items
32  Twelve Monkeys (1995)         116
111    Taxi Driver (1976)         101
745 Close Shave, A (1995)          96
16          Casino (1995)          89

Item-Based Collaborative Filtering recommends items by finding similarities between items rather than users. Since items usually have more ratings than individual users, their rating patterns are more stable, reducing how often the model needs updating. When a user rates an item, similar items are quickly recommended based on the precomputed item similarity matrix.

User-based Collaborative Filtering Model

User-based collaborative filtering operates on the principle that users with similar preferences tend to like similar items. It finds a set of users most similar to the target user and predicts ratings by taking a weighted average of their ratings on the items of interest.
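
Formally, with mean-centered ratings (the `normalize = "center"` default shown below), the prediction is

$$\hat{r}_{ui} = \bar{r}_u + \frac{\sum_{v \in N(u)} s_{uv} \, (r_{vi} - \bar{r}_v)}{\sum_{v \in N(u)} \lvert s_{uv} \rvert}$$

where $N(u)$ is the neighborhood of the nn = 25 most similar users to $u$, $s_{uv}$ is their cosine similarity, and $\bar{r}_u$ is the user's mean rating.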

Building the UBCF model

recommender_models <- recommenderRegistry$get_entries(dataType ="realRatingMatrix")
recommender_models$UBCF_realRatingMatrix$parameters
$method
[1] "cosine"

$nn
[1] 25

$sample
[1] FALSE

$weighted
[1] TRUE

$normalize
[1] "center"

$min_matching_items
[1] 0

$min_predictive_items
[1] 0
recc_model <- Recommender(data = recc_data_train, method = "UBCF")
recc_model
Recommender of type 'UBCF' for 'realRatingMatrix' 
learned using 3396 users.
model_details <- getModel(recc_model)
#names(model_details)
model_details$data
3396 x 2499 rating matrix of class 'realRatingMatrix' with 732069 ratings.
Normalized using center on rows.
n_recommended <- 10
recc_predicted <- predict(object = recc_model,
                          newdata = recc_data_test, 
                          n = n_recommended) 
recc_predicted
Recommendations as 'topNList' with n = 10 for 851 users. 
recc_matrix <- sapply(recc_predicted@items, 
                      function(x){ as.integer(colnames(ratings_movies)[x]) })
#dim(recc_matrix)
recc_matrix[, 1:4]
         0    1    2    3
 [1,]  125   12 1261   12
 [2,] 2750  328 1732  328
 [3,] 3911  330  308  330
 [4,] 3316  842  381  842
 [5,] 3409 1973 1719 2974
 [6,]    6 2107  913 3420
 [7,] 1711 2793 2070 1614
 [8,] 2028 2974 2575 1320
 [9,] 1287 3264  368 2278
[10,] 1301  663  852 2107
number_of_items <- factor(table(recc_matrix))
chart_title <- "Distribution of the number of items for UBCF"
qplot(number_of_items) + ggtitle(chart_title)

The chart shows the distribution of the number of items recommended by the User-Based Collaborative Filtering (UBCF) model. It highlights how many recommendations users typically receive, revealing the variability in recommendation counts across the user base.

number_of_items_sorted <- sort(number_of_items, decreasing = TRUE)
number_of_items_top <- head(number_of_items_sorted, n = 4)
table_top <- data.frame(as.integer(names(number_of_items_top)), number_of_items_top)
for (i in 1:4){
  table_top[i,1] <- as.character(subset(movies, 
                                         movies$movieId == table_top[i,1])$title)
}
colnames(table_top) <- c("Movie title", "No of items")
head(table_top)
                            Movie title No of items
3613               Things Change (1988)          83
3491                My Chauffeur (1986)          77
12   Dracula: Dead and Loving It (1995)          73
1658       Life Less Ordinary, A (1997)          73

Evaluation of model

percentage_training <- 0.8

min(rowCounts(ratings_movies)) 
[1] 31
items_to_keep <- 5 #number of items to generate recommendations
rating_threshold <- 3 # threshold with the minimum rating that is considered good
n_eval <- 1 #number of times to run evaluation
eval_sets <- evaluationScheme(data = ratings_movies, 
                              method = "split",
                              train = percentage_training, 
                              given = items_to_keep, 
                              goodRating = rating_threshold, 
                              k = n_eval) 
eval_sets
Evaluation scheme with 5 items given
Method: 'split' with 1 run(s).
Training set proportion: 0.800
Good ratings: >=3.000000
Data set: 4247 x 2499 rating matrix of class 'realRatingMatrix' with 918946 ratings.
getData(eval_sets, "train") # training set
3397 x 2499 rating matrix of class 'realRatingMatrix' with 734871 ratings.
getData(eval_sets, "known") # set with the items used to build the recommendations
850 x 2499 rating matrix of class 'realRatingMatrix' with 4142 ratings.
getData(eval_sets, "unknown") # set with the items used to test the recommendations
850 x 2499 rating matrix of class 'realRatingMatrix' with 179933 ratings.
# Bootstrapping the data

#Bootstrapping is another approach to splitting the data. The same user can be sampled more than once and, if the training set has the same size as it did earlier, there will be more users in the test set.

eval_sets <- evaluationScheme(data = ratings_movies, 
                              method = "bootstrap", 
                              train = percentage_training, 
                              given = items_to_keep,
                              goodRating = rating_threshold, 
                              k = n_eval)
table_train <- table(eval_sets@runsTrain[[1]])
n_repetitions <- factor(as.vector(table_train))

# Using cross-validation to validate models

#The k-fold cross-validation approach is the most accurate one: we split the data into k chunks, take one chunk out as the test set, and evaluate the accuracy. Then we do the same with each remaining chunk and compute the average accuracy.

n_fold <- 4
eval_sets <- evaluationScheme(data = ratings_movies, 
                              method = "cross-validation",
                              k = n_fold, 
                              given = items_to_keep, 
                              goodRating = rating_threshold)

## Evaluating the ratings

#k-fold approach has been picked for evaluation

eval_sets <- evaluationScheme(data = ratings_movies, 
                              method = "cross-validation",
                              k = n_fold, 
                              given = items_to_keep, 
                              goodRating = rating_threshold)
model_to_evaluate <- "IBCF"
model_parameters <- NULL
eval_recommender <- Recommender(data = getData(eval_sets, "train"),
                                method = model_to_evaluate, 
                                parameter = model_parameters)
items_to_recommend <- 10
eval_prediction <- predict(object = eval_recommender, 
                           newdata = getData(eval_sets, "known"), 
                           n = items_to_recommend, 
                           type = "ratings")

eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, 
                                        data = getData(eval_sets, "unknown"), 
                                        byUser = TRUE)

Accuracy Measures

eval_accuracy <- calcPredictionAccuracy(x = eval_prediction, 
                                        data = getData(eval_sets, "unknown"), 
                                        byUser = FALSE) 
eval_accuracy
     RMSE       MSE       MAE 
1.1405128 1.3007695 0.7906891 

The evaluation metrics show the model’s prediction accuracy on the test data. The Root Mean Squared Error (RMSE) is approximately 1.14, indicating the average deviation between predicted and actual ratings. The Mean Squared Error (MSE) and Mean Absolute Error (MAE) values further quantify prediction errors, with lower values reflecting better model performance.
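
For reference, over the $n$ held-out (prediction, rating) pairs these metrics are defined as

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} (\hat{r}_i - r_i)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \lvert \hat{r}_i - r_i \rvert$$

RMSE squares the errors before averaging, so it penalizes large misses more heavily than MAE does.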

Probability thresholds

Here we compare the top-N recommendations against the held-out items with a positive rating (at least 3, the goodRating threshold set above).

results <- evaluate(x = eval_sets, 
                    method = model_to_evaluate, 
                    n = seq(10, 100, 10))
IBCF run fold/sample [model time/prediction time]
     1  [73.17sec/0.28sec] 
     2  [72.53sec/0.31sec] 
     3  [72.32sec/0.26sec] 
     4  [71.64sec/0.32sec] 
head(getConfusionMatrix(results)[[1]])
           TP        FP       FN       TN       N precision     recall
[1,] 2.497180  4.672932 176.9417 2309.848 2493.96 0.3291064 0.01513258
[2,] 4.103383  8.538534 175.3355 2305.982 2493.96 0.3052514 0.02332487
[3,] 5.390038 11.782895 174.0489 2302.738 2493.96 0.2949786 0.02956454
[4,] 6.417293 14.604323 173.0216 2299.916 2493.96 0.2884792 0.03404508
[5,] 7.247180 17.043233 172.1917 2297.477 2493.96 0.2837337 0.03740871
[6,] 7.872180 19.256579 171.5667 2295.264 2493.96 0.2789557 0.03982976
            TPR         FPR  n
[1,] 0.01513258 0.002032013 10
[2,] 0.02332487 0.003734972 20
[3,] 0.02956454 0.005177913 30
[4,] 0.03404508 0.006441565 40
[5,] 0.03740871 0.007542171 50
[6,] 0.03982976 0.008549408 60
columns_to_sum <- c("TP", "FP", "FN", "TN")
indices_summed <- Reduce("+", getConfusionMatrix(results))[, columns_to_sum]
head(indices_summed)
           TP       FP       FN       TN
[1,] 10.15508 18.82801 698.3393 9248.678
[2,] 16.80357 34.36184 691.6908 9233.144
[3,] 21.92575 47.29417 686.5686 9220.211
[4,] 25.84305 58.20489 682.6513 9209.301
[5,] 29.02914 67.56673 679.4652 9199.939
[6,] 31.52068 75.94173 676.9737 9191.564
plot(results, annotate = TRUE, main = "ROC curve")

plot(results, "prec/rec", annotate = TRUE, main = "Precision-recall")

The ROC and Precision-Recall curves demonstrate the trade-off between precision and recall as the number of recommended items increases. Higher recall comes at the cost of lower precision, highlighting the balance needed to optimize recommendation quality.
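
The quantities behind both curves come directly from the confusion-matrix columns above:

$$\text{precision} = \frac{TP}{TP + FP}, \qquad \text{recall} = \mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

As n grows, more held-out items are recovered (TP and recall rise), but more non-relevant items are recommended too (FP rises faster), which is the trade-off both plots display.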

Comparing models

To identify the most effective recommendation approach, we compared different collaborative filtering models. This includes both item-based and user-based methods, each evaluated using Cosine similarity and Pearson correlation metrics. Additionally, a random recommendation model was included as a baseline for benchmarking.

The following models are considered for comparison:

• Item-based collaborative filtering, using cosine as the distance function
• Item-based collaborative filtering, using the Pearson correlation as the distance function
• User-based collaborative filtering, using cosine as the distance function
• User-based collaborative filtering, using the Pearson correlation as the distance function
• Random recommendations, as a baseline

models_to_evaluate <- list(
IBCF_cos = list(name = "IBCF", 
                param = list(method = "cosine")),
IBCF_cor = list(name = "IBCF", 
                param = list(method = "pearson")),
UBCF_cos = list(name = "UBCF", 
                param = list(method = "cosine")),
UBCF_cor = list(name = "UBCF", 
                param = list(method = "pearson")),
random = list(name = "RANDOM", param=NULL)
)

n_recommendations <- c(1, 5, seq(10, 100, 10))
list_results <- evaluate(x = eval_sets, 
                         method = models_to_evaluate, 
                         n = n_recommendations)
IBCF run fold/sample [model time/prediction time]
     1  [72.33sec/0.31sec] 
     2  [72.07sec/0.53sec] 
     3  [72.57sec/0.24sec] 
     4  [72.09sec/0.25sec] 
IBCF run fold/sample [model time/prediction time]
     1  [51.91sec/0.26sec] 
     2  [51.6sec/0.23sec] 
     3  [52.54sec/0.24sec] 
     4  [52.6sec/0.26sec] 
UBCF run fold/sample [model time/prediction time]
     1  [0.04sec/49.37sec] 
     2  [0.05sec/49.41sec] 
     3  [0.05sec/48.31sec] 
     4  [0.05sec/46.34sec] 
UBCF run fold/sample [model time/prediction time]
     1  [0.05sec/17.29sec] 
     2  [0.04sec/18.27sec] 
     3  [0.04sec/16.94sec] 
     4  [0.04sec/17.37sec] 
RANDOM run fold/sample [model time/prediction time]
     1  [0.04sec/0.9sec] 
     2  [0.05sec/0.97sec] 
     3  [0.03sec/0.92sec] 
     4  [0.06sec/0.94sec] 
sapply(list_results, class) == "evaluationResults"
IBCF_cos IBCF_cor UBCF_cos UBCF_cor   random 
    TRUE     TRUE     TRUE     TRUE     TRUE 
avg_matrices <- lapply(list_results, avg)
head(avg_matrices$IBCF_cos[, 5:8])
        N precision      recall         TPR
[1,] 2494 0.3787891 0.002106721 0.002106721
[2,] 2494 0.3504439 0.008925973 0.008925973
[3,] 2494 0.3344106 0.015716293 0.015716293
[4,] 2494 0.3109248 0.024629497 0.024629497
[5,] 2494 0.2998529 0.031020490 0.031020490
[6,] 2494 0.2925109 0.035371292 0.035371292

Ideal model

Comparing the charts gives us the opportunity to pick the ideal model.

plot(list_results, annotate = 1, legend = "topleft") 
title("ROC curve")

plot(list_results, "prec/rec", annotate = 1, legend = "bottomright")
title("Precision-recall")

The graphs show that UBCF with cosine distance is the best-performing model.

Spark Implementation

We implemented an Alternating Least Squares (ALS) model using Spark ML to efficiently handle large-scale data and predict user ratings. This scalable approach leverages distributed computing for faster model training and evaluation.

# connect to spark
config <- spark_config()
config$`sparklyr.shell.driver-memory` <- "8G"
config$`sparklyr.shell.executor-memory` <- "8G"

# The memory limit could not be configured on my low-resource laptop.
sc <- spark_connect(master = "local", config = config)

# Local copy of the ratings; it is copied to Spark below with sdf_copy_to()
spark_df <- ratings

#Building Train and Test dataset
train <- sample(x = c(TRUE, FALSE), size = nrow(spark_df),
                      replace = TRUE, prob = c(0.8, 0.2))
train_df <- spark_df[train, ]
test_df <- spark_df[!train, ]

#Copy function to Spark
spark_train <- sdf_copy_to(sc, train_df, "train_ratings", overwrite = TRUE)
spark_test <- sdf_copy_to(sc, test_df, "test_ratings", overwrite = TRUE)


sparkALS <- ml_als(spark_train, max_iter = 5, nonnegative = TRUE, 
                   rating_col = "rating", user_col = "userId", item_col = "movieId")

# Score the test set by calling transform() on the underlying Java ALS model
sparkPred <- sparkALS$.jobj %>%
  invoke("transform", spark_dataframe(spark_test)) %>% collect()
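
# An alternative sketch (assuming a recent sparklyr): ml_predict() wraps
# the same transform() call on the fitted model:
#   sparkPred_alt <- ml_predict(sparkALS, spark_test) %>% collect()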



# Drop cold-start rows: users or movies absent from the training split
# receive NaN predictions
sparkPred <- sparkPred[!is.na(sparkPred$prediction), ] 

#Calculate error
mse_spark <- mean((sparkPred$rating - sparkPred$prediction)^2)
rmse_spark <- sqrt(mse_spark)
mae_spark <- mean(abs(sparkPred$rating - sparkPred$prediction))

# disconnect from spark
spark_disconnect(sc)

Spark Accuracy Measures

spark_pred_datatable <- rename(sparkPred, c("spark_prediction"="prediction"))
spark_pred_datatable <- subset(spark_pred_datatable, select=c(userId,movieId,rating,spark_prediction)) 

spark_error <- data.frame(RMSE = rmse_spark, MSE = mse_spark, MAE = mae_spark)
spark_error
       RMSE      MSE       MAE
1 0.9020105 0.813623 0.7306402

The Spark ALS model achieved strong prediction accuracy with an RMSE of 0.90, MSE of 0.81, and MAE of 0.73, outperforming previous models and demonstrating its effectiveness for rating prediction.
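
Putting the two sets of error metrics reported in this document side by side:

# error metrics as reported above (IBCF via recommenderlab, ALS via Spark)
data.frame(model = c("IBCF (recommenderlab)", "ALS (Spark)"),
           RMSE  = c(1.1405, 0.9020),
           MSE   = c(1.3008, 0.8136),
           MAE   = c(0.7907, 0.7306))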

Spark Prediction

library(DT)

datatable(spark_pred_datatable, 
          rownames = FALSE, 
          filter = "top", 
          options = list(pageLength = 5, scrollX = TRUE))

Evaluation Metrics for the Recommender System

In this project, we used Root Mean Square Error (RMSE) as the primary evaluation metric to assess the accuracy of predicted ratings for our recommender models (UBCF, IBCF, and Spark ALS). RMSE measures the average magnitude of prediction errors, making it a natural choice for rating prediction systems.

We selected RMSE because our main objective was to predict user ratings as accurately as possible, rather than to rank items. Among the models tested, the ALS matrix factorization recommender achieved the lowest RMSE, demonstrating superior performance in estimating user preferences.

While RMSE effectively evaluates prediction accuracy, it does not fully capture how well recommendations align with user interests in top-N recommendation lists. Future work could incorporate top-N metrics such as precision@N or Hit Rate to better evaluate recommendation relevance and user experience.
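
As a pointer for that future work, Hit Rate is straightforward once top-N lists exist; a minimal sketch with hypothetical inputs (topn_list, a per-user list of recommended movieIds; held_out, a per-user list of well-rated held-out movieIds):

# share of users for whom at least one held-out "good" item
# shows up in their top-N recommendations
hit_rate <- function(topn_list, held_out) {
  hits <- mapply(function(rec, truth) any(rec %in% truth),
                 topn_list, held_out)
  mean(hits)
}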

Summary

This project analyzed Collaborative Filtering models using both Item-Based Collaborative Filtering (IBCF) and User-Based Collaborative Filtering (UBCF) on the MovieLens dataset. The results indicate that UBCF with cosine similarity delivered the best performance among the traditional approaches. Additionally, the Spark ALS model produced rating predictions that closely align with the actual ratings in the dataset, demonstrating its effectiveness for large-scale recommendation tasks.