This project develops and compares several recommender system models on the MovieLens 1M dataset, which contains 1 million ratings from over 6,000 users on nearly 4,000 movies. Three types of recommendation algorithms are explored. User-Based Collaborative Filtering (UBCF) predicts a user's preferences from the preferences of similar users; similarity is computed with either Pearson correlation or cosine similarity, and models are tested with different neighborhood sizes to observe the effect on performance. Item-Based Collaborative Filtering (IBCF) instead focuses on similarity between items, recommending movies similar to those a user has already rated highly; like UBCF, it uses Pearson and cosine similarity to measure item-to-item closeness. The third model, Alternating Least Squares (ALS), is a matrix factorization method implemented with Apache Spark: it decomposes the user-item rating matrix into lower-dimensional latent factor representations, allowing it to scale effectively to large datasets. Each model is evaluated on prediction accuracy, measured by Root Mean Square Error (RMSE), and on the time it takes to train and make predictions, giving a comprehensive comparison of both effectiveness and computational efficiency.
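For reference, RMSE over a test set of $N$ held-out ratings is the standard root mean squared error, where $\hat{r}_j$ is the predicted rating and $r_j$ the actual rating:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{j=1}^{N}\left(\hat{r}_j - r_j\right)^2}$$

Lower is better; an RMSE of 1.0 means predictions are off by about one star on average.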
Load the data and preview the merged data frame.
# Load dplyr for the pipe and data-manipulation verbs used below
library(dplyr)
# Set your data path
ratings_path <- "/Users/kevindiperna/Desktop/ml-1m"
# Read and parse ratings.dat (fields are "::"-delimited)
ratings_raw <- readLines(file.path(ratings_path, "ratings.dat"))
ratings_split <- strsplit(ratings_raw, "::", fixed = TRUE)
ratings_df <- as.data.frame(do.call(rbind, ratings_split), stringsAsFactors = FALSE)
colnames(ratings_df) <- c("UserID", "MovieID", "Rating", "Timestamp")
ratings_df <- ratings_df %>%
  mutate(across(c(UserID, MovieID, Rating), as.numeric))
# Read and parse movies.dat (titles in the 1M release are Latin-1 encoded)
movies_raw <- readLines(file.path(ratings_path, "movies.dat"), encoding = "latin1")
movies_split <- strsplit(movies_raw, "::", fixed = TRUE)
movies_df <- as.data.frame(do.call(rbind, movies_split), stringsAsFactors = FALSE)
colnames(movies_df) <- c("MovieID", "Title", "Genres")
movies_df$MovieID <- as.numeric(movies_df$MovieID)
# Merge ratings with movie titles
merged_df <- merge(ratings_df, movies_df, by = "MovieID")
# Preview
head(merged_df)
## MovieID UserID Rating Timestamp Title Genres
## 1 1 4643 4 963998178 Toy Story (1995) Animation|Children's|Comedy
## 2 1 5359 5 960547997 Toy Story (1995) Animation|Children's|Comedy
## 3 1 1181 3 974948836 Toy Story (1995) Animation|Children's|Comedy
## 4 1 4834 3 979094751 Toy Story (1995) Animation|Children's|Comedy
## 5 1 5661 5 958780306 Toy Story (1995) Animation|Children's|Comedy
## 6 1 646 5 975782835 Toy Story (1995) Animation|Children's|Comedy
There are 6,040 unique users and 3,659 unique movies in the merged data; movies that received no ratings drop out of the merge.
# Summary of numeric columns
summary(merged_df[, c("Rating")])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 3.000 4.000 3.581 4.000 5.000
# Number of unique users and movies
cat("Number of unique users:", length(unique(merged_df$UserID)), "\n")
## Number of unique users: 6040
cat("Number of unique movies:", length(unique(merged_df$MovieID)), "\n")
## Number of unique movies: 3659
These are the 10 most-rated movies.
# Top 10 most rated movies
top_movies <- merged_df %>%
  group_by(Title) %>%
  summarise(Num_Ratings = n(), Avg_Rating = round(mean(Rating), 2)) %>%
  arrange(desc(Num_Ratings)) %>%
  slice_head(n = 10)
print(top_movies)
## # A tibble: 10 × 3
## Title Num_Ratings Avg_Rating
## <chr> <int> <dbl>
## 1 American Beauty (1999) 3428 4.32
## 2 Star Wars: Episode IV - A New Hope (1977) 2991 4.45
## 3 Star Wars: Episode V - The Empire Strikes Back (1980) 2990 4.29
## 4 Star Wars: Episode VI - Return of the Jedi (1983) 2883 4.02
## 5 Jurassic Park (1993) 2672 3.76
## 6 Saving Private Ryan (1998) 2653 4.34
## 7 Terminator 2: Judgment Day (1991) 2649 4.06
## 8 Matrix, The (1999) 2590 4.32
## 9 Back to the Future (1985) 2583 3.99
## 10 Silence of the Lambs, The (1991) 2578 4.35
These are the top 10 highest-rated movies with at least 500 ratings.
top_rated_movies <- merged_df %>%
  group_by(Title) %>%
  summarise(Avg_Rating = mean(Rating), Num_Ratings = n()) %>%
  filter(Num_Ratings >= 500) %>%
  arrange(desc(Avg_Rating)) %>%
  slice_head(n = 10)
print(top_rated_movies)
## # A tibble: 10 × 3
## Title Avg_Rating Num_Ratings
## <chr> <dbl> <int>
## 1 Seven Samurai (The Magnificent Seven) (Shichinin no s… 4.56 628
## 2 Shawshank Redemption, The (1994) 4.55 2227
## 3 Godfather, The (1972) 4.52 2223
## 4 Close Shave, A (1995) 4.52 657
## 5 Usual Suspects, The (1995) 4.52 1783
## 6 Schindler's List (1993) 4.51 2304
## 7 Wrong Trousers, The (1993) 4.51 882
## 8 Raiders of the Lost Ark (1981) 4.48 2514
## 9 Rear Window (1954) 4.48 1050
## 10 Star Wars: Episode IV - A New Hope (1977) 4.45 2991
Here is the distribution of the ratings. Ratings are whole stars from 1 to 5, and a rating of 4 or above is generally considered good (this is the goodRating = 4 threshold used in the evaluation schemes later).
library(ggplot2)
ggplot(merged_df, aes(x = Rating)) +
  geom_bar(fill = "steelblue") +
  labs(
    title = "Distribution of Movie Ratings",
    x = "Rating",
    y = "Count"
  ) +
  theme_minimal()
Cosine and Pearson are similarity measures used in collaborative filtering to determine how alike users or items are. Cosine similarity measures the cosine of the angle between two raw rating vectors, so it is unaffected by multiplicative rescaling; Pearson correlation first mean-centers each vector, which also removes additive per-user bias (a consistently generous or harsh rater). Neighborhood size (e.g., 10, 30) is how many of the most similar users or items are consulted when making a prediction; larger neighborhoods may capture broader patterns but increase computation time. Cosine is typically faster, while Pearson often yields more personalized results by accounting for user tendencies; it can, however, be less reliable on sparse data, since correlations computed from only a few co-rated items are noisy. A minimal sketch of the two measures follows.
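As an illustration on toy vectors (invented for this example, not drawn from the dataset or recommenderlab's internals):

# Toy example: two users who rated the same five movies
u1 <- c(5, 4, 4, 2, 3)
u2 <- c(4, 5, 3, 1, 2)
# Cosine similarity: cosine of the angle between the raw rating vectors
cosine_sim <- sum(u1 * u2) / (sqrt(sum(u1^2)) * sqrt(sum(u2^2)))
# Pearson correlation: equivalent to cosine similarity after mean-centering
# each vector, which removes a user's overall generosity or harshness
pearson_sim <- cor(u1, u2)
cat("Cosine:", round(cosine_sim, 3), "Pearson:", round(pearson_sim, 3), "\n")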
This code trains and evaluates four User-Based Collaborative Filtering (UBCF) models using the MovieLens 1M dataset. It tests different similarity methods (cosine and Pearson) and neighborhood sizes (10 and 30). For each model, it measures how accurately it predicts ratings using RMSE and also records how long the process takes. The results are combined into a table for easy comparison.
# recommenderlab provides realRatingMatrix, evaluationScheme, Recommender, etc.
library(recommenderlab)
# Convert to realRatingMatrix if not done already
rating_matrix <- as(merged_df %>%
                      select(UserID, MovieID, Rating) %>%
                      rename(user = UserID, item = MovieID, rating = Rating),
                    "realRatingMatrix")
# Define evaluation schemes. All four use identical settings (80/20 split,
# given = 10 ratings per test user, goodRating = 4); only the name differs,
# so each model is evaluated on its own random split.
set.seed(123)
schemes <- list(
  cosine_10  = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4),
  cosine_30  = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4),
  pearson_10 = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4),
  pearson_30 = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4)
)
# Train and evaluate models
results <- lapply(names(schemes), function(name) {
  sim <- ifelse(grepl("cosine", name), "cosine", "pearson")
  nn <- ifelse(grepl("10", name), 10, 30)
  start_time <- Sys.time()
  model <- Recommender(getData(schemes[[name]], "train"), method = "UBCF",
                       parameter = list(method = sim, nn = nn))
  pred <- predict(model, getData(schemes[[name]], "known"), type = "ratings")
  acc <- calcPredictionAccuracy(pred, getData(schemes[[name]], "unknown"))
  end_time <- Sys.time()
  data.frame(
    Model = paste0("UBCF_", sim, "_nn", nn),
    RMSE = acc["RMSE"],
    Time_Sec = round(as.numeric(difftime(end_time, start_time, units = "secs")), 2)
  )
})
# Combine and print results
ubcf_comparison <- do.call(rbind, results)
print(ubcf_comparison)
## Model RMSE Time_Sec
## RMSE UBCF_cosine_nn10 1.248572 28.06
## RMSE1 UBCF_cosine_nn30 1.199807 28.48
## RMSE2 UBCF_pearson_nn10 1.207462 25.24
## RMSE3 UBCF_pearson_nn30 1.124321 25.99
The results compare four User-Based Collaborative Filtering (UBCF) models across similarity measures (cosine and Pearson) and neighborhood sizes (10 and 30). The model using Pearson similarity with 30 neighbors achieved the best performance, with the lowest RMSE of 1.124, indicating the most accurate predictions. In general, Pearson outperformed cosine, and increasing the neighborhood from 10 to 30 improved accuracy for both methods. All models had similar computation times, about 25 to 28 seconds, suggesting that choosing the right similarity measure and neighborhood size can significantly improve recommendation quality at little added cost.
This code evaluates two Item-Based Collaborative Filtering (IBCF) models using cosine and Pearson similarity. It splits the rating data into training and test sets, trains each IBCF model on 80% of the data, and predicts ratings for the remaining 20%, measuring both accuracy (RMSE) and the time taken by each model.
# Define evaluation schemes for IBCF (same settings as the UBCF schemes above)
ibcf_schemes <- list(
  cosine  = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4),
  pearson = evaluationScheme(rating_matrix, method = "split", train = 0.8, given = 10, goodRating = 4)
)
# Train and evaluate each IBCF model with timing
ibcf_results <- lapply(names(ibcf_schemes), function(name) {
  sim_method <- name
  start_time <- Sys.time()
  model <- Recommender(getData(ibcf_schemes[[name]], "train"), method = "IBCF",
                       parameter = list(method = sim_method))
  pred <- predict(model, getData(ibcf_schemes[[name]], "known"), type = "ratings")
  acc <- calcPredictionAccuracy(pred, getData(ibcf_schemes[[name]], "unknown"))
  end_time <- Sys.time()
  duration <- round(as.numeric(difftime(end_time, start_time, units = "secs")), 2)
  data.frame(
    Model = paste0("IBCF_", sim_method),
    RMSE = acc["RMSE"],
    Time_Sec = duration
  )
})
# Combine and display results
ibcf_comparison <- do.call(rbind, ibcf_results)
print(ibcf_comparison)
## Model RMSE Time_Sec
## RMSE IBCF_cosine 1.53430 125.42
## RMSE1 IBCF_pearson 1.60727 61.15
The results indicate that the Item-Based Collaborative Filtering (IBCF) model using cosine similarity outperformed the one using Pearson similarity in prediction accuracy, achieving a lower RMSE of 1.534 versus 1.607. However, the added accuracy came at the cost of longer computation time: 125.42 seconds for the cosine model versus 61.15 seconds for the Pearson model. This highlights a trade-off between accuracy and efficiency, where the cosine method delivers better recommendations but requires roughly twice the processing time.
The ALS (Alternating Least Squares) model is a matrix factorization technique commonly used in collaborative filtering recommender systems. It decomposes the user-item rating matrix into two lower-dimensional factor matrices, one for users and one for items, such that the dot product of a user's factor vector and an item's factor vector approximates that user's rating of the item. ALS iterates by fixing one factor matrix and solving a least-squares problem for the other, minimizing the error between predicted and actual ratings. The Spark implementation (Spark ALS) leverages distributed computing through Apache Spark, making it scalable for large datasets. Its key advantages include the ability to handle massive data, built-in regularization to prevent overfitting, and native support for parallel computation, which dramatically speeds up training and prediction for big recommender systems. The objective ALS minimizes is sketched below.
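In standard form, with $\Omega$ the set of observed (user, item) pairs, $\lambda$ corresponding to reg_param, and the dimension of the factor vectors to rank in the code that follows:

$$\min_{U,\,V}\; \sum_{(u,i)\,\in\,\Omega} \left(r_{ui} - \mathbf{u}_u^{\top}\mathbf{v}_i\right)^2 + \lambda\left(\lVert U\rVert_F^2 + \lVert V\rVert_F^2\right)$$

With $V$ fixed, each user vector $\mathbf{u}_u$ has a closed-form ridge-regression solution, and vice versa; alternating these two solves is what gives the method its name.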
library(sparklyr)
library(dplyr)
library(recommenderlab)
# Start Spark session (make sure Spark 3.4.1 is installed)
spark_disconnect_all()
## [1] 0
sc <- spark_connect(master = "local", version = "3.4.1")
# Convert rating matrix to data frame
df <- as(rating_matrix, "data.frame") %>%
  mutate(
    user = as.integer(as.factor(user)),
    item = as.integer(as.factor(item)),
    rating = as.numeric(rating)
  ) %>%
  na.omit()
# Upload data to Spark
ratings_tbl <- copy_to(sc, df, name = "ratings_tbl", overwrite = TRUE)
# Split into training and testing
splits <- ratings_tbl %>% sdf_random_split(training = 0.8, test = 0.2, seed = 42)
training_tbl <- splits$training
test_tbl <- splits$test
# Train ALS model
start_time_als <- Sys.time()
als_model <- ml_als(
  training_tbl,
  rating_col = "rating",
  user_col = "user",
  item_col = "item",
  rank = 10,
  reg_param = 0.1,
  max_iter = 10,
  cold_start_strategy = "drop"
)
# Make predictions on test set
predictions <- ml_predict(als_model, test_tbl)
# Evaluate RMSE
rmse_als <- ml_regression_evaluator(
  predictions,
  label_col = "rating",
  prediction_col = "prediction",
  metric_name = "rmse"
)
end_time_als <- Sys.time()
time_als <- round(as.numeric(difftime(end_time_als, start_time_als, units = "secs")), 2)
# Output
cat("ALS RMSE:", round(rmse_als, 4), "\n")
## ALS RMSE: 0.8711
cat("Time Taken (sec):", time_als, "\n")
## Time Taken (sec): 22.01
library(ggplot2)
library(dplyr)
# Collect the ALS result, reusing the RMSE and timing computed above
# instead of hardcoding values from an earlier run
als_result <- data.frame(
  Model = "ALS_Spark",
  RMSE = round(rmse_als, 4),
  Time_Sec = time_als
)
# Combine all results into one data frame
all_results <- bind_rows(ubcf_comparison, ibcf_comparison, als_result)
# RMSE Comparison Plot
ggplot(all_results, aes(x = reorder(Model, RMSE), y = RMSE, fill = Model)) +
  geom_bar(stat = "identity", width = 0.6) +
  coord_flip() +
  labs(
    title = "RMSE Comparison Across Models",
    x = "Model",
    y = "RMSE"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = round(RMSE, 3)), hjust = -0.1, size = 3.2)
Among all models, the Spark ALS (Alternating Least Squares) model achieves the lowest RMSE of 0.871, indicating the highest prediction accuracy. Within the UBCF (User-Based Collaborative Filtering) models, the Pearson similarity with 30 neighbors performs best, with an RMSE of 1.124, followed by the Cosine similarity with 30 neighbors. The IBCF (Item-Based Collaborative Filtering) models have the highest RMSE values, with IBCF using Pearson similarity being the least accurate. Overall, the graph shows that matrix factorization using Spark ALS outperforms traditional memory-based collaborative filtering methods, and that increasing neighborhood size tends to improve the accuracy of UBCF models.
The MovieLens 1M dataset is large and sparse, with over 1 million ratings from 6,000 users across nearly 4,000 movies. This sparsity makes it challenging for memory-based methods like UBCF and IBCF, which rely on finding similar users or items based on co-rated content. In contrast, the ALS model performs better because it uses matrix factorization to learn latent features that generalize well even with missing data. This ability to capture underlying patterns in sparse datasets explains why ALS achieved the lowest RMSE, while IBCF and UBCF showed higher error rates due to their dependence on direct similarity computations.
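To make the sparsity concrete, a quick check on the merged data (using only names defined above): with 6,040 users and 3,659 movies, roughly 95% of the possible user-movie cells are empty.

# Fraction of the user-movie matrix that is actually observed
n_users   <- length(unique(merged_df$UserID))
n_movies  <- length(unique(merged_df$MovieID))
n_ratings <- nrow(merged_df)
density   <- n_ratings / (n_users * n_movies)
cat(sprintf("Observed cells: %.1f%%; sparsity: %.1f%%\n",
            100 * density, 100 * (1 - density)))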
# Time Taken Comparison Plot
ggplot(all_results, aes(x = reorder(Model, Time_Sec), y = Time_Sec, fill = Model)) +
  geom_bar(stat = "identity", width = 0.6) +
  coord_flip() +
  labs(
    title = "Computation Time Comparison Across Models",
    x = "Model",
    y = "Time (Seconds)"
  ) +
  theme_minimal() +
  theme(legend.position = "none") +
  geom_text(aes(label = round(Time_Sec, 2)), hjust = -0.1, size = 3.2)
The computation-time comparison shows that the IBCF models, especially IBCF with cosine similarity, are the most expensive to compute, with IBCF_cosine taking just over two minutes (125.42 seconds). In contrast, the ALS_Spark model was the fastest, completing in about 22 seconds. Among the UBCF models, computation times were similar and moderate, roughly 25 to 28 seconds, with the Pearson models slightly faster than cosine. This suggests that ALS not only delivers the best RMSE but is also highly efficient, making it a strong choice for large-scale, time-sensitive recommender systems.
Spark ALS (Alternating Least Squares) is well-suited for large datasets like MovieLens 1M because it is built to run in a distributed computing environment, allowing data to be processed in parallel across multiple cores or machines. This makes it highly scalable and efficient for handling massive user-item matrices. ALS is also optimized for sparse data, which is common in recommender systems where most users rate only a small portion of items. Additionally, Spark’s implementation of ALS includes performance optimizations, memory efficiency, and built-in strategies to manage missing values, such as the cold start “drop” option.
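As a closing sketch (assuming the Spark session sc and the fitted als_model from the chunk above are still live), sparklyr's ml_recommend() can turn the learned factors into top-N recommendation lists, and the session should be disconnected when finished:

# Top-5 movie recommendations per user, materialized from the fitted ALS factors
top_recs <- ml_recommend(als_model, type = "items", n = 5)
head(top_recs)
# Release Spark resources when done
spark_disconnect(sc)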