DATA 612 Project 5 — Recommender System on Spark

Goal

In Project 3 I built matrix factorization recommenders (SVD and SGD) in plain R. To keep them fast, I had to shrink the data down to 1,000 users and 800 movies.

This project runs the same idea on Apache Spark instead. Spark handles the full MovieLens 10M dataset without subsampling, which is the main thing I learned here. I use Spark’s built-in ALS model, which is also matrix factorization and so in the same family as the Project 3 models, and compare the results.

Setup

library(sparklyr)
library(dplyr)

# Install Spark once if you don't have it:
# spark_install(version = "3.5.0")

# Connect to Spark in local mode (single node, which is fine for this project)
sc <- spark_connect(master = "local")

Load the data

MovieLens 10M ships as a ::-separated file. I read it in with base R, write it back out as a regular CSV, and then load that file into Spark. The spark_read_csv step is where the data moves into Spark’s engine.
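To see why the read step below splits on a single ":" and then throws away every other column: base R's read.delim only accepts a one-character separator, so each "::" is treated as two separators with an empty field in between. Here is a minimal sketch using one made-up line in the userId::movieId::rating::timestamp layout (the values are illustrative, not copied from the file):

```r
# One made-up line in the ratings.dat layout (illustrative values only)
line <- "1::122::5::838985046"

# Splitting on a single ":" leaves an empty field between each pair of values
fields <- strsplit(line, ":", fixed = TRUE)[[1]]
length(fields)   # 4 real values plus 3 empty strings
fields

# Keep the odd positions (1, 3, 5, 7), which is what the "NULL" entries
# in colClasses do when the whole file is read
parsed <- fields[c(TRUE, FALSE)]
parsed
```

The same odd/even pattern is why colClasses in the real read alternates between a type and "NULL".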

# Download once, then unzip:
# download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip",
#               "ml-10m.zip")
# unzip("ml-10m.zip")

# The file uses "::" separators, so read locally and rewrite as clean CSV first.
# read.delim only takes a one-character separator, so splitting on ":" yields
# an empty column between each real one; "NULL" in colClasses skips a column.
# The timestamp is not needed, so it is skipped on read as well.
ratings_raw <- read.delim(
  "ml-10M100K/ratings.dat",
  sep = ":", header = FALSE, colClasses = c(
    "integer", "NULL", "integer", "NULL", "numeric", "NULL", "NULL"
  )
)
colnames(ratings_raw) <- c("userId", "movieId", "rating")

write.csv(ratings_raw, "ratings_clean.csv", row.names = FALSE)

# Now hand it to Spark
ratings <- spark_read_csv(sc, name = "ratings", path = "ratings_clean.csv")

sdf_dim(ratings)

## [1] 10000054        3

Train/test split

Spark has its own split function. Same 80/20 idea as Project 3.

splits <- sdf_random_split(ratings, training = 0.8, test = 0.2, seed = 42)
train <- splits$training
test  <- splits$test
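
For intuition, here is a small base R sketch of the same idea on a toy data frame. It assumes, as I understand Spark's random split, that each row is assigned independently with the given weights, so the result is close to 80/20 rather than exactly 80/20. The toy data frame and its size are made up for illustration, and this is not how sparklyr implements the split internally.

```r
set.seed(42)                        # seed used only to make this sketch repeatable
n   <- 10000                        # hypothetical number of rows
toy <- data.frame(id = seq_len(n))

# Assign each row independently: roughly 80% land in training
in_train  <- runif(n) < 0.8
toy_train <- toy[in_train, , drop = FALSE]
toy_test  <- toy[!in_train, , drop = FALSE]

# The share is close to 0.8 but not exact, and every row lands in one part
nrow(toy_train) / n
nrow(toy_train) + nrow(toy_test) == n
```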

Train the ALS model

ALS (Alternating Least Squares) is Spark’s matrix factorization recommender. Instead of learning everything at once, it fixes the user factors and solves for the movie factors, then flips and repeats. With one side fixed, each user (or movie) can be solved independently of the others, and that is what makes the back-and-forth run well in parallel.
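
The alternating idea can be sketched in a few lines of base R on a toy matrix. This is a simplified illustration, not Spark's implementation: the ratings are made up, the rank is 2 instead of 20, and the penalty is a plain ridge term (Spark's version handles regularization and parallelism differently). The helper names solve_side and rmse_obs are my own.

```r
set.seed(1)
# Toy ratings: 4 users x 3 movies, NA = not rated (made-up values)
R <- matrix(c( 5,  4, NA,
               4, NA,  1,
              NA,  2,  5,
               1,  1,  4), nrow = 4, byrow = TRUE)
k      <- 2    # latent factors (the real model uses rank = 20)
lambda <- 0.1  # ridge penalty, playing the role of reg_param
U <- matrix(rnorm(nrow(R) * k), nrow(R), k)  # user factors, random start
V <- matrix(rnorm(ncol(R) * k), ncol(R), k)  # movie factors, random start

# One ridge regression per row of `ratings`, holding `fixed` constant.
# Each row is solved on its own, which is the part that parallelizes.
solve_side <- function(ratings, fixed, lambda) {
  t(sapply(seq_len(nrow(ratings)), function(i) {
    seen <- !is.na(ratings[i, ])
    Fs   <- fixed[seen, , drop = FALSE]
    solve(crossprod(Fs) + lambda * diag(ncol(Fs)),
          crossprod(Fs, ratings[i, seen]))
  }))
}

# RMSE over the observed entries only
rmse_obs <- function(R, U, V) sqrt(mean((R - U %*% t(V))^2, na.rm = TRUE))

rmse_start <- rmse_obs(R, U, V)
for (iter in 1:10) {
  U <- solve_side(R, V, lambda)     # fix movies, solve for users
  V <- solve_side(t(R), U, lambda)  # fix users, solve for movies
}
rmse_end <- rmse_obs(R, U, V)
c(start = rmse_start, end = rmse_end)
```

The fit on the observed entries should improve a lot from the random start, which is all this sketch is meant to show.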

I time the training so I can compare it to Project 3.

start <- Sys.time()

als_model <- ml_als(
  train,
  rating_col = "rating",
  user_col   = "userId",
  item_col   = "movieId",
  rank       = 20,      # 20 latent factors, same as Project 3's k
  max_iter   = 10,
  reg_param  = 0.1,
  cold_start_strategy = "drop"
)

train_time <- Sys.time() - start
train_time

## Time difference of 11.39835 secs

Evaluate

Predict on the test set and compute RMSE, the same metric as Project 3.

predictions <- ml_predict(als_model, test)

rmse <- ml_regression_evaluator(
  predictions,
  label_col      = "rating",
  prediction_col = "prediction",
  metric_name    = "rmse"
)

rmse

## [1] 0.81918
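
As a reminder of what that number means, RMSE is the square root of the average squared error. A tiny base R check with made-up ratings (not taken from the test set) shows the arithmetic. Note that with cold_start_strategy = "drop", test rows for users or movies the model never saw get no prediction and are left out of the score.

```r
# Made-up actual and predicted ratings (illustrative only)
actual    <- c(4, 3, 5, 2)
predicted <- c(3.5, 3, 4, 3)

# RMSE: square the errors, average them, take the square root
errors   <- actual - predicted     # 0.5, 0, 1, -1
toy_rmse <- sqrt(mean(errors^2))   # sqrt((0.25 + 0 + 1 + 1) / 4) = 0.75
toy_rmse
```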

Compare to Project 3

Method	Data	RMSE
Project 3: Classic SVD	1K users, subsampled	0.972
Project 3: SGD-SVD	1K users, subsampled	0.989
Project 5: Spark ALS	Full 10M ratings	0.819

Conclusion

The Spark ALS model trained on the full 10M dataset and hit an RMSE of 0.819. That beats Project 3’s 0.972 and 0.989. I think the main reason is data: Project 3 had to shrink to 1,000 users to run quickly in plain R, and less data means worse predictions. Spark trained on all 10 million ratings, so it learned more and predicted better. The comparison is not perfectly like-for-like, since the models and test sets differ, but the gap is large. Spark handled a dataset I could not use at full size in plain R, and got a better result doing it.

But Spark is not free. Setup takes longer and the code is more involved. For a small dataset like Project 3’s sample, plain R is faster and simpler, because you skip the overhead of starting a Spark session and moving data into it.

So when is Spark worth it? Not at Project 3’s size: a matrix of a few thousand users fits in memory and runs fine in R. Spark starts to pay off when the data no longer fits in memory on one machine, or when training in R gets too slow to iterate on. For MovieLens, I would expect that shift to happen somewhere between the 1M and 10M versions. Below that point, the overhead costs more than it saves.
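
A rough back-of-envelope calculation shows the scale difference. It assumes a dense user-by-movie matrix of doubles at 8 bytes per cell, which may not be exactly how Project 3 stored its data, and it uses approximate counts for MovieLens 10M (about 70,000 users and 10,700 movies, which are my assumption and not figures from this report).

```r
bytes_per_cell <- 8  # one double-precision number

# Size of a dense user x movie matrix in gigabytes (1 GB = 1e9 bytes)
dense_gb <- function(users, movies) users * movies * bytes_per_cell / 1e9

project3 <- dense_gb(1000, 800)      # Project 3's subsample
full_10m <- dense_gb(70000, 10700)   # approximate MovieLens 10M shape (assumption)

# The long (userId, movieId, rating) form stores only the observed ratings
long_gb <- 10000054 * 3 * bytes_per_cell / 1e9

round(c(project3 = project3, full_10m = full_10m, long_format = long_gb), 3)
```

Under these assumptions the dense matrix goes from a few megabytes to several gigabytes, while the long format Spark works with stays well under a gigabyte.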

spark_disconnect(sc)