In Project 3 I built matrix factorization recommenders (SVD and SGD) in plain R. To keep them fast, I had to shrink the data down to 1,000 users and 800 movies.
This project runs the same idea on Apache Spark instead. I learned that Spark can handle the full MovieLens 10M dataset without subsampling. I use Spark’s built-in ALS model, which is also matrix factorization and so in the same family as the Project 3 models, and compare the results.
library(sparklyr)
library(dplyr)
# Install Spark once if you don't have it:
# spark_install(version = "3.5.0")
# Connect to Spark in local mode (single node, which is fine for this project)
sc <- spark_connect(master = "local")
MovieLens 10M ships as a ::-separated file. I read it into R, rewrite it as a clean CSV, then load that CSV into Spark. The spark_read_csv step is where the data moves into Spark’s engine.
# Download once, then unzip:
# download.file("https://files.grouplens.org/datasets/movielens/ml-10m.zip",
# "ml-10m.zip")
# unzip("ml-10m.zip")
# The file uses "::" separators, so read locally and rewrite as clean CSV first
ratings_raw <- read.delim(
  "ml-10M100K/ratings.dat",
  sep = ":", header = FALSE, colClasses = c(
    "integer", "NULL", "integer", "NULL", "numeric", "NULL", "character"
  )
)
colnames(ratings_raw) <- c("userId", "movieId", "rating", "timestamp")
ratings_raw$timestamp <- NULL
write.csv(ratings_raw, "ratings_clean.csv", row.names = FALSE)
# Now hand it to Spark
ratings <- spark_read_csv(sc, name = "ratings", path = "ratings_clean.csv")
sdf_dim(ratings)
## [1] 10000054 3
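The sep = ":" trick in the read step is easy to misread, so here is a minimal sketch of it on two made-up lines in the same layout. The toy_lines values are invented for illustration and are not rows I checked against the real file. Splitting on a single colon turns every "::" into an empty field, and the "NULL" entries in colClasses tell R to skip those empty fields.

```r
# Two made-up lines in the ratings.dat layout: userId::movieId::rating::timestamp
toy_lines <- c("1::122::5::838985046", "2::110::3.5::868245777")

# Splitting on ":" gives 7 fields per line: value, "", value, "", value, "", value.
# "NULL" in colClasses drops the three empty fields.
toy <- read.delim(
  text = toy_lines,
  sep = ":", header = FALSE,
  colClasses = c("integer", "NULL", "integer", "NULL", "numeric", "NULL", "character")
)
colnames(toy) <- c("userId", "movieId", "rating", "timestamp")
toy
```

This only works because the ratings file has no free-text column. A file with colons inside a field (movie titles, for example) would need a different approach.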
Spark has its own split function. Same 80/20 idea as Project 3.
splits <- sdf_random_split(ratings, training = 0.8, test = 0.2, seed = 42)
train <- splits$training
test <- splits$test
ALS (Alternating Least Squares) is Spark’s matrix factorization recommender. Instead of learning everything at once, it fixes the user factors and solves for the movie factors, then flips and repeats. With one side fixed, each user’s (or each movie’s) factors become a separate least-squares problem, and that independence is what makes ALS run well in parallel.
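To make the back-and-forth concrete, here is a minimal base R sketch of the alternating idea on a tiny made-up matrix. It is not Spark's implementation (Spark distributes the work and handles regularization in its own way). The matrix values, k, lambda, and the helper names solve_one and observed_rmse are all assumptions I picked for illustration.

```r
set.seed(1)

# Toy ratings matrix (4 users x 3 movies); NA marks a rating we have not observed
R <- matrix(c(5, 3, NA,
              4, NA, 1,
              NA, 2, 5,
              1, 1, 4), nrow = 4, byrow = TRUE)
k <- 2        # latent factors (tiny on purpose)
lambda <- 0.1 # ridge penalty, playing the role of reg_param

U <- matrix(rnorm(nrow(R) * k), nrow(R), k) # user factors, random start
V <- matrix(rnorm(ncol(R) * k), ncol(R), k) # movie factors, random start

# Ridge least-squares solve for one user (or movie), with the other side held fixed
solve_one <- function(fixed, ratings) {
  seen <- !is.na(ratings)
  A <- fixed[seen, , drop = FALSE]
  solve(crossprod(A) + lambda * diag(k), crossprod(A, ratings[seen]))
}

# RMSE over the observed cells only
observed_rmse <- function() sqrt(mean((R - U %*% t(V))^2, na.rm = TRUE))
rmse_before <- observed_rmse()

for (iter in 1:10) {
  # Fix V: every user's factors are an independent problem
  for (u in seq_len(nrow(R))) U[u, ] <- solve_one(V, R[u, ])
  # Fix U: same for every movie
  for (m in seq_len(ncol(R))) V[m, ] <- solve_one(U, R[, m])
}
rmse_after <- observed_rmse()
c(before = rmse_before, after = rmse_after)
```

The two inner loops are the part a cluster can split up: no user's solve depends on any other user's result, so each one could run on a different worker.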
I time the training so I can compare it to Project 3.
start <- Sys.time()
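als_model <- ml_als(
  train,
  rating_col = "rating",
  user_col = "userId",
  item_col = "movieId",
  rank = 20,                   # 20 latent factors, same as Project 3's k
  max_iter = 10,               # number of alternating passes
  reg_param = 0.1,             # regularization strength
  cold_start_strategy = "drop" # drop predictions for users/movies not seen in training
)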
train_time <- Sys.time() - start
train_time
## Time difference of 11.39835 secs
Predict on the test set and compute RMSE, the same metric as Project 3.
predictions <- ml_predict(als_model, test)
rmse <- ml_regression_evaluator(
  predictions,
  label_col = "rating",
  prediction_col = "prediction",
  metric_name = "rmse"
)
rmse
## [1] 0.81918
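For reference, here is a minimal base R sketch of what an RMSE calculation does, on made-up numbers. The helper name rmse_by_hand and the vectors are hypothetical, not part of sparklyr. It also shows why I set cold_start_strategy = "drop" above: with Spark's default setting, a test row whose user or movie never appeared in training gets a NaN prediction, and a single NaN turns the whole metric into NaN.

```r
# Hypothetical helper: root mean squared error between two numeric vectors
rmse_by_hand <- function(actual, predicted) {
  sqrt(mean((actual - predicted)^2))
}

actual    <- c(4, 3, 5, 2)    # made-up ratings
predicted <- c(3.5, 3, 4, 3)  # made-up predictions

# Errors are 0.5, 0, 1, -1; squares are 0.25, 0, 1, 1; mean is 0.5625; root is 0.75
rmse_by_hand(actual, predicted)

# One NaN prediction (a cold-start row) makes the result NaN
rmse_by_hand(c(actual, 4), c(predicted, NaN))
```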
| Method | Data | RMSE |
|---|---|---|
| Project 3: Classic SVD | 1K users, subsampled | 0.972 |
| Project 3: SGD-SVD | 1K users, subsampled | 0.989 |
| Project 5: Spark ALS | Full 10M ratings | 0.819 |
The Spark ALS model worked from the full 10M dataset, training on the 80% split and scoring a test RMSE of 0.819. That beats Project 3’s 0.972 and 0.989. I think the reason is data. Project 3 had to shrink to 1,000 users to run in plain R, and less data means worse predictions. Spark trained on roughly 8 million ratings, so it learned more and predicted better. The test sets are not the same, so this is not a strict like-for-like comparison, but Spark did something my plain R code could not and got a better result doing it.
But Spark is not free. Setup takes longer and the code is more involved. For a small dataset like Project 3’s sample, plain R is faster and simpler, because you skip the overhead of starting a Spark session and moving data into it.
So when is Spark worth it? Not at Project 3’s size. A matrix of a few thousand users fits in memory and runs fine in R. Spark starts to pay off when the data no longer fits on one machine, or when training in R gets too slow to iterate on. For MovieLens, my guess is that the shift happens somewhere between the 1M and 10M versions. Below that, the overhead costs more than it saves.
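A rough back-of-the-envelope check on the "fits in memory" point, as a sketch. It assumes the ratings are stored as a dense numeric matrix at 8 bytes per cell, and the MovieLens 10M counts of about 70,000 users and 10,700 movies are approximations I am assuming here, not numbers computed in this project. The helper name dense_gib is hypothetical.

```r
# Size in GiB of a dense numeric matrix (8 bytes per double)
dense_gib <- function(n_users, n_movies) n_users * n_movies * 8 / 1024^3

small <- dense_gib(1000, 800)     # Project 3's sample: about 6 MB
full  <- dense_gib(70000, 10700)  # roughly MovieLens 10M: several GiB (approximate counts)

c(project3_sample = small, movielens_10m = full)
```

A sparse representation would shrink the second number a lot, so this is an upper bound on storage rather than a hard limit on what R can do.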
spark_disconnect(sc)