The purpose of this project is to build a new interactive live recommendation system using a large dataset.
We use distributed computing tools to mimic a Spark cluster, specifically packages and platforms such as sparklyr, Databricks and H2O. Our goal is to produce a quality set of recommendations by extracting rankings from a large movie database.
Our dataset contains movie ratings from grouplens.org. We picked a stable benchmark dataset containing 1 million ratings from about 6,000 users on about 4,000 movies. The dataset dates back to February 2003.
https://grouplens.org/datasets/movielens/1m/
library(data.table)
library(sparklyr)
library(dplyr)
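# ratings.dat separates fields with "::"; splitting on a single ":" leaves
# empty columns between the real fields (V2, V4, V6 in the output)
ratings <- fread("data\\1m\\ratings.dat", header = FALSE, sep = ":")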
head(ratings)
## V1 V2 V3 V4 V5 V6 V7
## 1: 1 NA 1193 NA 5 NA 978300760
## 2: 1 NA 661 NA 3 NA 978302109
## 3: 1 NA 914 NA 3 NA 978301968
## 4: 1 NA 3408 NA 4 NA 978300275
## 5: 1 NA 2355 NA 5 NA 978824291
## 6: 1 NA 1197 NA 3 NA 978302268
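The NA columns above come from the two-character "::" delimiter being split on a single ":". As a minimal sketch of an alternative, assuming every line has the form userId::movieId::rating::timestamp, the delimiter can be rewritten to a tab before parsing. The two sample lines are taken from the output above; `ratings_demo` and the column names are illustrative choices, not part of the original pipeline.

```r
library(data.table)

# Two raw lines in the MovieLens 1M layout (values from the output above)
raw_lines <- c("1::1193::5::978300760",
               "1::661::3::978302109")

# Replace the two-character delimiter with a tab, then parse the text directly
ratings_demo <- fread(text = gsub("::", "\t", raw_lines, fixed = TRUE),
                      sep = "\t", header = FALSE,
                      col.names = c("userId", "movieId", "rating", "timestamp"))

print(ratings_demo)
```

With this approach there are no filler columns to delete, only the timestamp if it is not needed.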
movies <- fread("data\\large\\movies.csv", header = TRUE, sep = ",")
head(movies)
## movieId title
## 1: 1 Toy Story (1995)
## 2: 2 Jumanji (1995)
## 3: 3 Grumpier Old Men (1995)
## 4: 4 Waiting to Exhale (1995)
## 5: 5 Father of the Bride Part II (1995)
## 6: 6 Heat (1995)
## genres
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2: Adventure|Children|Fantasy
## 3: Comedy|Romance
## 4: Comedy|Drama|Romance
## 5: Comedy
## 6: Action|Crime|Thriller
Remove the empty filler columns and the timestamp column, name the remaining columns, and convert them to numeric types.
ratings$V2 <- NULL
ratings$V4 <- NULL
ratings$V6 <- NULL
ratings$V7 <- NULL
colnames(ratings) <- c("userId","movieId","rating")
# converting columns to datatypes
ratings$userId <- as.numeric(ratings$userId)
ratings$movieId <- as.numeric(ratings$movieId)
ratings$rating <- as.numeric(ratings$rating)
sc <- spark_connect(master = "local")
Copy the ratings data frame (userId, movieId and rating) into Spark, then split it into training and test partitions. Movie titles and genres are joined back in after prediction.
data_ratings <- sdf_copy_to(sc, ratings, overwrite = TRUE)
# create test and training partitions
partitions <- data_ratings %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099)
training <- partitions$training
test <- partitions$test
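As far as we understand, the Spark split assigns rows at random, so the partitions come out close to 75/25 rather than exactly so. The sketch below is a plain-R analogue on made-up data, meant only to illustrate that behavior; it does not call Spark, and `ratings_sim`, `part` and `split_sim` are hypothetical names.

```r
set.seed(1099)

# Made-up ratings table standing in for the real data
n <- 1000
ratings_sim <- data.frame(userId  = sample(1:50,  n, replace = TRUE),
                          movieId = sample(1:200, n, replace = TRUE),
                          rating  = sample(1:5,   n, replace = TRUE))

# Assign each row independently to a partition with the requested weights
part <- sample(c("training", "test"), size = n, replace = TRUE,
               prob = c(0.75, 0.25))
split_sim <- split(ratings_sim, part)

# Observed share of rows that landed in the training partition
training_share <- nrow(split_sim$training) / n
print(training_share)
```

On the real Spark tables, a call such as sdf_nrow(training) gives the corresponding row count for checking the split.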
Fit a model with sparklyr's built-in ml_als_factorization function. We capped the number of iterations at 4 because one of us hit stack overflow errors on a low-resource machine.
model <- ml_als_factorization(training,
                              rating.column = "rating",
                              user.column = "userId",
                              item.column = "movieId",
                              regularization.parameter = 0.01,
                              iter.max = 4)
summary(model)
## Length Class Mode
## item.factors 11 data.frame list
## user.factors 11 data.frame list
## data 2 spark_jobj environment
## ml.options 6 ml_options list
## model.parameters 2 -none- list
## .call 7 -none- call
## .model 2 spark_jobj environment
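The item.factors and user.factors tables hold the latent vectors that ALS learns. Their 11 columns are consistent with one id column plus ten factors, ten being Spark's default rank. A predicted rating is the dot product of a user's vector with a movie's vector. The sketch below shows that arithmetic with made-up rank-3 vectors; the numbers and the names `user_factor` and `item_factors` are hypothetical and not taken from the fitted model.

```r
# Made-up latent vector for one user (rank 3 for readability)
user_factor <- c(1.2, 0.4, -0.3)

# Made-up latent vectors for two movies, one row per movie
item_factors <- rbind(movie_a = c(1.5, 0.2, 0.1),
                      movie_b = c(-0.5, 1.0, 2.0))

# Predicted rating for each movie = dot product of movie row and user vector
predicted <- as.vector(item_factors %*% user_factor)
names(predicted) <- rownames(item_factors)

print(predicted)
```

Movies whose vectors point in a similar direction to the user's vector score higher, which is what the ranking step later relies on.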
Create predictions for the test dataset and save them as a CSV file. Our Shiny UI reads this file to serve interactive movie recommendations to users.
# score the test partition with the fitted Spark model and pull the result into R
predictions <- model$.model %>%
  invoke("transform", spark_dataframe(test)) %>%
  collect()
# add movie title and genre to the predictions with a single update join on movieId
setDT(predictions)
setDT(movies)
predictions_final <- predictions[movies, `:=`(title = i.title, genre = i.genres), on = "movieId"]
head(predictions_final)
## userId movieId rating prediction title
## 1: 9 1 5 4.002312 Toy Story (1995)
## 2: 21 1 3 3.874613 Toy Story (1995)
## 3: 34 1 5 4.646433 Toy Story (1995)
## 4: 51 1 5 4.244669 Toy Story (1995)
## 5: 76 1 5 4.282124 Toy Story (1995)
## 6: 92 1 4 3.457653 Toy Story (1995)
## genre
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2: Adventure|Animation|Children|Comedy|Fantasy
## 3: Adventure|Animation|Children|Comedy|Fantasy
## 4: Adventure|Animation|Children|Comedy|Fantasy
## 5: Adventure|Animation|Children|Comedy|Fantasy
## 6: Adventure|Animation|Children|Comedy|Fantasy
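One way to judge the predictions is the root mean squared error (RMSE) between the rating and prediction columns. The sketch below is a minimal illustration using only the six rows printed above, plus one hypothetical row with a NaN prediction. Spark's ALS can return NaN for users or movies that were absent from the training partition, so such rows are dropped before scoring. `eval_demo` and `scored` are illustrative names, and the result says nothing about the full test set.

```r
# The six (rating, prediction) pairs shown in the output above
eval_demo <- data.frame(
  rating     = c(5, 3, 5, 5, 5, 4),
  prediction = c(4.002312, 3.874613, 4.646433, 4.244669, 4.282124, 3.457653)
)

# Hypothetical extra row: a user or movie unseen in training yields NaN
eval_demo <- rbind(eval_demo, data.frame(rating = 4, prediction = NaN))

# Drop rows without a usable prediction, then compute RMSE on the rest
scored <- eval_demo[!is.nan(eval_demo$prediction), ]
rmse <- sqrt(mean((scored$rating - scored$prediction)^2))

print(rmse)
```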
# write predictions to csv
write.csv(predictions_final, file = "output\\predictions.csv",row.names=FALSE, na="")
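Once the predictions are written out, recommending movies for one user reduces to a filter and a sort on the prediction column. The sketch below shows one way the lookup could work, assuming the file keeps the userId, title and prediction columns. The rows in `preds_demo` are made up, and `top_n_for_user` is a hypothetical helper, not a function from the Shiny app.

```r
library(data.table)

# Made-up predictions in the same shape as the CSV output
preds_demo <- data.table(
  userId     = c(9, 9, 9, 21, 21),
  title      = c("Movie A", "Movie B", "Movie C", "Movie A", "Movie D"),
  prediction = c(4.0, 4.6, 3.1, 3.9, 4.2)
)

# Hypothetical helper: the n highest-predicted movies for one user
top_n_for_user <- function(preds, user, n = 2) {
  head(preds[userId == user][order(-prediction)], n)
}

top_9 <- top_n_for_user(preds_demo, user = 9)
print(top_9)
```

A Shiny server function could call a helper like this with the user id selected in the UI and render the returned table.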
The Shiny UI uses the predictions output file to recommend the top movies for any given user. Here is the Shiny URL: