
1.0 Objective

The purpose of this project is to build a new interactive live recommendation system using a large dataset.

2.0 Implementation

We will use distributed computing tools to emulate a Spark cluster, specifically packages and platforms such as sparklyr, Databricks, and H2O. Our goal is to produce a quality set of recommendations by extracting rankings from a large movie database.

3.0 Data Sourcing and Loading

Our dataset contains movie ratings from grouplens.org. We picked the stable MovieLens 1M benchmark dataset, which contains 1 million ratings from about 6,000 users on about 4,000 movies. The dataset was released in February 2003.

https://grouplens.org/datasets/movielens/1m/

library(data.table)
library(sparklyr)
library(dplyr)

3.1 Read ratings and movies data

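# ratings.dat delimits fields with "::", so splitting on ":" leaves the empty columns V2, V4, and V6
ratings <- fread("data\\1m\\ratings.dat", header = FALSE, sep = ':')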
head(ratings)

##    V1 V2   V3 V4 V5 V6        V7
## 1:  1 NA 1193 NA  5 NA 978300760
## 2:  1 NA  661 NA  3 NA 978302109
## 3:  1 NA  914 NA  3 NA 978301968
## 4:  1 NA 3408 NA  4 NA 978300275
## 5:  1 NA 2355 NA  5 NA 978824291
## 6:  1 NA 1197 NA  3 NA 978302268
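The raw ratings.dat file stores each record as UserID::MovieID::Rating::Timestamp, which is why reading it with a single ":" separator produces the empty columns seen above. As a minimal sketch of an alternative, the fields can be split on the full "::" delimiter in base R. The sample lines below are rebuilt from the first rows of the output above and stand in for the real file; the names raw_lines, field_matrix, and ratings_sample are illustrative only.

```r
# Sample lines in the ratings.dat layout (UserID::MovieID::Rating::Timestamp),
# rebuilt from the first rows printed above; the real file is not read here.
raw_lines <- c("1::1193::5::978300760",
               "1::661::3::978302109",
               "1::914::3::978301968")

# Split each line on the literal "::" delimiter (fixed = TRUE disables regex)
fields <- strsplit(raw_lines, "::", fixed = TRUE)

# Stack the pieces into a character matrix, one row per rating
field_matrix <- do.call(rbind, fields)

# Build a numeric data frame with descriptive column names
ratings_sample <- data.frame(
  userId    = as.numeric(field_matrix[, 1]),
  movieId   = as.numeric(field_matrix[, 2]),
  rating    = as.numeric(field_matrix[, 3]),
  timestamp = as.numeric(field_matrix[, 4])
)

print(ratings_sample)
```

This approach avoids the placeholder columns entirely, at the cost of being slower than fread on the full file.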

movies <- fread("data\\large\\movies.csv", header = TRUE, sep = ',')
head(movies)

##    movieId                              title
## 1:       1                   Toy Story (1995)
## 2:       2                     Jumanji (1995)
## 3:       3            Grumpier Old Men (1995)
## 4:       4           Waiting to Exhale (1995)
## 5:       5 Father of the Bride Part II (1995)
## 6:       6                        Heat (1995)
##                                         genres
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2:                  Adventure|Children|Fantasy
## 3:                              Comedy|Romance
## 4:                        Comedy|Drama|Romance
## 5:                                      Comedy
## 6:                       Action|Crime|Thriller

3.2 Data Munging

Remove the empty separator columns and the timestamp column, assign column names, and convert the columns to numeric data types.

ratings$V2 <- NULL
ratings$V4 <- NULL
ratings$V6 <- NULL
ratings$V7 <- NULL

colnames(ratings) <- c("userId","movieId","rating")

# convert each column to numeric
ratings$userId <- as.numeric(ratings$userId)
ratings$movieId <- as.numeric(ratings$movieId)
ratings$rating <- as.numeric(ratings$rating)

4.0 Connecting to Spark

sc <- spark_connect(master = "local")

4.1 Copy Dataframe into Spark

Copy the ratings data frame (userId, movieId, and rating) into Spark, then split it into training (75%) and test (25%) partitions.

data_ratings <- sdf_copy_to(sc, ratings, overwrite = TRUE)

# create test and training partitions
partitions <- data_ratings %>% sdf_partition(training = 0.75, test = 0.25, seed = 1099)

training <- partitions$training
test <- partitions$test
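To illustrate the idea behind the weighted random split, here is a minimal base R sketch that assigns each row to training or test by drawing a uniform random number. It is not how sparklyr implements the split internally; the toy data frame and the names toy, in_training, training_toy, and test_toy are hypothetical. As with a random split in Spark, the resulting partition sizes are only approximately 75% and 25%.

```r
set.seed(1099)

# Hypothetical toy ratings: 100 users x 10 movies with random ratings 1-5
toy <- data.frame(
  userId  = rep(1:100, each = 10),
  movieId = rep(1:10, times = 100),
  rating  = sample(1:5, 1000, replace = TRUE)
)

# Draw one uniform number per row; rows below 0.75 go to training
in_training  <- runif(nrow(toy)) < 0.75
training_toy <- toy[in_training, ]
test_toy     <- toy[!in_training, ]

# The two partitions are disjoint and together cover every row
cat("training rows:", nrow(training_toy), "\n")
cat("test rows:", nrow(test_toy), "\n")
```

Fixing the seed makes the split reproducible, which is the role of the seed argument in the call above.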

5.0 Create a Model

Fit an alternating least squares (ALS) matrix factorization model using the built-in ml_als_factorization function. We limited the number of iterations to 4 because one of us was getting stack overflow errors on a low-resource machine.

model <- ml_als_factorization(training,
                              rating.column = "rating",
                              user.column = "userId",
                              item.column = "movieId",
                              regularization.parameter = 0.01,
                              iter.max = 4)

summary(model)

##                  Length Class      Mode       
## item.factors     11     data.frame list       
## user.factors     11     data.frame list       
## data              2     spark_jobj environment
## ml.options        6     ml_options list       
## model.parameters  2     -none-     list       
## .call             7     -none-     call       
## .model            2     spark_jobj environment
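For readers unfamiliar with ALS, the following is a simplified base R sketch of the technique on a tiny dense matrix. It is an illustration under stated assumptions, not Spark's implementation: the toy matrix R, the rank k = 2, and all variable names are hypothetical, and only the regularization value (0.01) and iteration count (4) mirror the call above. Each pass fixes the item factors and solves a small ridge regression for every user, then fixes the user factors and does the same for every item.

```r
set.seed(1)

# Hypothetical 5-user x 4-movie rating matrix; NA marks an unobserved rating
R <- matrix(c(5,  3, NA, 1,
              4, NA, NA, 1,
              1,  1, NA, 5,
              1, NA, NA, 4,
              NA, 1,  5, 4), nrow = 5, byrow = TRUE)
observed <- !is.na(R)

k      <- 2      # number of latent factors (assumed for this sketch)
lambda <- 0.01   # regularization strength, as in the model call above
n_iter <- 4      # iterations, as in the model call above

# Random starting factors: one row per user (U) and per movie (V)
U <- matrix(runif(nrow(R) * k), nrow(R), k)
V <- matrix(runif(ncol(R) * k), ncol(R), k)

# Regularized squared error over the observed ratings only
objective <- function(U, V) {
  err <- (R - U %*% t(V))[observed]
  sum(err^2) + lambda * (sum(U^2) + sum(V^2))
}

obj_history <- numeric(n_iter)
for (it in seq_len(n_iter)) {
  # Fix V, solve a ridge regression for each user's factor vector
  for (u in seq_len(nrow(R))) {
    obs <- which(observed[u, ])
    Vo  <- V[obs, , drop = FALSE]
    U[u, ] <- solve(crossprod(Vo) + lambda * diag(k), crossprod(Vo, R[u, obs]))
  }
  # Fix U, solve a ridge regression for each movie's factor vector
  for (i in seq_len(ncol(R))) {
    obs <- which(observed[, i])
    Uo  <- U[obs, , drop = FALSE]
    V[i, ] <- solve(crossprod(Uo) + lambda * diag(k), crossprod(Uo, R[obs, i]))
  }
  obj_history[it] <- objective(U, V)
}

# Predicted rating for every user-movie pair is the dot product of their factors
pred <- U %*% t(V)
print(round(pred, 2))
```

Because every update exactly minimizes the objective for one block of factors, the objective never increases from one iteration to the next, so a small iteration count trades accuracy for lower resource use.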

6.0 Create the prediction dataframe

Generate predictions for the test partition, attach movie titles and genres, and save the result as a CSV file. Our Shiny UI uses this file to serve interactive movie recommendations to users.

predictions <- model$.model %>%
  invoke("transform", spark_dataframe(test)) %>%
  collect()

# add movie title and genre to the predictions table (update join on movieId)
predictions_final <- setDT(predictions)[setDT(movies), `:=`(title = i.title, genre = i.genres), on = "movieId"]

head(predictions_final)

##    userId movieId rating prediction            title
## 1:      9       1      5   4.002312 Toy Story (1995)
## 2:     21       1      3   3.874613 Toy Story (1995)
## 3:     34       1      5   4.646433 Toy Story (1995)
## 4:     51       1      5   4.244669 Toy Story (1995)
## 5:     76       1      5   4.282124 Toy Story (1995)
## 6:     92       1      4   3.457653 Toy Story (1995)
##                                          genre
## 1: Adventure|Animation|Children|Comedy|Fantasy
## 2: Adventure|Animation|Children|Comedy|Fantasy
## 3: Adventure|Animation|Children|Comedy|Fantasy
## 4: Adventure|Animation|Children|Comedy|Fantasy
## 5: Adventure|Animation|Children|Comedy|Fantasy
## 6: Adventure|Animation|Children|Comedy|Fantasy

# write predictions to csv
write.csv(predictions_final, file = "output\\predictions.csv",row.names=FALSE, na="")

7.0 Summary

We use the predictions output file to recommend the top movies for any given user through the Shiny UI. The Shiny app is available at:

https://vizualize.shinyapps.io/SI_Data643_Final_Project/
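As a rough sketch of how top movies could be picked for a user from a predictions table, the base R snippet below filters by user and sorts by predicted rating. The Shiny app's actual logic is not shown in this report, so the function top_n_for_user, the toy data frame predictions_toy, and its prediction values are all hypothetical.

```r
# Hypothetical predictions table with the same kind of columns as predictions.csv
predictions_toy <- data.frame(
  userId     = c(9, 9, 9, 9, 21, 21),
  title      = c("Toy Story (1995)", "Jumanji (1995)", "Heat (1995)",
                 "Grumpier Old Men (1995)", "Toy Story (1995)", "Heat (1995)"),
  prediction = c(4.0, 3.1, 4.6, 2.5, 3.9, 4.2),
  stringsAsFactors = FALSE
)

# Return the n rows with the highest predicted rating for one user
top_n_for_user <- function(predictions, user_id, n = 3) {
  user_rows <- predictions[predictions$userId == user_id, ]
  ranked    <- user_rows[order(user_rows$prediction, decreasing = TRUE), ]
  head(ranked, n)
}

top_movies <- top_n_for_user(predictions_toy, user_id = 9, n = 3)
print(top_movies$title)
```

In an interactive app, user_id and n would typically come from input controls, with the ranked table rendered as the output.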

DATA 643 Final Project | Large Dataset based Recommender System

Jason Joseph, Srini Illapani

July 16, 2017