Part I: Exploratory Analysis

Read in the data

We start by loading the libraries we need and making sure Spark is installed locally.

library(recommenderlab)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(DT)
library(mltools)
library(knitr)
library(grid)
library(gridExtra)
library(corrplot)
library(qgraph)
library(methods)
library(Matrix)
library(sparklyr)

#make sure there's a spark local install 
spark_install(version = "2.4.3")

Next, we set up sparklyr and its environment variables.

Sys.setenv(SPARK_HOME = "/usr/local/spark")
#Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf.cloudera.hdfs')
#Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf.cloudera.yarn')
 
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"

sc <- spark_connect(master="local", config=config, version = '2.4.3')

Using Spark to build the sets

data("MovieLense")

Check whether there are any abnormal ratings in the data.

table(MovieLense@data@x[] > 5)
## 
## FALSE 
## 99392

No rating exceeds 5. Since valid ratings run from 1 to 5, we can use 0 to represent a missing rating (NA).
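The check above only tests the upper bound. As a minimal sketch, both bounds can be checked at once. The vector `ratings_x` is a hypothetical stand-in for `MovieLense@data@x`, and `ratings_in_range` is an illustrative helper, not part of the original analysis.

```r
# Hypothetical stand-in for the stored (non-missing) ratings,
# e.g. MovieLense@data@x in the real analysis
ratings_x <- c(5, 3, 4, 1, 2, 5, 4)

# Illustrative helper: TRUE only if every value lies on the expected scale
ratings_in_range <- function(x, lower = 1, upper = 5) {
  all(x >= lower & x <= upper)
}

ratings_in_range(ratings_x)
range(ratings_x)  # smallest and largest stored rating
```

If this returns TRUE on the real data, 0 is safe to use as the "not rated" marker because it cannot collide with a genuine rating.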

# convert the rating matrix to long format: one row per (user, item, rating)
movie_df <- as(MovieLense, 'data.frame')
movie_df$user <- as.numeric(as.character(movie_df$user))
movie_df$item <- as.character(movie_df$item)
# reshape to wide format: one row per user, one column per movie
movie_mx <- spread(movie_df, item, rating)
# movies a user has not rated become 0
movie_mx[is.na(movie_mx)] <- 0
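To make the reshaping step concrete, here is a small sketch on a hypothetical toy data frame (`toy_df` and its values are made up for illustration). It uses `pivot_wider()`, which supersedes `spread()` in tidyr 1.0.0 and later, and fills unrated cells with 0 in the same call. This is an alternative way to get the same shape, not the code used above.

```r
library(tidyr)

# Hypothetical long-format ratings: one row per (user, item) pair
toy_df <- data.frame(
  user   = c(1, 1, 2, 3),
  item   = c("Movie A", "Movie B", "Movie A", "Movie C"),
  rating = c(5, 3, 4, 2),
  stringsAsFactors = FALSE
)

# One row per user, one column per item; unrated cells are filled with 0
toy_mx <- pivot_wider(
  toy_df,
  names_from  = item,
  values_from = rating,
  values_fill = list(rating = 0)
)

toy_mx
```

The result has one row per user and one column per movie, which is the layout copied to Spark in the next step.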

# copy data to a Spark table
movie_tbl <- sdf_copy_to(sc, movie_mx, "movie_DF", overwrite = TRUE)

movies <- paste(colnames(movie_mx)[-1])

# fit a PCA model on columns 2 to 1000 (the first 999 movie columns)
pca_model <- ml_pca(movie_tbl, features = colnames(movie_tbl)[2:1000])
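For readers without a Spark connection, the same idea can be sketched locally with base R's `prcomp()`. This is a swapped-in technique on a hypothetical toy matrix (`toy_ratings` is made up), not the Spark model fitted above. The `rotation` matrix plays a role similar to `pca_model$pc`: one row per movie and one column per component.

```r
# Hypothetical 6 x 3 users-by-movies matrix (0 = not rated)
toy_ratings <- matrix(
  c(5, 3, 0,
    4, 0, 1,
    0, 5, 4,
    1, 0, 5,
    3, 4, 0,
    0, 2, 3),
  nrow = 6, byrow = TRUE,
  dimnames = list(NULL, c("m1", "m2", "m3"))
)

# Local PCA on the centered (unscaled) columns
toy_pca <- prcomp(toy_ratings, center = TRUE, scale. = FALSE)

# Loadings: one row per movie, one column per principal component
toy_pca$rotation

# Proportion of variance explained by each component
var_explained <- toy_pca$sdev^2 / sum(toy_pca$sdev^2)
var_explained
```

The proportions sum to 1 and are sorted in decreasing order, so the first few entries show how much of the rating variance the leading components capture.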

Let’s plot the first two principal components and analyze them.

pca_df <- as.data.frame(pca_model$pc)
# set the point color outside aes() so it is a fixed color, not a mapped variable
ggplot(pca_df, aes(x = PC1, y = PC2)) +
  geom_point(size = 2, alpha = 0.6, color = "blue") +
  labs(title = "Where the Movies Fall on the First Two Principal Components", x = "PC1", y = "PC2")

To use SVD we would need to take the implementation from the H2O library, but I ran into version compatibility problems with it and could not go further.
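For a matrix small enough to fit in memory, one option that avoids the H2O dependency is base R's `svd()`. The sketch below is only an illustration on a hypothetical toy matrix (`toy_ratings` and `k` are made-up values). It is neither distributed nor the H2O approach mentioned above.

```r
# Hypothetical 4 x 3 users-by-movies matrix (0 = not rated)
toy_ratings <- matrix(
  c(5, 3, 0,
    4, 0, 1,
    0, 5, 4,
    1, 0, 5),
  nrow = 4, byrow = TRUE
)

# Singular value decomposition: toy_ratings = U %*% diag(d) %*% t(V)
toy_svd <- svd(toy_ratings)

# Rank-k approximation built from the k largest singular values
k <- 2
approx_k <- toy_svd$u[, 1:k, drop = FALSE] %*%
  diag(toy_svd$d[1:k], nrow = k) %*%
  t(toy_svd$v[, 1:k, drop = FALSE])

# Using all singular values recovers the original matrix up to rounding error
full <- toy_svd$u %*% diag(toy_svd$d, nrow = length(toy_svd$d)) %*% t(toy_svd$v)
max(abs(full - toy_ratings))

# Frobenius error of the rank-k approximation
norm(toy_ratings - approx_k, "F")
```

With `k = 2` on a rank-3 matrix, the Frobenius error of the approximation equals the one discarded singular value, which is a quick sanity check on the truncation.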

In conclusion, I think there needs to be a compelling reason, such as insufficient performance in R or Python, before moving to Scala to fit the ML architecture to a company's needs. The extra overhead varies in weight and can sometimes stall progress.