We start by loading the libraries we need; the MovieLense ratings data comes bundled with the recommenderlab package.
library(recommenderlab)
library(data.table)
library(dplyr)
library(tidyr)
library(ggplot2)
library(stringr)
library(DT)
library(mltools)
library(knitr)
library(grid)
library(gridExtra)
library(corrplot)
library(qgraph)
library(methods)
library(Matrix)
library(sparklyr)
#make sure there is a local Spark installation available
spark_install(version = "2.4.3")
Next, let’s set up sparklyr and its environment variables.
Sys.setenv(SPARK_HOME = "/usr/local/spark")
#Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf.cloudera.hdfs')
#Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf.cloudera.yarn')
#request four executors, each with four cores and 4G of memory
config <- spark_config()
config$spark.executor.instances <- 4
config$spark.executor.cores <- 4
config$spark.executor.memory <- "4G"
#connect to the local Spark instance
sc <- spark_connect(master = "local", config = config, version = '2.4.3')
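A quick sanity check that the session came up with the expected Spark version (spark_version() is a standard sparklyr helper):
#confirm the live connection reports the requested Spark version
spark_version(sc)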
data("MovieLense")
table(MovieLense@data@x[] > 5)
##
## FALSE
## 99392
As we can see, all ratings fall between 1 and 5, so we can treat 0 as the null value, i.e. NA.
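To double-check that, we can inspect the range of the stored ratings directly (a minimal sketch; the @data@x slot used above holds the non-zero entries of the sparse rating matrix):
#the stored entries are the actual ratings, so their range should be 1 to 5
range(MovieLense@data@x)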
#convert the rating matrix to a long data frame: one row per (user, item, rating)
movie_df <- as(MovieLense, 'data.frame')
movie_df$user <- as.numeric(as.character(movie_df$user))
movie_df$item <- as.character(movie_df$item)
#pivot to a wide user-by-movie matrix, one column per movie
movie_mx <- spread(movie_df, item, rating)
#unrated entries become 0
movie_mx[is.na(movie_mx)] <- 0
#copy the wide matrix into a Spark table
movie_tbl <- sdf_copy_to(sc, movie_mx, "movie_DF", overwrite = TRUE)
#movie titles are every column except user
movies <- colnames(movie_mx)[-1]
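Before modeling, it is worth verifying that the copy landed as expected; a minimal check using sparklyr's sdf_nrow() and sdf_ncol() helpers:
#one row per user and one column per movie, plus the user column itself
sdf_nrow(movie_tbl)
sdf_ncol(movie_tbl)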
#fit a Spark ML PCA model on the first 999 movie columns
pca_model <- ml_pca(movie_tbl, features = colnames(movie_tbl)[2:1000])
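Before plotting, we can look at how much variance the leading components capture; a minimal sketch, assuming the fitted model exposes an explained_variance vector (available when running against Spark 2.0 or later):
#proportion of variance explained by each principal component
head(pca_model$explained_variance)
#cumulative share captured by the leading components
head(cumsum(pca_model$explained_variance))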
Let’s plot the first two principal components and analyze them.
pca_df <- as.data.frame(pca_model$pc)
#map the components to the axes; color is a fixed aesthetic, not a mapping
ggplot(pca_df, aes(x = PC1, y = PC2)) +
  geom_point(color = "blue", size = 2, alpha = 0.6) +
  labs(title = "Where the Movies Fall on the First Two Principal Components",
       x = "PC1", y = "PC2")
To use SVD here we would need to pull it from the H2O library, but version incompatibilities kept it from running, so I could not go further down that path.
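As a fallback, the same kind of decomposition can be sketched with base R's svd() on the dense matrix we already built, sidestepping H2O entirely (a minimal sketch, assuming movie_mx fits in local memory):
#truncated SVD on the user-by-movie matrix, dropping the user id column
rating_mx <- as.matrix(movie_mx[, -1])
svd_fit <- svd(rating_mx, nu = 10, nv = 10)
#the singular values show how much structure each latent factor captures
head(svd_fit$d)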
In conclusion, I think there has to be a concrete extra reason, such as a lack of performance in R or Python, before turning to Scala to fit the ML architecture to a company's needs. The extra overhead varies in weight and can sometimes stall progress.