10 Nov 2016
install.packages("sparklyr")
sparklyr can install different versions of Spark:
library(sparklyr)
tail(spark_available_versions(), 10)
##    spark hadoop                                                       install
## 15 1.6.0    2.3  spark_install(version = "1.6.0", hadoop_version = "2.3")
## 16 1.6.0   cdh4 spark_install(version = "1.6.0", hadoop_version = "cdh4")
## 17 2.0.0    2.7  spark_install(version = "2.0.0", hadoop_version = "2.7")
## 18 2.0.0    2.6  spark_install(version = "2.0.0", hadoop_version = "2.6")
## 19 2.0.0    2.4  spark_install(version = "2.0.0", hadoop_version = "2.4")
## 20 2.0.0    2.3  spark_install(version = "2.0.0", hadoop_version = "2.3")
## 21 2.0.1    2.7  spark_install(version = "2.0.1", hadoop_version = "2.7")
## 22 2.0.1    2.6  spark_install(version = "2.0.1", hadoop_version = "2.6")
## 23 2.0.1    2.4  spark_install(version = "2.0.1", hadoop_version = "2.4")
## 24 2.0.1    2.3  spark_install(version = "2.0.1", hadoop_version = "2.3")
On a cluster that already has Spark installed, you have to set the environment variables that say where Spark and Hadoop live. And if we don't have permission to install sparklyr system-wide, we can create our own directory for R libraries. An example:
# Put a user-writable library directory first in the search path
.libPaths(c("~/R/x86_64-pc-linux-gnu-library/3.3/", .libPaths()))
install.packages("sparklyr", lib = "~/R/x86_64-pc-linux-gnu-library/3.3/")

# Tell sparklyr where the cluster's Spark and Hadoop configuration live
Sys.setenv(SPARK_HOME = "/opt/cloudera/parcels/CDH/lib/spark")
Sys.setenv(HADOOP_CONF_DIR = '/etc/hadoop/conf.cloudera.hdfs')
Sys.setenv(YARN_CONF_DIR = '/etc/hadoop/conf.cloudera.yarn')

library(sparklyr)
library(dplyr)
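Before connecting, it can be worth confirming that those variables really point at existing directories, since a wrong path tends to surface later as a confusing connection error. A minimal sketch in base R; the helper name check_spark_env() is hypothetical (it is not part of sparklyr) and the variable names are simply the ones set above.

```r
# Hypothetical helper (not part of sparklyr): for each environment variable,
# report whether it is set and whether it points to an existing directory.
check_spark_env <- function(vars = c("SPARK_HOME", "HADOOP_CONF_DIR", "YARN_CONF_DIR")) {
  values <- Sys.getenv(vars, unset = NA)
  is_set <- !is.na(values) & nzchar(values)
  data.frame(
    variable = vars,
    value    = unname(values),
    is_set   = unname(is_set),
    exists   = unname(is_set & dir.exists(ifelse(is.na(values), "", values))),
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

print(check_spark_env())
```

Any row with exists equal to FALSE is worth fixing before calling spark_connect().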
Initializing a Spark context. It can be local, through YARN, or by giving the URL of a Spark master:
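# These are alternatives: pick the one that matches your setup
sc <- spark_connect(master = "local", version = "1.6.0")        # local mode
sc <- spark_connect(master = "yarn-client", version = "1.6.0")  # YARN
sc <- spark_connect(master = "spark://local:7077")              # Spark master URL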
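Once connected, sc is the handle that everything else goes through. As a rough sketch of a first session, assuming a local Spark 1.6.0 is available (for instance installed with spark_install()) and using mtcars purely as toy data; the names mtcars_tbl and by_cyl are illustrative:

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark 1.6.0 installation is available
sc <- spark_connect(master = "local", version = "1.6.0")

# Copy a small local data frame into Spark (toy data, for illustration only)
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# dplyr verbs are translated to Spark SQL; collect() brings the result back into R
by_cyl <- mtcars_tbl %>%
  group_by(cyl) %>%
  summarise(n = n(), mean_mpg = mean(mpg)) %>%
  collect()

print(by_cyl)

# Close the connection when done
spark_disconnect(sc)
```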
sparklyr gives access to the models implemented in MLlib using R syntax. It even lets you use formulas!
ml_logistic_regression()
ml_generalized_linear_regression()
ml_decision_tree()
ml_random_forest()
ml_gradient_boosted_trees()
ml_kmeans()
ml_pca()
ml_lda()
ml_survival_regression()
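To make the formula point concrete, here is a small sketch that fits a logistic regression from a formula. It assumes a local Spark 1.6.0 installation and uses mtcars as toy data; the choice of am ~ wt + hp is only an illustration, and other arguments of ml_logistic_regression() can differ between sparklyr versions.

```r
library(sparklyr)
library(dplyr)

# Assumes a local Spark 1.6.0 installation is available
sc <- spark_connect(master = "local", version = "1.6.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# The model is described with an ordinary R formula:
# transmission type (am) as a function of weight (wt) and horsepower (hp)
fit <- ml_logistic_regression(mtcars_tbl, am ~ wt + hp)
print(fit)

spark_disconnect(sc)
```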
Related to sparklyr is the rsparkling package, which converts a Spark DataFrame into an H2O Frame and lets you use the H2O Sparkling Water installation on a cluster. Careful: it only works with one specific version of H2O.
H2O integrates quite well with R, and it also reports more information about the fitted models than MLlib does.
You have to install version 3.10.0.6 of H2O:
# Remove previous versions of h2o R package
if ("package:h2o" %in% search()) detach("package:h2o", unload = TRUE)
if ("h2o" %in% rownames(installed.packages())) remove.packages("h2o")

# Next, we download R package dependencies
pkgs <- c("methods", "statmod", "stats", "graphics",
          "RCurl", "jsonlite", "tools", "utils")
for (pkg in pkgs) {
  if (!(pkg %in% rownames(installed.packages()))) install.packages(pkg)
}

# Download h2o package version 3.10.0.6
install.packages("h2o", type = "source",
                 repos = "http://h2o-release.s3.amazonaws.com/h2o/rel-turing/6/R")
library(devtools)
devtools::install_github("h2oai/sparkling-water", subdir = "/r/rsparkling")
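With both packages installed, the conversion looks roughly like the sketch below. It assumes the rsparkling API as it was published at the time (as_h2o_frame() taking the Spark connection and a Spark table), a local Spark 1.6.0, and a Sparkling Water version that matches the installed H2O; that version configuration is not shown here and should be checked against the rsparkling documentation. The data and the model are illustrative only.

```r
library(sparklyr)
library(rsparkling)
library(h2o)
library(dplyr)

# Assumes a local Spark 1.6.0 and a Sparkling Water build matching the installed H2O
sc <- spark_connect(master = "local", version = "1.6.0")
mtcars_tbl <- copy_to(sc, mtcars, "mtcars", overwrite = TRUE)

# Convert the Spark DataFrame into an H2O Frame
mtcars_hf <- as_h2o_frame(sc, mtcars_tbl)

# Fit a model with H2O on the converted frame (toy example)
fit <- h2o.glm(x = c("wt", "cyl"), y = "mpg", training_frame = mtcars_hf)
print(fit)

spark_disconnect(sc)
```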
sparklyr is the R equivalent of pyspark.
SparkR
[...] opening two Spark contexts
Tips