Automated Scoring Machine Learning Tool (ASMLT) with R, Spark, and H2O

Carlo Morales
01/25/2017

Motivation

  • ASMLT is a Shiny app I created last year
  • The original has pleasant visualizations but relies on weak ML modeling techniques (OLS and LDA)
  • Built a new version of ASMLT
    • ML models now include RF, GBM, and NN
  • Added a spell corrector

Topics

  • R Programming Language
  • Apache Spark
  • H2O
  • Relevant R packages
  • Demo of the Shiny app

R Programming Language

  • A popular open-source programming language for
    • Data Analysis
    • Statistical Modeling
    • Data Visualization
  • Easy to use
  • Historically, R has been limited to analyzing data sets that fit in memory

Apache Spark

  • Open-source computing engine
  • Supports distributed data storage and distributed computing
  • Currently provides APIs in Scala, Java, R, and Python
  • Has a built-in machine learning library (MLlib)

H2O

  • H2O is an open-source AI platform
  • H2O bills itself as the world’s leading open-source machine learning platform
  • H2O's scalable machine learning algorithms can be combined with the capabilities of Spark

sparklyr and rsparkling

  • sparklyr connects R to Spark
    • The package provides a complete dplyr backend
  • rsparkling provides an R interface to Sparkling Water, which runs H2O's machine learning algorithms on Spark

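Example code for connecting to Spark from R

# Install sparklyr from CRAN (one-time setup)
install.packages("sparklyr")

library(sparklyr)

# Download and install a local copy of Spark (one-time setup)
spark_install()

# Connect to a local Spark instance
sc <- spark_connect(master = "local")

# Copy the built-in iris data frame into Spark
my_tbl <- copy_to(sc, iris)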
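Once a table lives in Spark, the dplyr backend lets ordinary dplyr verbs run against it. The following is a minimal sketch, assuming sparklyr and dplyr are installed and a local Spark installation already exists (for example from spark_install()). The names iris_tbl and species_means are illustrative only and are not part of the app.

```r
library(sparklyr)
library(dplyr)

# Connect to a local Spark instance (assumes Spark is already installed)
sc <- spark_connect(master = "local")

# Copy iris into Spark; sparklyr replaces the dots in column names
# with underscores (Petal.Length becomes Petal_Length)
iris_tbl <- copy_to(sc, iris, overwrite = TRUE)

# These dplyr verbs are translated to Spark SQL and executed in Spark;
# collect() pulls the small summarized result back into R
species_means <- iris_tbl %>%
  group_by(Species) %>%
  summarise(mean_petal_length = mean(Petal_Length, na.rm = TRUE)) %>%
  collect()

print(species_means)

# Close the connection when finished
spark_disconnect(sc)
```

Only the summarized rows come back to R, so the full data set never has to fit in R's memory.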

Demo

  • Developed a Shiny app in R using rsparkling and sparklyr
  • The app builds simple automated scoring models
    • Performs model tuning via cross-validation
    • Provides various metrics of model performance
    • Allows the user to export the final tuned models from H2O as POJO files
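
The tuning and export steps listed above can be sketched directly with the h2o R package. This is not the app's actual code: it is a minimal sketch that assumes the h2o package and Java are installed, uses the built-in iris data as a hypothetical stand-in for scored responses, and tunes only one GBM setting (tree depth). The variable names are illustrative.

```r
library(h2o)

# Start (or connect to) a local H2O cluster
h2o.init()

# Hypothetical stand-in data: iris instead of real scoring data
train <- as.h2o(iris)
predictors <- setdiff(names(iris), "Species")

# Fit one GBM per candidate depth, each scored with 5-fold cross-validation
depths <- c(2, 4, 6)
models <- lapply(depths, function(d) {
  h2o.gbm(x = predictors, y = "Species", training_frame = train,
          ntrees = 50, max_depth = d, nfolds = 5, seed = 1)
})

# Cross-validated log loss for each candidate (lower is better)
cv_logloss <- sapply(models, function(m) h2o.logloss(m, xval = TRUE))
best <- models[[which.min(cv_logloss)]]

# Export the selected model as a POJO (a plain Java source file)
pojo_dir <- tempdir()
h2o.download_pojo(best, path = pojo_dir)
```

The same pattern applies to h2o.randomForest() and h2o.deeplearning() for the RF and NN models; a grid search via h2o.grid() would be the natural next step when tuning more than one setting.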