SparkR 1.5 MLlib Logistic Regression Example

Objectives

Play with some predictions with SparkR MLlib logistic regression model.

Background

Databricks announced in June that Spark 1.4 would integrate with R. This became possible after merging AMPLab SparkR iniciative. Now in Spark 1.5 release, SparkR comes with it’s first integration with MLlib: regression models

First impressions

SparkR is a R package, and for that reason, MLlib algorithms should be more R-user frendly and a little bit different than Java, Scala or Python implementations. This is just first integration and only regression models with glm() function is available. Behind the scenes glm() is an abstraction that make calls to MLlib execute it’s regression model implementations.

Ok, show us some code, you say!!

Loading SparkContext

If you’re running Spark >= 1.4 from ./bin/sparkR, SparkContext and SQLContext are automatically loaded. Otherwise, if you’re using RStudio, you need to setup environment:

#Environment variables. It is mandatory SPARK_HOME and JAVA_HOME being set
Sys.setenv(SPARK_HOME="/usr/local/lib/R/site-library/spark-1.5/")

.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))

Sys.setenv(JAVA_HOME="/usr/lib/jvm/java-8-oracle/")

#Sys.setenv('SPARKR_SUBMIT_ARGS'='"--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell"')

library(SparkR)

#Just stop some running SparkContext
#sparkR.stop()

#Init SparkContext and SQLContext
sc <- sparkR.init(master = "local")

## Launching java with spark-submit command /usr/local/lib/R/site-library/spark-1.5//bin/spark-submit   "--packages" "com.databricks:spark-csv_2.10:1.0.3" "sparkr-shell" /tmp/Rtmp5Ezmhx/backend_port3b885818ca1a

sqlContext <- sparkRSQL.init(sc)

After creating SparkContext, we can instantiate a data frame with createDataFrame() function. If you head created data frame, you can see that is a Spark managed data frame. In this example, I worked with mtcars dataset.

mtcarsDF <- createDataFrame(sqlContext, mtcars)
head(mtcarsDF)

##    mpg cyl disp  hp drat    wt  qsec vs am gear carb
## 1 21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
## 2 21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
## 3 22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
## 4 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
## 5 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
## 6 18.1   6  225 105 2.76 3.460 20.22  1  0    3    1

Logistic Regression

Now, let’s predict mtcars vs (if car has a V engine) using other variables as predictors. Just set glm() function with binomial family. glm() will call MLlib LogisticRegressionWithLBFGS implementation and output a model.

Notice that summary is supposed to print model coefficients and other coefficientes, but its not implemented yet: https://issues.apache.org/jira/browse/SPARK-9492

#binomial glm for classification problem
model <- glm(vs ~ mpg + disp + hp + wt , data = mtcarsDF, family = "binomial")

#Warning: suppose to print model coefficients, but its not implemented: https://issues.apache.org/jira/browse/SPARK-9492
#summary(model)

#prediction over same dataset. Just for fun :)
predictions <- predict(model, newData = mtcarsDF )

#select just vs real value and predicted
modelPrediction <- select(predictions, "vs", "prediction")
head(modelPrediction)

##   vs prediction
## 1  0          0
## 2  0          1
## 3  1          1
## 4  1          1
## 5  0          0
## 6  1          1

Evaluation

SparkR does not provide interface for MLlib evaluation methods yet. So in order to just give some accuracy metric (and test SparkSQL abstraction), absolute error rate is computed manually:

#error variable: when vs and predicted differs
modelPrediction$error <- abs(modelPrediction$vs - modelPrediction$prediction)

#modelPrediction is now visible to SQLContext
registerTempTable(modelPrediction, "modelPrediction")

num_errors <- sql(sqlContext, "SELECT count(error) FROM modelPrediction WHERE error = 1")
total_errors <- sql(sqlContext, "SELECT count(error) FROM modelPrediction")

#model error rate
training_acc <- collect(num_errors) / collect(total_errors)
training_acc

##      _c0
## 1 0.0625

Conclusions

SparkR still embrionary with just regression models using MLlib, however it’s intended to be a powerful data science tool for R users in the future.