DUE DATE March 11, 2016, 11:59pm PST
NOTE I expect you to be working on your assignments, and assignment 4 is an easy one. The entire solution is provided in the solution folder as a markdown file, but you are encouraged to think through how you would solve it yourself as an intellectual exercise. We will discuss parts of the solution in a class lecture.
Submission Instructions: If you want to try it on your own, feel free to peek at the solution if you get stuck. In the worst case, you can even run the solution (markdown) completely, submit it, and still receive full credit.
For this assignment, upload both the markdown (ass4.Rmd) and the HTML output (ass4.html) to your private directory on stat290.stanford.edu.
We briefly discussed the glmnet package in class. See the paper Regularization Paths for Generalized Linear Models via Coordinate Descent by Friedman, Hastie, and Tibshirani, Journal of Statistical Software, Vol. 33, Issue 1, where the Internet Ad document classification problem with mostly binary features is discussed.
Install the standard glmnet package from CRAN. The code you want to run is provided in the solution folder along with the data. There is obviously no one correct answer, and you may or may not see differences in the times, because we are only running a small experiment. Note also that we are doing 10-fold cross-validation with a small number of workers.
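If any of the packages used below are missing, something along these lines should install them (a sketch; adjust the list to what you already have):

## Install glmnet plus the foreach backends and plotting package used below
install.packages(c("glmnet", "doSNOW", "doParallel", "ggplot2"))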
Experiment with 2, 3, and 4 workers and with two backends (parallel and snow) on your own machine. The associated functions are runParallel and runSnow.
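A quick way to see how many workers your machine can reasonably support (a sketch using the parallel package):

## Number of logical cores available on this machine
parallel::detectCores()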
## SNOW Run
library(glmnet)
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-2
runSnow <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    ## Read the Internet Ad data (predictor matrix x and response y)
    internetAd <- readRDS("internetAd.RDS")
    catn("Percent non-zero per glmnet paper",
         (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
             (prod(dim(internetAd$x)) + length(internetAd$y)))
    stopifnot(require(doSNOW))
    ## Register a snow cluster of n workers as the foreach backend
    cl <- makeCluster(n)
    registerDoSNOW(cl)
    set.seed(seed)
    ## Time the parallel 10-fold cross-validation
    time <- system.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                        family = "binomial", type.measure = "class",
                                        parallel = TRUE))
    stopCluster(cl)
    list(time = time, cv = cv)
}
resultsSnow <- lapply(2:5, runSnow)
## Percent non-zero per glmnet paper 0.01166918
## Loading required package: doSNOW
## Loading required package: iterators
## Loading required package: snow
## Percent non-zero per glmnet paper 0.01166918
## Percent non-zero per glmnet paper 0.01166918
## Percent non-zero per glmnet paper 0.01166918
A plot. Don’t be surprised if you see no performance gains. The overhead is significant.
library(ggplot2)
d <- data.frame(nWorkers = 2:5, t(sapply(resultsSnow, function(x) x$time)))
qplot(x = nWorkers, y = user.self, geom="line", data=d)
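The user.self column measures CPU time in the master process only; for parallel runs the elapsed (wall-clock) times are also worth checking. A minimal sketch, assuming resultsSnow as above:

## Wall-clock time for each snow run (2 to 5 workers)
sapply(resultsSnow, function(x) x$time[["elapsed"]])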
Check results.
lapply(resultsSnow, function(x) x$cv$lambda.min)
## [[1]]
## [1] 0.001919066
##
## [[2]]
## [1] 0.001919066
##
## [[3]]
## [1] 0.001919066
##
## [[4]]
## [1] 0.001919066
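As a programmatic sanity check (a sketch), the selected lambda.min should be identical across worker counts, since each run uses the same seed and data:

## All runs should agree on lambda.min
stopifnot(length(unique(sapply(resultsSnow, function(x) x$cv$lambda.min))) == 1)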
## Parallel Run
runParallel <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    ## Read the Internet Ad data (predictor matrix x and response y)
    internetAd <- readRDS("internetAd.RDS")
    catn("Percent non-zero per glmnet paper",
         (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
             (prod(dim(internetAd$x)) + length(internetAd$y)))
    stopifnot(require(doParallel))
    ## Register n workers via doParallel (no explicit cluster object needed)
    registerDoParallel(n)
    set.seed(seed)
    ## Time the parallel 10-fold cross-validation
    time <- system.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                        family = "binomial", type.measure = "class",
                                        parallel = TRUE))
    list(time = time, cv = cv)
}
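Unlike runSnow, runParallel does not create an explicit cluster: registerDoParallel(n) uses fork-based workers on Unix-alikes and a socket cluster on Windows. If you prefer symmetric setup and teardown, a sketch of the explicit-cluster variant of that portion would be:

cl <- makeCluster(n)      # explicit cluster works on all platforms
registerDoParallel(cl)
## ... run cv.glmnet as above ...
stopCluster(cl)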
resultsParallel <- lapply(2:5, runParallel)
## Percent non-zero per glmnet paper 0.01166918
## Loading required package: doParallel
## Loading required package: parallel
##
## Attaching package: 'parallel'
## The following objects are masked from 'package:snow':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, clusterSplit, makeCluster,
## parApply, parCapply, parLapply, parRapply, parSapply,
## splitIndices, stopCluster
## Percent non-zero per glmnet paper 0.01166918
## Percent non-zero per glmnet paper 0.01166918
## Percent non-zero per glmnet paper 0.01166918
A plot.
d <- data.frame(nWorkers = 2:5, t(sapply(resultsParallel, function(x) x$time)))
qplot(x = nWorkers, y = user.self, geom="line", data=d)
Results.
lapply(resultsParallel, function(x) x$cv$lambda.min)
## [[1]]
## [1] 0.001919066
##
## [[2]]
## [1] 0.001919066
##
## [[3]]
## [1] 0.001919066
##
## [[4]]
## [1] 0.001919066
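It can also be instructive to compare the two backends on elapsed (wall-clock) time directly; a minimal sketch, assuming resultsSnow and resultsParallel as computed above:

## Side-by-side elapsed times for the snow and parallel backends
dBoth <- rbind(
    data.frame(nWorkers = 2:5, backend = "snow",
               elapsed = sapply(resultsSnow, function(x) x$time[["elapsed"]])),
    data.frame(nWorkers = 2:5, backend = "parallel",
               elapsed = sapply(resultsParallel, function(x) x$time[["elapsed"]])))
qplot(x = nWorkers, y = elapsed, colour = backend, geom = "line", data = dBoth)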
For $n=3$ workers, produce a worker utilization plot for `runSnow` using `snow.time`. Replace `system.time` in the function with `snow.time` and just `plot` the resulting object. Rename the function if that helps.
Just for clarity, let’s rename the function to snowPlot after we make the changes.
snowPlot <- function(n, seed = 12345) {
    catn <- function(...) cat(..., "\n")
    internetAd <- readRDS("internetAd.RDS")
    catn("Percent non-zero per glmnet paper",
         (sum(internetAd$x > 0) + sum(internetAd$y > 0)) /
             (prod(dim(internetAd$x)) + length(internetAd$y)))
    stopifnot(require(doSNOW))
    cl <- makeCluster(n)
    registerDoSNOW(cl)
    set.seed(seed)
    ## snow.time (instead of system.time) collects per-worker timing data
    plot <- snow.time(cv <- cv.glmnet(internetAd$x, internetAd$y,
                                      family = "binomial", type.measure = "class",
                                      parallel = TRUE))
    stopCluster(cl)
    list(plot = plot, cv = cv)
}
snowResults <- snowPlot(n = 3)
## Percent non-zero per glmnet paper 0.01166918
Utilization plot for snow.
plot(snowResults$plot)
sessionInfo()
## R version 3.2.3 (2015-12-10)
## Platform: x86_64-apple-darwin13.4.0 (64-bit)
## Running under: OS X 10.10.5 (Yosemite)
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## attached base packages:
## [1] parallel stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] hash_2.2.6 doParallel_1.0.10 ggplot2_2.0.0 doSNOW_1.0.14
## [5] snow_0.4-1 iterators_1.0.8 glmnet_2.0-2 foreach_1.4.3
## [9] Matrix_1.2-3
##
## loaded via a namespace (and not attached):
## [1] Rcpp_0.12.2 knitr_1.12.3 magrittr_1.5 munsell_0.4.2
## [5] colorspace_1.2-6 lattice_0.20-33 stringr_1.0.0 plyr_1.8.3
## [9] tools_3.2.3 grid_3.2.3 gtable_0.1.2 htmltools_0.3
## [13] digest_0.6.9 formatR_1.2.1 codetools_0.2-14 evaluate_0.8
## [17] rmarkdown_0.9.2 labeling_0.3 stringi_1.0-1 compiler_3.2.3
## [21] scales_0.3.0