Spike Computational cost

This spike is intended to give you ideas on how to make your code more efficient. The data used is the indoor positioning data set loaded below from trainingData.csv.

It covers four approaches: smart samples, modeling without caret, optimization of the random forest (mtry), and parallel processing.

SMART SAMPLES

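First, imagine you want to sample the data to try different models faster. You could use the function sample_n on the whole data set, but you would run the risk of getting an unrepresentative sample (such as all observations in building 0). Grouping by floor and building before sampling keeps every combination represented.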

Load data

pacman::p_load(readr, dplyr, caret, plotly, htmltools)


train <- read_csv("trainingData.csv", na = c("N/A"))

Sample data

sample <- train %>% group_by(FLOOR, BUILDINGID) %>% sample_n(10) %>% ungroup()

Check the frequency per floor

table(sample$FLOOR)
## 
##  0  1  2  3  4 
## 30 30 30 30 10

Check the frequency per building

table(sample$BUILDINGID)
## 
##  0  1  2 
## 40 40 50
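
The effect of grouping before sampling can be checked on a toy data frame. The sketch below is illustrative only: the data frame toy, its SIGNAL column, and the row counts are made up and not taken from trainingData.csv. It compares a plain sample_n() with a grouped one.

```r
library(dplyr)

set.seed(42)

# Hypothetical, deliberately unbalanced data: building 0 dominates
toy <- data.frame(
  BUILDINGID = rep(c(0, 1, 2), times = c(900, 80, 20)),
  SIGNAL     = rnorm(1000)
)

# Plain random sample: the small buildings may get very few rows, or none
plain <- toy %>% sample_n(30)
table(plain$BUILDINGID)

# Grouped sample: exactly 10 rows from every building
strat <- toy %>% group_by(BUILDINGID) %>% sample_n(10) %>% ungroup()
table(strat$BUILDINGID)
```

In recent dplyr versions sample_n() is superseded by slice_sample(n = 10), which can be used in the same grouped pipeline.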

Plot the sample: Building 0, Building 1, Building 2

sample$BUILDINGID <- as.character(sample$BUILDINGID)

    
a <- htmltools::tagList()
for (i in unique(sample$BUILDINGID)) {
  a[[i]] <- sample %>%
    dplyr::filter(BUILDINGID == i) %>%
    plot_ly(type = "scatter3d",
            x = ~ LATITUDE,
            y = ~ LONGITUDE,
            z = ~ FLOOR,
            mode = "markers")
}
a[[1]] # Building 0
a[[2]] # Building 1
a[[3]] # Building 2

SPECIFIC PACKAGES

Random Forest: package randomForest

This is the most commonly used package for training a random forest. It’s user friendly and robust. If you want to learn more about other packages, check this resource.

The main parameters of the function randomForest are:
  • ntree: number of trees to grow
  • mtry: number of variables randomly sampled as candidates at each split
  • importance: should importance of predictors be assessed? Keep in mind that if your data includes categorical variables with different numbers of levels, random forests are biased in favor of the variables with more levels.

Another useful function from this package is tuneRF(). Starting with the default value of mtry, it searches for the value that gives the lowest out-of-bag error.

Your turn! Try to obtain the best mtry for your data and train a random forest using this package and the caret package.

# Load package
library(randomForest)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(train), value = TRUE)

# Get the best mtry
bestmtry_rf <- tuneRF(sample[WAPs], sample$LONGITUDE, ntreeTry = 100, stepFactor = 2, improve = 0.05, trace = TRUE, plot = TRUE)

# Train a random forest using that mtry (22 is hard-coded here; replace it with the value tuneRF finds for your sample)
system.time(rf_reg <- randomForest(y = sample$LONGITUDE, x = sample[WAPs], importance = TRUE, ntree = 100, mtry = 22))

# Train a random forest using the caret package (train() also resamples by default, so expect it to take longer)
system.time(rf_reg_caret <- train(y = sample$LONGITUDE, x = sample[WAPs], method = "rf", ntree = 100, tuneGrid = expand.grid(mtry = 22)))
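
Instead of copying the best mtry by hand, it can be read from the matrix that tuneRF() returns (columns mtry and OOBError). A minimal sketch, assuming the randomForest package is installed; the built-in mtcars data and the names x, y, tuned and best_mtry are illustrative stand-ins, not part of the original exercise:

```r
library(randomForest)

set.seed(123)

# Stand-in data: predict mpg from the other mtcars columns
x <- mtcars[, -1]
y <- mtcars$mpg

# Search around the default mtry; one row per mtry value tried
tuned <- tuneRF(x, y, ntreeTry = 100, stepFactor = 2, improve = 0.05,
                trace = FALSE, plot = FALSE)

# Pick the mtry with the lowest out-of-bag error
best_mtry <- tuned[which.min(tuned[, "OOBError"]), "mtry"]

# Train the final forest with that value
rf <- randomForest(x = x, y = y, ntree = 100, mtry = best_mtry,
                   importance = TRUE)
```

With the spike data, the same pattern would apply to bestmtry_rf in place of tuned.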

KNN: caret package

Explore the main parameters of the functions knn3() for classification and knnreg() for regression. Both belong to the caret package.

Train two knn models with these functions and with the train() function from caret:

# Load the package
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Calculate the pre-processing parameters from the dataset
# (caret warns about WAPs that are constant within the sample: they have zero variance and cannot be scaled)
preprocessParams <- preProcess(sample[WAPs], method = c("center", "scale"))

# Transform the WAPs using the parameters
stand_waps <- predict(preprocessParams, sample[WAPs])

# Complete dataset (BUILDINGID must be a factor for classification)
stand_dataset <- cbind(stand_waps, BUILDINGID = factor(sample$BUILDINGID), LONGITUDE = sample$LONGITUDE)

# Train two classification knn (with knn3 and train)
system.time(knn_clasif <- knn3(x = as.matrix(stand_dataset[WAPs]), y = stand_dataset$BUILDINGID))

system.time(knn_clasif_caret <- train(y = stand_dataset$BUILDINGID, x = stand_dataset[WAPs], method = "knn"))

# Train two regression knn (with knnreg and train)
system.time(knn_reg <- knnreg(x = as.matrix(stand_dataset[WAPs]), y = stand_dataset$LONGITUDE))

system.time(knn_reg_caret <- train(y = stand_dataset$LONGITUDE, x = stand_dataset[WAPs], method = "knn"))

SVM: e1071 package

Explore the main parameters of the function svm(), which handles both classification and regression.

Read this resource for more information on svm() for classification and regression.

Train two svm models with this package and with the caret package:

# Load the packages
library(e1071)
library(caret)

# Save the WAP column names in a vector
WAPs <- grep("WAP", names(sample), value = TRUE)

# Train two classification svm (with svm and train)
system.time(svm_clasif <- svm(y = stand_dataset$BUILDINGID, x = stand_dataset[WAPs]))

system.time(svm_clasif_caret <- train(y = stand_dataset$BUILDINGID, x = stand_dataset[WAPs], method = "svmLinear"))

# Train two regression svm (with svm and train)
system.time(svm_reg <- svm(y = stand_dataset$LONGITUDE, x = stand_dataset[WAPs]))

system.time(svm_reg_caret <- train(y = stand_dataset$LONGITUDE, x = stand_dataset[WAPs], method = "svmLinear"))

PARALLEL PROCESSING

A computer usually has multiple cores. Typically, R uses only one of them, but we can increase this number, which lets us run more computations at the same time.

How to do it on Windows
  • Install the doParallel package
  • Check how many cores you have with the function detectCores()
  • Create a cluster with the number of cores that you would like to use with the function makeCluster(). A good practice is to leave one core free for other tasks.
  • Register the cluster with the function registerDoParallel()
  • When you are done, release the cores with the function stopCluster()
How to do it on Mac/Linux
  • Install the doMC package
  • Check how many cores you have with the function detectCores() from the parallel package. getDoParWorkers() reports how many workers are currently registered, not how many cores the machine has.
  • Decide how many cores you would like to use. A good practice is to leave one free for other tasks. No makeCluster() call is needed with doMC.
  • Register that number of cores with the function registerDoMC(), passing it through the cores argument

Now you can apply parallel processing! For example, caret can run the resampling iterations (such as cross-validation) of a random forest in parallel when trainControl() is called with the parameter “allowParallel = TRUE”.

Challenge: Train the same sample with parallel processing

# Load the library
library(doParallel)

# Check number of cores
detectCores()

# Save the number of cores I'm going to use
cluster <- makeCluster(detectCores() - 1)

# Register the cluster
registerDoParallel(cluster)

# Apply it on the cross validation
fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3, allowParallel = TRUE)
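
The fitControl object above only takes effect once it is passed to train(). A minimal sketch of the complete cycle, assuming the caret, doParallel and randomForest packages are installed; the built-in iris data, the two-worker cluster and the name rf_par are stand-ins chosen only to keep the example small:

```r
library(caret)
library(doParallel)

# Two workers keep the sketch light; on real data use detectCores() - 1
cluster <- makeCluster(2)
registerDoParallel(cluster)

fitControl <- trainControl(method = "repeatedcv", number = 10, repeats = 3,
                           allowParallel = TRUE)

set.seed(123)
# iris is a stand-in for the WAP sample
rf_par <- train(x = iris[, 1:4], y = iris$Species,
                method = "rf", ntree = 100,
                trControl = fitControl,
                tuneGrid = expand.grid(mtry = 2))

# Release the workers and go back to sequential execution
stopCluster(cluster)
registerDoSEQ()
```

Wrapping the train() call in system.time(), with and without the registered cluster, gives a direct comparison on your own sample. On very small data the cost of starting the workers can outweigh the gain.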

SAVING AND LOADING MODELS

You can save your best models to a file. This way, you will be able to load/share them later.
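  • To save a model you can use one of two functions: save(____.rda) or saveRDS(____.rds)
  • To load a model use the matching function: load(____.rda) or readRDS(____.rds). load() restores the object under its original name, while readRDS() returns the object so that you can assign it to any name.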

Your turn! Try to save and load some models.

# Save a model
saveRDS(rf_reg, file = "RF_Model.rds")

# Load a model
final_model <- readRDS("RF_Model.rds")
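
The difference between the two pairs of functions can be seen in a small round trip. The sketch below uses only base R; the linear model on the built-in cars data and the temporary file paths are stand-ins for a real trained model and file name:

```r
# A small linear model on the built-in cars data stands in for a trained model
model <- lm(dist ~ speed, data = cars)

# saveRDS()/readRDS(): one object per file, and you choose the name on loading
rds_path <- tempfile(fileext = ".rds")
saveRDS(model, file = rds_path)
model_from_rds <- readRDS(rds_path)

# save()/load(): the object comes back under its original name
rda_path <- tempfile(fileext = ".rda")
save(model, file = rda_path)
rm(model)
load(rda_path)  # recreates `model` in the current environment

# Both copies give the same predictions
new_data <- data.frame(speed = c(10, 20))
all.equal(predict(model, new_data), predict(model_from_rds, new_data))
```

readRDS() is usually the safer choice when sharing models, because load() can silently overwrite an existing object that has the same name.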

Gabriel Ristow Cidral / Sara Marin Lopez

11/04/2019