Introduction

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement — a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har

The Datasets

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The original article is located at Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements

Task Description

You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Getting and Cleaning Data, Exploratory Analysis

# remove dplyr if loaded detach_package(dplyr)
# first load plyr 
RequireOrInstall("plyr")
# then load dplyr
RequireOrInstall("dplyr")

# NAVIGATE TO DATASETS DIRECTORY USING setwd() command
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(training) # 19622 rows, 160 cols
## [1] 19622   160
dim(testing) # 20 rows, 160 cols
## [1]  20 160
fulldata <- bind_rows(training,testing)

check_for_NAs <- function(x,useceil=F){
  if(useceil==T){
  apply(as.data.frame(x),2,FUN=function(x){ceiling(sum(is.na(x))/length(x))})
  } else {
    apply(as.data.frame(x),2,FUN=function(x){sum(is.na(x))/length(x)})
  }
}

NAs <- check_for_NAs(fulldata,useceil = T)
NAs_w <- check_for_NAs(fulldata,useceil = F)
NAs_n <- as.numeric(NAs) # 67 NA variables -- so we have 160 - 67 = 93 that are mostly free of NA's

featuredata <- fulldata%>%select(which(NAs_n==0))
dim(featuredata)
## [1] 19642    59
# these variables have less than 50% NA variables also subsetted out
features <- names(fulldata[,colSums(is.na(fulldata)) == 0])
features
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "gyros_belt_x"        
## [13] "gyros_belt_y"         "gyros_belt_z"         "accel_belt_x"        
## [16] "accel_belt_y"         "accel_belt_z"         "magnet_belt_x"       
## [19] "magnet_belt_y"        "magnet_belt_z"        "roll_arm"            
## [22] "pitch_arm"            "yaw_arm"              "total_accel_arm"     
## [25] "gyros_arm_x"          "gyros_arm_y"          "gyros_arm_z"         
## [28] "accel_arm_x"          "accel_arm_y"          "accel_arm_z"         
## [31] "magnet_arm_x"         "magnet_arm_y"         "magnet_arm_z"        
## [34] "roll_dumbbell"        "pitch_dumbbell"       "yaw_dumbbell"        
## [37] "total_accel_dumbbell" "gyros_dumbbell_x"     "gyros_dumbbell_y"    
## [40] "gyros_dumbbell_z"     "accel_dumbbell_x"     "accel_dumbbell_y"    
## [43] "accel_dumbbell_z"     "magnet_dumbbell_x"    "magnet_dumbbell_y"   
## [46] "magnet_dumbbell_z"    "roll_forearm"         "pitch_forearm"       
## [49] "yaw_forearm"          "total_accel_forearm"  "gyros_forearm_x"     
## [52] "gyros_forearm_y"      "gyros_forearm_z"      "accel_forearm_x"     
## [55] "accel_forearm_y"      "accel_forearm_z"      "magnet_forearm_x"    
## [58] "magnet_forearm_y"     "magnet_forearm_z"
length(features) # so only 59 of 93 variables contain no NA's
## [1] 59
featuredata<- subset(featuredata, select = features) 
featuredata
## Source: local data frame [19,642 x 59]
## 
##        X user_name raw_timestamp_part_1 raw_timestamp_part_2
##    (int)    (fctr)                (int)                (int)
## 1      1  carlitos           1323084231               788290
## 2      2  carlitos           1323084231               808298
## 3      3  carlitos           1323084231               820366
## 4      4  carlitos           1323084232               120339
## 5      5  carlitos           1323084232               196328
## 6      6  carlitos           1323084232               304277
## 7      7  carlitos           1323084232               368296
## 8      8  carlitos           1323084232               440390
## 9      9  carlitos           1323084232               484323
## 10    10  carlitos           1323084232               484434
## ..   ...       ...                  ...                  ...
## Variables not shown: cvtd_timestamp (chr), new_window (chr), num_window
##   (int), roll_belt (dbl), pitch_belt (dbl), yaw_belt (dbl),
##   total_accel_belt (int), gyros_belt_x (dbl), gyros_belt_y (dbl),
##   gyros_belt_z (dbl), accel_belt_x (int), accel_belt_y (int), accel_belt_z
##   (int), magnet_belt_x (int), magnet_belt_y (int), magnet_belt_z (int),
##   roll_arm (dbl), pitch_arm (dbl), yaw_arm (dbl), total_accel_arm (int),
##   gyros_arm_x (dbl), gyros_arm_y (dbl), gyros_arm_z (dbl), accel_arm_x
##   (int), accel_arm_y (int), accel_arm_z (int), magnet_arm_x (int),
##   magnet_arm_y (int), magnet_arm_z (int), roll_dumbbell (dbl),
##   pitch_dumbbell (dbl), yaw_dumbbell (dbl), total_accel_dumbbell (int),
##   gyros_dumbbell_x (dbl), gyros_dumbbell_y (dbl), gyros_dumbbell_z (dbl),
##   accel_dumbbell_x (int), accel_dumbbell_y (int), accel_dumbbell_z (int),
##   magnet_dumbbell_x (int), magnet_dumbbell_y (int), magnet_dumbbell_z
##   (dbl), roll_forearm (dbl), pitch_forearm (dbl), yaw_forearm (dbl),
##   total_accel_forearm (int), gyros_forearm_x (dbl), gyros_forearm_y (dbl),
##   gyros_forearm_z (dbl), accel_forearm_x (int), accel_forearm_y (int),
##   accel_forearm_z (int), magnet_forearm_x (int), magnet_forearm_y (dbl),
##   magnet_forearm_z (dbl)
#dropping the features related to timeseries
train <- as.data.frame(mutate(featuredata[1:19622,],classe=training$classe)[,c(2,8:60)])
test <- as.data.frame(featuredata[19623:19642,c(2,8:60)])

Data Preprocessing – Removing Redundant Multi-Collinearities from Data

I have not opted for to use the famous Principal Component Analysis preprocessing technique here since rotating the feature space will possibly result in a loss of descriptive power that I have not planned here. Thus, instead, I have chosen to remove the unnecessary collinearities in the data in the following way:

set.seed(123)
RequireOrInstall("caret")
numericset_names <- names(train[,sapply(train,is.numeric)|sapply(train,is.integer)])

numericset <- train[,sapply(train,is.numeric)|sapply(train,is.integer)]

#findCorrelation  searches through a correlation matrix and returns a vector of integers corresponding to columns to remove to reduce pair-wise correlations.

highCor_feat <- names(numericset[,findCorrelation(abs(cor(numericset)),0.88)])
lowcor_features <- setdiff(setdiff(names(train),"classe"),highCor_feat)

train_lowcor <- train[,c(lowcor_features,"classe")]
dim(train_lowcor)
## [1] 19622    46
#### ----------- 19623:19642 TESTING, the rest TRAINING ---------- ########

inTrain <- createDataPartition(train_lowcor$classe, p = 0.6, list = F)
train_ = train_lowcor[inTrain,]
validate_ = train_lowcor[-inTrain,]
classe_idx = which(names(train_) == "classe") 
classe_idx #index corresponds to classe
## [1] 46
str(train_)
## 'data.frame':    11776 obs. of  46 variables:
##  $ user_name           : Factor w/ 6 levels "adelmo","carlitos",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ yaw_belt            : num  -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
##  $ total_accel_belt    : int  3 3 3 3 3 3 3 3 3 3 ...
##  $ gyros_belt_x        : num  0 0.02 0.02 0.03 0.03 0.02 0.02 0 0 0 ...
##  $ gyros_belt_y        : num  0 0 0 0 0 0 0 0 0.02 0 ...
##  $ gyros_belt_z        : num  -0.02 -0.03 -0.02 0 -0.02 -0.02 0 -0.02 0 -0.02 ...
##  $ magnet_belt_x       : int  -2 -6 0 -3 -5 -2 -3 -6 1 -3 ...
##  $ magnet_belt_y       : int  600 604 603 609 596 602 606 598 600 603 ...
##  $ magnet_belt_z       : int  -305 -310 -312 -308 -317 -319 -309 -317 -316 -313 ...
##  $ roll_arm            : num  -128 -128 -128 -128 -128 -128 -128 -129 -129 -129 ...
##  $ pitch_arm           : num  22.5 22.1 22 21.6 21.5 21.5 21.4 21.3 21.2 21.2 ...
##  $ yaw_arm             : num  -161 -161 -161 -161 -161 -161 -161 -161 -161 -161 ...
##  $ total_accel_arm     : int  34 34 34 34 34 34 34 34 34 34 ...
##  $ gyros_arm_y         : num  -0.02 -0.03 -0.03 -0.03 -0.03 -0.03 -0.02 0 -0.02 -0.02 ...
##  $ gyros_arm_z         : num  -0.02 0.02 0 -0.02 0 0 -0.02 -0.02 -0.03 -0.02 ...
##  $ accel_arm_x         : int  -289 -289 -289 -288 -290 -288 -287 -289 -288 -289 ...
##  $ accel_arm_y         : int  110 111 111 110 110 111 111 110 108 109 ...
##  $ accel_arm_z         : int  -126 -123 -122 -124 -123 -123 -124 -122 -124 -122 ...
##  $ magnet_arm_x        : int  -368 -372 -369 -376 -366 -363 -372 -371 -373 -369 ...
##  $ magnet_arm_y        : int  344 344 342 334 339 343 338 337 336 340 ...
##  $ magnet_arm_z        : int  513 512 513 516 509 520 509 512 510 509 ...
##  $ roll_dumbbell       : num  12.9 13.4 13.4 13.3 13.1 ...
##  $ pitch_dumbbell      : num  -70.3 -70.4 -70.8 -70.9 -70.6 ...
##  $ yaw_dumbbell        : num  -85.1 -84.9 -84.5 -84.4 -84.7 ...
##  $ total_accel_dumbbell: int  37 37 37 37 37 37 37 37 36 37 ...
##  $ gyros_dumbbell_y    : num  -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 -0.02 ...
##  $ accel_dumbbell_x    : int  -232 -232 -234 -235 -233 -233 -234 -233 -231 -233 ...
##  $ accel_dumbbell_y    : int  46 48 48 48 47 47 48 47 47 47 ...
##  $ accel_dumbbell_z    : int  -270 -269 -269 -270 -269 -270 -269 -272 -268 -271 ...
##  $ magnet_dumbbell_x   : int  -561 -552 -558 -558 -564 -554 -552 -551 -557 -559 ...
##  $ magnet_dumbbell_y   : int  298 303 294 291 299 291 302 296 292 295 ...
##  $ magnet_dumbbell_z   : num  -63 -60 -66 -69 -64 -65 -69 -56 -62 -74 ...
##  $ roll_forearm        : num  28.3 28.1 27.9 27.7 27.6 27.5 27.2 27.1 27 26.9 ...
##  $ pitch_forearm       : num  -63.9 -63.9 -63.9 -63.8 -63.8 -63.8 -63.9 -64 -64 -64 ...
##  $ yaw_forearm         : num  -152 -152 -152 -152 -152 -152 -151 -151 -151 -151 ...
##  $ total_accel_forearm : int  36 36 36 36 36 36 36 36 36 36 ...
##  $ gyros_forearm_x     : num  0.03 0.02 0.02 0.02 0.02 0.02 0 0.02 0.02 0.02 ...
##  $ gyros_forearm_y     : num  -0.02 -0.02 -0.02 0 -0.02 0.02 0 -0.02 0 0 ...
##  $ gyros_forearm_z     : num  0 0 -0.03 -0.02 -0.02 -0.03 -0.03 0 -0.02 -0.02 ...
##  $ accel_forearm_x     : int  196 189 193 190 193 191 193 192 192 192 ...
##  $ accel_forearm_y     : int  204 206 203 205 205 203 205 204 206 203 ...
##  $ accel_forearm_z     : int  -213 -214 -215 -215 -214 -215 -215 -213 -216 -216 ...
##  $ magnet_forearm_x    : int  -18 -16 -9 -22 -17 -11 -15 -13 -16 -10 ...
##  $ magnet_forearm_y    : num  658 658 660 656 657 657 655 653 653 657 ...
##  $ magnet_forearm_z    : num  469 469 478 473 465 478 472 481 472 466 ...
##  $ classe              : Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(validate_)
## [1] 7846   46

Machine Learning Modeling

I used the default randomForest settings to build a classification algorithm – since it is a very powerful nonlinear model with potentially low bias, and the variance of which can be easily reduced using the cross-validation technique. I did not use a special cross validation set in parallel mode as I had split the training data into training and validation sets (60% and 40% of the initial training data size respectively), but I also performed cross validation for the random forest in serial mode.

I experimented with RandomForest and Support Vector Machine (SVM) classifiers.

I tried out a few runs of SVM, the first one with a sigmoidal and linear kernels (interestingly, parallelSVM doesn’t have a generic Gaussian kernel?), but I couldn’t figure it out fast how to make the 1-hot encoding for the parallel SVM to recognize multiple class response variable – that’s probably why it only made a binary classification (like it should by default) and that’s why the accuracy was only around 20% instead of near 100 %.

What I really opted for was the Random Forest classifier. I had some bug that I wasn’t able to fix – for some reason, cross validation did not work for my parallel impementation of Random Forest – I got some sort of confusing a runtime termination error. But the cross validation worked nicely with the Caret package, so I used cross-validation with the train function of Caret.

printconfusion <- function(ml_model,validate_){
confusionMatrix(predict(ml_model,validate_),validate_$classe)
}
RequireOrInstall <- function(package) {
  suppressWarnings({
    if (!require(package,character.only=TRUE)) {
      install.packages(package, dep=TRUE)
      require(package,character.only=TRUE)  
    }})
}
RequireOrInstall("randomForest")
RequireOrInstall("caret")
set.seed(123)
dim(train_)
## [1] 11776    46
dim(validate_)
## [1] 7846   46
classe_idx
## [1] 46
samsung.randomforest2 <-train(train_[,-classe_idx],train_[,classe_idx],method="rf",trControl=trainControl(method="cv",number=5))

samsung.randomforest <- randomForest(train_[,-classe_idx], train_[,classe_idx], importance = T, ntree = 700 )
print(samsung.randomforest, digits=3)
## 
## Call:
##  randomForest(x = train_[, -classe_idx], y = train_[, classe_idx],      ntree = 700, importance = T) 
##                Type of random forest: classification
##                      Number of trees: 700
## No. of variables tried at each split: 6
## 
##         OOB estimate of  error rate: 0.89%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 3344    1    1    0    2 0.001194743
## B   17 2255    7    0    0 0.010530935
## C    0   27 2024    3    0 0.014605648
## D    2    0   34 1892    2 0.019689119
## E    0    0    2    7 2156 0.004157044
# PLOT TRAINING ERROR
plot(samsung.randomforest, ylim=c(0,0.15))
legend('topright', colnames(samsung.randomforest$err.rate), col=1:5, fill=1:5)

printconfusion(samsung.randomforest,validate_)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2232    7    0    4    0
##          B    0 1508   17    1    0
##          C    0    3 1351   23    2
##          D    0    0    0 1256    3
##          E    0    0    0    2 1437
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9921          
##                  95% CI : (0.9899, 0.9939)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.99            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9934   0.9876   0.9767   0.9965
## Specificity            0.9980   0.9972   0.9957   0.9995   0.9997
## Pos Pred Value         0.9951   0.9882   0.9797   0.9976   0.9986
## Neg Pred Value         1.0000   0.9984   0.9974   0.9954   0.9992
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1922   0.1722   0.1601   0.1832
## Detection Prevalence   0.2859   0.1945   0.1758   0.1605   0.1834
## Balanced Accuracy      0.9990   0.9953   0.9916   0.9881   0.9981
#mtry -- number of attributes randomly bootstrapped at each node
#nodesize -- minimum size of terminal nodes
#ntrees(howmany trees, how many computing nodes)
timer <- proc.time()


############ PARALLEL FOREST ################
cores <- 4 
RequireOrInstall("doSNOW")
cluster1<-makeCluster(cores) #<- # of processors / hyperthreads on machine
registerDoSNOW(cluster1)


samsung.parallelforest <- foreach(ntree=rep(400, cores), .combine=combine,.multicombine=TRUE,.packages='randomForest') %dopar%
randomForest(train_[,-classe_idx],train_[,classe_idx], importance = F,do.trace=TRUE,ntree=ntree, nodesize=150,mtry=8)

#   command failed -- claims that response is not a factor !?
proc.time()-timer
##    user  system elapsed 
##    0.41    0.16   44.42
printconfusion(samsung.parallelforest,validate_)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2167  113    5   17    5
##          B   11 1186   89   10   26
##          C   16  177 1266  153   96
##          D   33   39    7 1087   48
##          E    5    3    1   19 1267
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8887          
##                  95% CI : (0.8816, 0.8956)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8592          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9709   0.7813   0.9254   0.8453   0.8786
## Specificity            0.9751   0.9785   0.9318   0.9806   0.9956
## Pos Pred Value         0.9393   0.8971   0.7412   0.8954   0.9784
## Neg Pred Value         0.9883   0.9491   0.9834   0.9700   0.9733
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2762   0.1512   0.1614   0.1385   0.1615
## Detection Prevalence   0.2940   0.1685   0.2177   0.1547   0.1651
## Balanced Accuracy      0.9730   0.8799   0.9286   0.9129   0.9371
#######################################

# Print the Predictions 

pred_rf <- predict(samsung.randomforest2,newdata=test)
pred_rf # the predictions
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

We see that although samsung.parallelforest depicts a 4-ensemble of 400 trees random forest model, it does worse than the samsung.randomforest model of 700 trees. The optimal maximum node size in the parallel version seems to be definitely less than 200, it may start to overfit with a node size around 200, I believe.

Out of Sample Error

Out of sample error can be defined as the error on the cases not in the training set. Since the original testing set here doesn’t contain the true labels, thus it will be impossible to estimate the Out-of-sample error beforehand. To be able to give an estimate for it, I split the initial training set randomly into two – into training and validation sets and I use the validation set records to estimate the out of sample error. So I could say that the out-of-the-bag error is either 1- accuracy of the model on the validation set (here: 1-0.9921=0.0079) or the cross validation error. OOB is also estimated when running the randomForest algorithm of the randomForest package.

Visualizing the most important features in the Random Forest Classifier

RequireOrInstall("plyr")
RequireOrInstall("dplyr")
RequireOrInstall("ggplot2")
importance <- varImp(samsung.randomforest)
importance <- data.frame(Variables = row.names(importance),
                        Importance = round(samsung.randomforest$importance[ ,'MeanDecreaseGini'],2))

# Create a rank variable based on decreasing importance
rankImportance <- (importance %>%
  mutate(Rank =dense_rank(desc(Importance))))%>%arrange(Rank)
first7 <- rankImportance[1:7,1]
inImp = createDataPartition(train_$classe, p = 0.05)[[1]]

featurePlot(train_[inImp,first7],train_$classe[inImp], plot = "pairs", xlab="Most Important Variables",ylab="Activity Type (Classe)")

# Use ggplot2 to visualize the relative importance of variables
ggplot(rankImportance[1:7,], aes(x = reorder(Variables, Importance), 
    y = Importance)) +
  geom_bar(stat='identity', colour = 'green') +
  geom_text(aes(x = Variables, y = 0.5, label = Rank),
    hjust=0, vjust=0.55, size = 4, colour = 'lavender',
    fontface = 'bold') +
  labs(x = 'Variables', title = 'Relative Variable Importance') +
  coord_flip()

Example of the tree rules for the 500’th tree in the random forest classifier

## [1] "total_accel_belt > 21.5"

Remarks

Removing the 7 multicollinear features (that is, using train_lowcor) from the training set lowered the accuracy of the classification by 0.0017 in the best scenario (from 0.9938 to 0.9921), which conveys the point that when one has built a versatile classifier (which the random forest is), compressing the feature space 3.5 times may not reduce the selected classification metric (e.g. accuracy) almost by any significant margin. Thus, the effective dimensionality reduction performed is from 159-dimensional to 46-dimensional feature space while achieving 99.3% of accuracy on the validation set. Thus, for me, the project has fulfilled its purpose.