1. Summary

With various wearable devices, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data in this project is measured from accelerometers on the belt, forearm, arm, and dumbell of 6 participants (http://groupware.les.inf.puc-rio.br/har). They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.The goal of your project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. The work was done firstly by Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H., 2013.

The project includes some intuitions into the data, exploratory analysis, actual prediction prediction and interpretation.

My cross validation approach is using random subsampling into training and validating sets. I try 3 different prediction methods, namely Decision tree, Random forest, and Gradient boosting machine (gbm) with and without principle component analysis (pca). The results show that Random forest is the best method on this data with validation accuracy of 99%.

2. Some details on the data and initial thoughts

The data has 5 classes:

The 4 sensors are attached to: arm (bicep/tricep), forearm (on glove, close to the wrist), belt, and weight (dumbbell).

My thoughts:

3. Data exploratory and cleaning

Loading library and data

set.seed(12345)
library(caret); library(e1071); library(gbm); library(randomForest)
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.1
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(parallel); library(doParallel)
## Warning: package 'doParallel' was built under R version 3.5.1
## Loading required package: foreach
## Loading required package: iterators
tr <- read.csv("pml-training.csv", na.strings = c("NA",""," "))

Now let’s have a look at the data:

names(tr)
##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "new_window"              
##   [7] "num_window"               "roll_belt"               
##   [9] "pitch_belt"               "yaw_belt"                
##  [11] "total_accel_belt"         "kurtosis_roll_belt"      
##  [13] "kurtosis_picth_belt"      "kurtosis_yaw_belt"       
##  [15] "skewness_roll_belt"       "skewness_roll_belt.1"    
##  [17] "skewness_yaw_belt"        "max_roll_belt"           
##  [19] "max_picth_belt"           "max_yaw_belt"            
##  [21] "min_roll_belt"            "min_pitch_belt"          
##  [23] "min_yaw_belt"             "amplitude_roll_belt"     
##  [25] "amplitude_pitch_belt"     "amplitude_yaw_belt"      
##  [27] "var_total_accel_belt"     "avg_roll_belt"           
##  [29] "stddev_roll_belt"         "var_roll_belt"           
##  [31] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [33] "var_pitch_belt"           "avg_yaw_belt"            
##  [35] "stddev_yaw_belt"          "var_yaw_belt"            
##  [37] "gyros_belt_x"             "gyros_belt_y"            
##  [39] "gyros_belt_z"             "accel_belt_x"            
##  [41] "accel_belt_y"             "accel_belt_z"            
##  [43] "magnet_belt_x"            "magnet_belt_y"           
##  [45] "magnet_belt_z"            "roll_arm"                
##  [47] "pitch_arm"                "yaw_arm"                 
##  [49] "total_accel_arm"          "var_accel_arm"           
##  [51] "avg_roll_arm"             "stddev_roll_arm"         
##  [53] "var_roll_arm"             "avg_pitch_arm"           
##  [55] "stddev_pitch_arm"         "var_pitch_arm"           
##  [57] "avg_yaw_arm"              "stddev_yaw_arm"          
##  [59] "var_yaw_arm"              "gyros_arm_x"             
##  [61] "gyros_arm_y"              "gyros_arm_z"             
##  [63] "accel_arm_x"              "accel_arm_y"             
##  [65] "accel_arm_z"              "magnet_arm_x"            
##  [67] "magnet_arm_y"             "magnet_arm_z"            
##  [69] "kurtosis_roll_arm"        "kurtosis_picth_arm"      
##  [71] "kurtosis_yaw_arm"         "skewness_roll_arm"       
##  [73] "skewness_pitch_arm"       "skewness_yaw_arm"        
##  [75] "max_roll_arm"             "max_picth_arm"           
##  [77] "max_yaw_arm"              "min_roll_arm"            
##  [79] "min_pitch_arm"            "min_yaw_arm"             
##  [81] "amplitude_roll_arm"       "amplitude_pitch_arm"     
##  [83] "amplitude_yaw_arm"        "roll_dumbbell"           
##  [85] "pitch_dumbbell"           "yaw_dumbbell"            
##  [87] "kurtosis_roll_dumbbell"   "kurtosis_picth_dumbbell" 
##  [89] "kurtosis_yaw_dumbbell"    "skewness_roll_dumbbell"  
##  [91] "skewness_pitch_dumbbell"  "skewness_yaw_dumbbell"   
##  [93] "max_roll_dumbbell"        "max_picth_dumbbell"      
##  [95] "max_yaw_dumbbell"         "min_roll_dumbbell"       
##  [97] "min_pitch_dumbbell"       "min_yaw_dumbbell"        
##  [99] "amplitude_roll_dumbbell"  "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell"   "total_accel_dumbbell"    
## [103] "var_accel_dumbbell"       "avg_roll_dumbbell"       
## [105] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
## [107] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
## [109] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
## [111] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
## [113] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
## [115] "gyros_dumbbell_z"         "accel_dumbbell_x"        
## [117] "accel_dumbbell_y"         "accel_dumbbell_z"        
## [119] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
## [121] "magnet_dumbbell_z"        "roll_forearm"            
## [123] "pitch_forearm"            "yaw_forearm"             
## [125] "kurtosis_roll_forearm"    "kurtosis_picth_forearm"  
## [127] "kurtosis_yaw_forearm"     "skewness_roll_forearm"   
## [129] "skewness_pitch_forearm"   "skewness_yaw_forearm"    
## [131] "max_roll_forearm"         "max_picth_forearm"       
## [133] "max_yaw_forearm"          "min_roll_forearm"        
## [135] "min_pitch_forearm"        "min_yaw_forearm"         
## [137] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
## [139] "amplitude_yaw_forearm"    "total_accel_forearm"     
## [141] "var_accel_forearm"        "avg_roll_forearm"        
## [143] "stddev_roll_forearm"      "var_roll_forearm"        
## [145] "avg_pitch_forearm"        "stddev_pitch_forearm"    
## [147] "var_pitch_forearm"        "avg_yaw_forearm"         
## [149] "stddev_yaw_forearm"       "var_yaw_forearm"         
## [151] "gyros_forearm_x"          "gyros_forearm_y"         
## [153] "gyros_forearm_z"          "accel_forearm_x"         
## [155] "accel_forearm_y"          "accel_forearm_z"         
## [157] "magnet_forearm_x"         "magnet_forearm_y"        
## [159] "magnet_forearm_z"         "classe"
dim(tr)
## [1] 19622   160
head(tr[,5:15],3)
##     cvtd_timestamp new_window num_window roll_belt pitch_belt yaw_belt
## 1 05/12/2011 11:23         no         11      1.41       8.07    -94.4
## 2 05/12/2011 11:23         no         11      1.41       8.07    -94.4
## 3 05/12/2011 11:23         no         11      1.42       8.07    -94.4
##   total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 1                3               <NA>                <NA>
## 2                3               <NA>                <NA>
## 3                3               <NA>                <NA>
##   kurtosis_yaw_belt skewness_roll_belt
## 1              <NA>               <NA>
## 2              <NA>               <NA>
## 3              <NA>               <NA>

The data has 19622 observations of 160 variables (the last one is the outcome). There are some variables with many NAs. I also notice that new_window has 406 values of yes:

summary(tr$new_window)
##    no   yes 
## 19216   406

and it corresponds to the number of NA in other variables, for example:

summary(tr$kurtosis_yaw_belt)
## #DIV/0!    NA's 
##     406   19216
summary(tr$amplitude_yaw_belt)
## #DIV/0!    0.00  0.0000    NA's 
##      10      12     384   19216

(19622 - 406 = 19216). Very interesting! Let’s have a look at the some variables where new_window is ‘yes’

idx <- which(tr$new_window=='yes')[1:5]
tr[idx,c("new_window","kurtosis_roll_belt","kurtosis_yaw_belt","amplitude_yaw_belt","roll_belt","gyros_belt_x")]
##     new_window kurtosis_roll_belt kurtosis_yaw_belt amplitude_yaw_belt
## 24         yes           5.587755           #DIV/0!             0.0000
## 52         yes          -0.997130           #DIV/0!             0.0000
## 76         yes           7.515290           #DIV/0!             0.0000
## 165        yes          -2.121212           #DIV/0!             0.0000
## 210        yes          -1.122273           #DIV/0!             0.0000
##     roll_belt gyros_belt_x
## 24       1.51         0.02
## 52       1.27         0.00
## 76       1.18         0.02
## 165      1.01         0.03
## 210    129.00        -0.43

and where new_window is ‘no’

idx <- which(tr$new_window=='no')[1:5]
tr[idx,c("new_window","kurtosis_roll_belt","kurtosis_yaw_belt","amplitude_yaw_belt","roll_belt","gyros_belt_x")]
##   new_window kurtosis_roll_belt kurtosis_yaw_belt amplitude_yaw_belt
## 1         no               <NA>              <NA>               <NA>
## 2         no               <NA>              <NA>               <NA>
## 3         no               <NA>              <NA>               <NA>
## 4         no               <NA>              <NA>               <NA>
## 5         no               <NA>              <NA>               <NA>
##   roll_belt gyros_belt_x
## 1      1.41         0.00
## 2      1.41         0.02
## 3      1.42         0.00
## 4      1.48         0.02
## 5      1.48         0.02

Now we have a better understand about the data structure. This is how it looks like:

A lot of variables contain mostly NAs (19216 with time_window = ‘no’) In these variables, where the data is not NA, (406, with time_window = ‘yes’), some variables have problematic measurements. For example: DIV/0, zeros. These variables apparently are the one with NA problem. I think these variables have problems in measurements: normally it record nothing (NAs are for NA, or missing data, or space) or problematic numbers.

My strategy to clean the data: remove columns with mostly NAs

na_count <- data.frame(sapply(tr, function(tr) sum(length(which(is.na(tr))))))
skipcol <- na_count == 19216
skipcol <- as.logical(skipcol)
tr1 <- tr[, -which(skipcol == 1)]
dim(tr1)
## [1] 19622    60

The data now has 60 variables. Let’s have a look at some variables:

wtype <- as.factor(tr1[,60]) #last column is the class
plot(tr1$roll_belt, pch=20, col=wtype)
legend("topleft", legend=levels(wtype), pch=16, col=unique(wtype))

The data seems to be sorted by class. I now plot all data to see if any variable could be factorized. A boxplot is useful to detect outlier but a regular plot even shows outliers better and indicate factorable variables:

par(mfrow = c(6,10), mar=c(1,1,1,1))
for (i in 1:60)
{
    plot(tr1[,i], pch=20, col="blue")
}

We can see some outliers in the data at variables 38, 39, 40, 45, 51, 52, 53. I will remove the points which is beyond 75th accummulated quantile from the data, then remove the first 7 variables that have nothing to do in prediction (time and index).

ol38<-as.numeric(tr1$gyros_dumbbell_x<(quantile(tr1$gyros_dumbbell_x)[1]+quantile(tr1$gyros_dumbbell_x)[2])/2)
ol39<-as.numeric(tr1$gyros_dumbbell_y>(quantile(tr1$gyros_dumbbell_y)[4]+quantile(tr1$gyros_dumbbell_y)[5])/2)
ol40<-as.numeric(tr1$gyros_dumbbell_z>(quantile(tr1$gyros_dumbbell_z)[4]+quantile(tr1$gyros_dumbbell_z)[5])/2)
ol45<-as.numeric(tr1$magnet_dumbbell_y<(quantile(tr1$magnet_dumbbell_y)[1]+quantile(tr1$magnet_dumbbell_y)[2])/2)
ol51<-as.numeric(tr1$gyros_forearm_x<(quantile(tr1$gyros_forearm_x)[1]+quantile(tr1$gyros_forearm_x)[2])/2)
ol52<-as.numeric(tr1$gyros_forearm_y>(quantile(tr1$gyros_forearm_y)[4]+quantile(tr1$gyros_forearm_y)[5])/2)
ol53<-as.numeric(tr1$gyros_forearm_z>(quantile(tr1$gyros_forearm_z)[4]+quantile(tr1$gyros_forearm_z)[5])/2)

skiprow <- ol38 + ol39 + ol40 + ol45 + ol51 + ol52 + ol53

tr2 <- tr1[-which(skiprow > 0),]

dim(tr2)
## [1] 19620    60

The data now has 19620 observations of 60 variables (the last column is the outcome). Let’s have a look at the data again, with color of classes:

par(mfrow = c(6,10), mar=c(1,1,1,1))
for (i in 1:60)
{
    plot(tr2[,i], pch=20, col=wtype)
}

Now there is no outlier, so we don’t miss anything.

Clearly now the data is sorted by classes (plot 1, 4, 60). Probably when measuring, the volunteers were asked to do a certain type of movement before moving to the others. They also indicate that the classes are not skewed.

Now let me remove the first 7 variables: timing and volunteer object.

tr4 <- tr2[,-c(1:7)]
dim(tr4)
## [1] 19620    53
par(mfrow = c(4,13), mar=c(1,1,1,1))
for (i in 1:52)
{
    plot(tr4[,i], pch=20, col=wtype)
}

names(tr4)
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"

The data now has 19620 observations of 53 variables (the last column is the outcome). Each sensor has 13 measurements and are displayed at each row, the order is: belt, arm, dumbbell, and forearm.

Findings

  • If we look closely to the plots, we can see that my initial thoughts of separating different classes do not work. There is no separation in any variable betwen the classes.
  • In terms of the amplitude ranges of measurements:
    • Belt measurement value ranges are separated from the other three sensors
    • Forearm and dumbbell measurements (last and second last rows) seem to be not similar
    • Forearm and arm measurements are more similar, especially in ‘total_accel’, ‘gyros_z’, ‘accel_x’, accel_z’.
    • If the data is written correctly, my intuition was wrong, and that’s important to not trusting the gut.

4. Cross validation and data preprocessing

I divide the training data into 3 parts: training, validation and test sets (random subsampling). Since the sample size is moderate (about 20,000), my dividing ratio between training/validation/test samples is: 60%/20%/20%.

inTrain <- createDataPartition(y=tr4$classe, p = 0.6, list=FALSE)
tr4tr <- tr4[inTrain, ]
tr4tmp <- tr4[-inTrain, ]
inVal <- createDataPartition(y=tr4tmp$classe, p = 0.5, list=FALSE)
tr4val <- tr4tmp[inVal, ]
tr4test <- tr4tmp[-inVal, ]

Now I normalize the data (actually this is not required when using tree-based methods but as a habit, I usually normalize the data before feeding it to any model).

preprocess <- preProcess(tr4tr, method = c("center", "scale"))   # build the preprocess "template"
tr4trp <- predict(preprocess, tr4tr)                             # apply the template on the training set
tr4valp <- predict(preprocess, tr4val)                           # apply the template on the validation set
tr4testp <- predict(preprocess, tr4test)                         # apply the template on the test set

5. Analysis strategy

I will train different models using training set. Three models will be tested:

Decision tree is tested but I’m afraid it won’t perform very well. Other methods may be better but is harder to interpret.

For demonstration, they are trained without parameter tuning. Keep in mind that the more sophisticated BGM performs better than simple RF if being tuned properly. However RF can be paralleled while GBM theoretically can not.

Notes: - Resampling: from boostrapping to k-fold cross-validation. - Parallelization: because trees are selected independently (unlike GBM where a tree depends on the previous ones), this will help to reduce computing time.

These trained models are then applied on the validation set. I will then pick the best model based on its accuracy performance on the validation set. Please note that model optimization and tuning is not focused on this project. So, most of the model parameters are kept as default.

Because training is time consuming, I extend the project by applying PCA before model training. PC transformation is applied on the training data so that 90% of the variance is preserved, then the transformed PC will be used to train, using the best methods seclected before.

At the end, I will compare all methods, in terms of performance and time duration. The best one will be applied on the test set for prediction.

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
conMtxPlot <- function(a){
    a <- prop.table(a,2)
    name <- c("A","B","C","D","E")
    conf <- as.data.frame(as.table(a))
    plot <- ggplot(conf)
    plot + geom_tile(aes(x=Reference, y = Prediction, fill=Freq))+
            scale_x_discrete(name = "Actual Class") + 
            scale_y_discrete(name="Predicted class") + 
            scale_fill_gradient(breaks=seq(from=-.5, to=4, by=.2)) + 
            labs(fill="Normalized\nFrequency") +
            geom_text(aes(x =Reference , y=Prediction, label = sprintf("%1.2f",Freq)), vjust = 1)
}

6. Train and validate models without PCA

Decision tree

# Training
start_time <- Sys.time()
fit_dt <- train(classe ~ ., data=tr4trp, method="rpart", trControl=trainControl(method="none"), tuneGrid=data.frame(cp=0.01))
Sys.time() - start_time
## Time difference of 1.27821 secs
# Validating
confusionMatrix(tr4valp$classe, predict(fit_dt, tr4valp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1032   26   17   18   23
##          B  136  414   76   31  102
##          C   44   32  447  124   37
##          D   61   16   42  413  111
##          E   22   23   20   58  598
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7402          
##                  95% CI : (0.7262, 0.7539)
##     No Information Rate : 0.3301          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6697          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7969   0.8102   0.7425   0.6413   0.6866
## Specificity            0.9680   0.8989   0.9286   0.9299   0.9597
## Pos Pred Value         0.9247   0.5455   0.6535   0.6423   0.8294
## Neg Pred Value         0.9063   0.9693   0.9521   0.9296   0.9147
## Prevalence             0.3301   0.1303   0.1535   0.1642   0.2220
## Detection Rate         0.2631   0.1055   0.1139   0.1053   0.1524
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.8825   0.8545   0.8356   0.7856   0.8231
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_dt, tr4valp))$table)

Random forest

cluster <- makeCluster(detectCores()-1) 
registerDoParallel(cluster)

# Training
start_time <- Sys.time()
fit_rf_par_rs <- train(classe ~ ., data = tr4trp, method = "rf", trControl = fitControl)
Sys.time() - start_time
## Time difference of 7.690241 mins
stopCluster(cluster)
registerDoSEQ()

# Validaing
confusionMatrix(tr4valp$classe, predict(fit_rf_par_rs, tr4valp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    4    0    0    0
##          B    3  751    5    0    0
##          C    0    4  676    4    0
##          D    0    1    9  632    1
##          E    0    0    3    1  717
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9911          
##                  95% CI : (0.9876, 0.9938)
##     No Information Rate : 0.2842          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9887          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9973   0.9882   0.9755   0.9922   0.9986
## Specificity            0.9986   0.9975   0.9975   0.9967   0.9988
## Pos Pred Value         0.9964   0.9895   0.9883   0.9829   0.9945
## Neg Pred Value         0.9989   0.9972   0.9948   0.9985   0.9997
## Prevalence             0.2842   0.1937   0.1767   0.1624   0.1830
## Detection Rate         0.2835   0.1914   0.1723   0.1611   0.1828
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9979   0.9928   0.9865   0.9944   0.9987
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_rf_par_rs, tr4valp))$table)

GBM

cluster <- makeCluster(detectCores()-1) 
registerDoParallel(cluster)

# Training
start_time <- Sys.time()
fit_gbm_rs <- train(classe ~ ., data = tr4trp, method="gbm", verbose=FALSE, trControl = fitControl)
Sys.time() - start_time
## Time difference of 3.691285 mins
stopCluster(cluster)
registerDoSEQ()

# Validation
confusionMatrix(tr4valp$classe, predict(fit_gbm_rs, tr4valp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1101   11    2    2    0
##          B   17  721   21    0    0
##          C    0   22  656    5    1
##          D    1    0   25  611    6
##          E    2    7    7    9  696
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9648          
##                  95% CI : (0.9586, 0.9704)
##     No Information Rate : 0.2858          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9555          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9822   0.9474   0.9226   0.9745   0.9900
## Specificity            0.9946   0.9880   0.9913   0.9903   0.9922
## Pos Pred Value         0.9866   0.9499   0.9591   0.9502   0.9653
## Neg Pred Value         0.9929   0.9874   0.9830   0.9951   0.9978
## Prevalence             0.2858   0.1940   0.1812   0.1598   0.1792
## Detection Rate         0.2807   0.1838   0.1672   0.1557   0.1774
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9884   0.9677   0.9570   0.9824   0.9911
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_gbm_rs, tr4valp))$table)

Interpretation

Comparing 3 methods, the validation accuracy of Decision tree, Random forest and GBM are: 74%, 99%, 96%, respectively.

Based from this result, the Random forest is the most robust method. However it is the most time consuming one (~6 minutes) compared to Decision tree (~1 second) and GBM (~3 minutes).

The next step of my analysis is apply PCA on the best 2 methods (Random forest and GBM), then train the models again to see how they perform.

7. Train and validate Random forest and GBM with PCA

PCA processing

preprocessPCA <- preProcess(tr4tr[,-53], method = "pca", thresh = 0.9)
preprocessPCA
## Created from 11775 samples and 52 variables
## 
## Pre-processing:
##   - centered (52)
##   - ignored (0)
##   - principal component signal extraction (52)
##   - scaled (52)
## 
## PCA needed 20 components to capture 90 percent of the variance
trainPCA <- predict(preprocessPCA, tr4tr[,-53])
valPCA <- predict(preprocessPCA,tr4val[,-53])
testPCA <- predict(preprocessPCA,tr4test[,-53])

Created from 11775 samples and 52 variables

Pre-processing: - centered (52) - ignored (0) - principal component signal extraction (52) - scaled (52)

PCA needed 20 components to capture 90 percent of the variance The analysis suggests that with 90% variance coverage, it requires only 20 principle components (instead of 52 variables) for training.

Random forest

cluster <- makeCluster(detectCores()-1) # convention to leave 1 core for OS
registerDoParallel(cluster)

start_time <- Sys.time()
fit_pca_rf <- train(tr4tr$classe, method = "rf", verbose=FALSE, x=trainPCA, trControl = fitControl)
Sys.time() - start_time
## Time difference of 4.108429 mins
stopCluster(cluster)
registerDoSEQ()

confusionMatrix(tr4val$classe, predict(fit_pca_rf, valPCA))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1101    3    4    4    4
##          B    9  740   10    0    0
##          C    5    8  664    4    3
##          D    3    0   31  609    0
##          E    0    4    9    3  705
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9735         
##                  95% CI : (0.968, 0.9783)
##     No Information Rate : 0.285          
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9665         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9848   0.9801   0.9248   0.9823   0.9902
## Specificity            0.9947   0.9940   0.9938   0.9897   0.9950
## Pos Pred Value         0.9866   0.9750   0.9708   0.9471   0.9778
## Neg Pred Value         0.9939   0.9953   0.9833   0.9966   0.9978
## Prevalence             0.2850   0.1925   0.1830   0.1580   0.1815
## Detection Rate         0.2807   0.1886   0.1693   0.1552   0.1797
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9897   0.9871   0.9593   0.9860   0.9926

GBM

cluster <- makeCluster(detectCores()-1) 
registerDoParallel(cluster)

start_time <- Sys.time()
fit_pca_gbm <- train(tr4tr$classe, method="gbm", verbose=FALSE, x=trainPCA, trControl = fitControl)
Sys.time() - start_time
## Time difference of 2.104061 mins
stopCluster(cluster)
registerDoSEQ()

confusionMatrix(tr4val$classe,predict(fit_pca_gbm, valPCA))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 980  38  33  57   8
##          B  71 590  56  22  20
##          C  56  58 551  13   6
##          D  38  20  82 481  22
##          E  44  58  47  30 542
## 
## Overall Statistics
##                                           
##                Accuracy : 0.8014          
##                  95% CI : (0.7886, 0.8138)
##     No Information Rate : 0.3031          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7482          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8242   0.7723   0.7165   0.7977   0.9064
## Specificity            0.9503   0.9465   0.9578   0.9512   0.9462
## Pos Pred Value         0.8781   0.7773   0.8056   0.7481   0.7517
## Neg Pred Value         0.9255   0.9450   0.9327   0.9628   0.9825
## Prevalence             0.3031   0.1947   0.1960   0.1537   0.1524
## Detection Rate         0.2498   0.1504   0.1405   0.1226   0.1382
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.8872   0.8594   0.8372   0.8744   0.9263

Interpretation

The results show that with more than half of the features, running time is reduced, but the validation accuracy is also reduced to 97% and 80% accordingly.

In reality, the selection between Random forest with or without PCA depends on the project (sample size, model update requirement…). For this project, and even when I didn’t perform any deep parameter tuning, I pick Random forest without PCA as my final model.

8. Prediction on Test set

# Apply on test set

confusionMatrix(tr4testp$classe, predict(fit_rf_par_rs, tr4testp))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1112    3    0    0    0
##          B    5  753    1    0    0
##          C    0    3  677    4    0
##          D    0    0    9  633    1
##          E    0    1    3    1  716
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9921          
##                  95% CI : (0.9888, 0.9946)
##     No Information Rate : 0.2848          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.99            
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9955   0.9908   0.9812   0.9922   0.9986
## Specificity            0.9989   0.9981   0.9978   0.9970   0.9984
## Pos Pred Value         0.9973   0.9921   0.9898   0.9844   0.9931
## Neg Pred Value         0.9982   0.9978   0.9960   0.9985   0.9997
## Prevalence             0.2848   0.1938   0.1759   0.1627   0.1828
## Detection Rate         0.2835   0.1920   0.1726   0.1614   0.1826
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9972   0.9944   0.9895   0.9946   0.9985
conMtxPlot(confusionMatrix(tr4testp$classe, predict(fit_rf_par_rs, tr4testp))$table)