Summary

With various wearable devices, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data in this project come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants (http://groupware.les.inf.puc-rio.br/har), who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which they did the exercise, which is the “classe” variable in the training set.

For cross validation I use random subsampling into training and validation sets. I try 3 different prediction methods, namely Decision tree, Random forest, and Gradient boosting machine (gbm), with and without principal component analysis (pca). The results show that Random forest is the best method on this data, with a validation accuracy of 99%.

For this project, I found that no figure is necessary to convey the model information, except possibly a plot of the decision tree. However, that method turns out to perform worst, so in the end I decided not to include any figure for the training process.

Loading libraries and data

set.seed(12345)
library(caret); library(e1071); library(randomForest); library(gbm)
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
# Load training data
tr <- read.csv("pml-training.csv", na.strings = c("NA",""," "))

Exploratory analysis and data cleaning

Now let’s have a look at the data:

dim(tr)
## [1] 19622   160
head(tr[,5:15],3)   # first few rows of columns 5 to 15
##     cvtd_timestamp new_window num_window roll_belt pitch_belt yaw_belt
## 1 05/12/2011 11:23         no         11      1.41       8.07    -94.4
## 2 05/12/2011 11:23         no         11      1.41       8.07    -94.4
## 3 05/12/2011 11:23         no         11      1.42       8.07    -94.4
##   total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 1                3               <NA>                <NA>
## 2                3               <NA>                <NA>
## 3                3               <NA>                <NA>
##   kurtosis_yaw_belt skewness_roll_belt
## 1              <NA>               <NA>
## 2              <NA>               <NA>
## 3              <NA>               <NA>

The data has 19622 observations of 160 variables (the last one is the outcome). Some variables contain many NAs. I also notice that new_window has 406 “yes” values:

summary(tr$new_window)
##    no   yes 
## 19216   406

and this corresponds to the number of NAs in other variables, for example:

summary(tr$kurtosis_yaw_belt)
## #DIV/0!    NA's 
##     406   19216

(19622 - 406 = 19216). Therefore, my first step is to remove these columns:

na_count <- sapply(tr, function(x) sum(is.na(x)))   # number of NAs in each column
skipcol <- na_count == 19216                        # columns that are NA everywhere except the 406 new_window rows
tr1 <- tr[, !skipcol]                               # drop those columns
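
A quick check (a sketch, not part of the original pipeline) confirms that every column is expected to contain either no NAs at all or exactly 19216, so the threshold above cleanly separates the two groups:

table(na_count)   # expected to show only the counts 0 and 19216 for this dataset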

The data now has 60 variables. I now plot every column to see whether any variable could be treated as a factor. A boxplot is useful for detecting outliers, but a simple plot of each column shows outliers even better and also indicates which variables are factorable:

par(mfrow = c(6,10), mar=c(1,1,1,1))
for (i in 1:60)
{
    plot(tr1[,i], pch = 20, col = "blue")
}

We can see some outliers in the data in columns 38, 39, 40, 45, 51, 52, and 53. I remove the observations that lie beyond the midpoint between the 75th percentile and the maximum (or, for left-tail outliers, below the midpoint between the minimum and the 25th percentile), then drop the first 7 variables, which have nothing to do with prediction (index, timestamp, and window information):

# Flag outliers: points below the midpoint of (min, 25th percentile) or above the
# midpoint of (75th percentile, max) in each flagged column
ol38<-as.numeric(tr1$gyros_dumbbell_x<(quantile(tr1$gyros_dumbbell_x)[1]+quantile(tr1$gyros_dumbbell_x)[2])/2)
ol39<-as.numeric(tr1$gyros_dumbbell_y>(quantile(tr1$gyros_dumbbell_y)[4]+quantile(tr1$gyros_dumbbell_y)[5])/2)
ol40<-as.numeric(tr1$gyros_dumbbell_z>(quantile(tr1$gyros_dumbbell_z)[4]+quantile(tr1$gyros_dumbbell_z)[5])/2)
ol45<-as.numeric(tr1$magnet_dumbbell_y<(quantile(tr1$magnet_dumbbell_y)[1]+quantile(tr1$magnet_dumbbell_y)[2])/2)
ol51<-as.numeric(tr1$gyros_forearm_x<(quantile(tr1$gyros_forearm_x)[1]+quantile(tr1$gyros_forearm_x)[2])/2)
ol52<-as.numeric(tr1$gyros_forearm_y>(quantile(tr1$gyros_forearm_y)[4]+quantile(tr1$gyros_forearm_y)[5])/2)
ol53<-as.numeric(tr1$gyros_forearm_z>(quantile(tr1$gyros_forearm_z)[4]+quantile(tr1$gyros_forearm_z)[5])/2)
skiprow <- ol38 + ol39 + ol40 + ol45 + ol51 + ol52 + ol53

tr2 <- tr1[-which(skiprow > 0),]
tr4 <- tr2[,-c(1:7)]

dim(tr4)
## [1] 19620    53

The data now has 19620 observations of 53 variables (the last column is the outcome).

Cross validation and data preprocessing

I divide the training data into 2 parts: a training set and a validation set. For simplicity I use random subsampling instead of k-fold cross validation (a k-fold alternative is sketched after the preprocessing step below). Since the sample size is moderate (about 20000), I split the data 70% / 30% between training and validation.

inTrain <- createDataPartition(y=tr4$classe, p=0.7, list=FALSE)
tr4tr <- tr4[inTrain, ]
tr4val <- tr4[-inTrain, ]

Now I normalize (center and scale) the data:

preprocess <- preProcess(tr4tr, method = c("center", "scale"))   # build the preprocess "template"
tr4trp <- predict(preprocess, tr4tr)                             # apply the template on the training data
tr4valp <- predict(preprocess, tr4val)                           # apply the template on the validation data
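
For reference, caret can also perform k-fold cross validation internally during training; a minimal sketch of that alternative (not used in this report, and the fit itself is left commented out) is:

# Alternative resampling scheme: 5-fold cross validation handled by train() itself
ctrl_cv <- trainControl(method = "cv", number = 5)
# fit_rf_cv <- train(classe ~ ., data = tr4trp, method = "rf", trControl = ctrl_cv)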

Strategy

I will train different models using the training set. Three models will be tested: a Decision tree (rpart), a Random forest (rf), and a Gradient boosting machine (gbm).

The trained models are then applied to the validation set, and I pick the best model based on its validation accuracy. Please note that model optimization and tuning are not the focus of this project, so most of the model parameters are kept at their defaults.

Because training is time consuming, I extend the project by applying PCA before model training. The PCA transformation is fitted on the training data so that 90% of the variance is preserved, and the resulting principal components are then used to train the best methods selected before.

At the end, I will compare all methods in terms of accuracy and training time; the best one will be applied to the test set for prediction.
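
The comparison itself can be summarized in a few lines; a minimal sketch, using the model objects fitted in the sections below (fit_dt, fit_rf, fit_gbm) and the normalized validation set tr4valp:

# Collect the validation accuracy of each fitted model in one place
models <- list(dt = fit_dt, rf = fit_rf, gbm = fit_gbm)
sapply(models, function(m) confusionMatrix(predict(m, tr4valp), tr4valp$classe)$overall["Accuracy"])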

Train and validate models without PCA

Decision tree

# Training
start_time <- Sys.time()
fit_dt <- train(classe ~ ., data=tr4trp, method="rpart", trControl=trainControl(method="none"), tuneGrid=data.frame(cp=0.01))
Sys.time() - start_time
## Time difference of 1.003906 secs
# Validating
predTr_dt <- predict(fit_dt, tr4valp)
confusionMatrix(predTr_dt, tr4valp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1498  177   24   53   14
##          B   51  671   90   81   91
##          C   46  161  798  133  133
##          D   22   83   79  639   54
##          E   56   46   35   58  790
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7472          
##                  95% CI : (0.7359, 0.7583)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6798          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8954   0.5896   0.7778   0.6629   0.7301
## Specificity            0.9363   0.9340   0.9026   0.9516   0.9594
## Pos Pred Value         0.8482   0.6819   0.6279   0.7286   0.8020
## Neg Pred Value         0.9575   0.9047   0.9506   0.9351   0.9404
## Prevalence             0.2844   0.1934   0.1744   0.1639   0.1839
## Detection Rate         0.2546   0.1141   0.1356   0.1086   0.1343
## Detection Prevalence   0.3002   0.1673   0.2160   0.1491   0.1674
## Balanced Accuracy      0.9159   0.7618   0.8402   0.8072   0.8448

Random forest

# Training
start_time <- Sys.time()
fit_rf <- train(classe ~ ., data = tr4trp, method = "rf")
Sys.time() - start_time
## Time difference of 52.32324 mins
# Validating
predTr_rf <- predict(fit_rf, tr4valp)
confusionMatrix(predTr_rf, tr4valp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1668   16    0    0    0
##          B    4 1119   17    0    0
##          C    0    3 1006   24    0
##          D    1    0    3  940    4
##          E    0    0    0    0 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9878          
##                  95% CI : (0.9846, 0.9904)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9845          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9833   0.9805   0.9751   0.9963
## Specificity            0.9962   0.9956   0.9944   0.9984   1.0000
## Pos Pred Value         0.9905   0.9816   0.9739   0.9916   1.0000
## Neg Pred Value         0.9988   0.9960   0.9959   0.9951   0.9992
## Prevalence             0.2844   0.1934   0.1744   0.1639   0.1839
## Detection Rate         0.2835   0.1902   0.1710   0.1598   0.1832
## Detection Prevalence   0.2862   0.1938   0.1756   0.1611   0.1832
## Balanced Accuracy      0.9966   0.9894   0.9875   0.9867   0.9982

GBM

# Training
start_time <- Sys.time()
fit_gbm <- train(classe ~ ., data = tr4trp, method="gbm", verbose=FALSE)
Sys.time() - start_time
## Time difference of 22.97172 mins
# Validation
predTr_gbm <- predict(fit_gbm, tr4valp)
confusionMatrix(predTr_gbm, tr4valp$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1651   37    0    4    2
##          B   12 1060   32    4   11
##          C    5   38  976   36    9
##          D    4    3   18  916   16
##          E    1    0    0    4 1044
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9599          
##                  95% CI : (0.9546, 0.9648)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9492          
##  Mcnemar's Test P-Value : 1.168e-07       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9868   0.9315   0.9513   0.9502   0.9649
## Specificity            0.9898   0.9876   0.9819   0.9917   0.9990
## Pos Pred Value         0.9746   0.9473   0.9173   0.9572   0.9952
## Neg Pred Value         0.9947   0.9836   0.9896   0.9903   0.9921
## Prevalence             0.2844   0.1934   0.1744   0.1639   0.1839
## Detection Rate         0.2806   0.1802   0.1659   0.1557   0.1775
## Detection Prevalence   0.2879   0.1902   0.1809   0.1627   0.1783
## Balanced Accuracy      0.9883   0.9595   0.9666   0.9709   0.9819

Interpretation

Comparing the 3 methods, the validation accuracies of the Decision tree, Random forest, and GBM are 75%, 99%, and 96%, respectively. The expected out-of-sample errors are therefore about 25%, 1%, and 4%, respectively.
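
The out-of-sample error estimate here is simply one minus the validation accuracy; for example, for the Random forest fit above:

# Estimated out-of-sample error for the Random forest model
1 - confusionMatrix(predTr_rf, tr4valp$classe)$overall["Accuracy"]   # about 0.012, given the 98.78% accuracy above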

Based on these results, Random forest is the most accurate method. However, it is also the most time consuming (~53 minutes), compared to the Decision tree (~1 second) and GBM (~23 minutes).

The next step of my analysis is to apply PCA with the best 2 methods (Random forest and GBM), then train the models again to see how they perform.

Note: a figure of the decision tree could be shown, but since that model is not used, there is no need to include it.
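
If a figure were wanted, the tree could be drawn directly from the fitted model; a minimal sketch, assuming the rpart.plot package (an extra dependency not loaded above) is available:

# Optional: visualize the decision tree (not included in this report)
library(rpart.plot)
rpart.plot(fit_dt$finalModel)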

Train and validate Random forest and GBM with PCA

PCA processing

preprocessPCA <- preProcess(tr4tr[,-53], method = "pca", thresh = 0.9)
preprocessPCA
## Created from 13737 samples and 52 variables
## 
## Pre-processing:
##   - centered (52)
##   - ignored (0)
##   - principal component signal extraction (52)
##   - scaled (52)
## 
## PCA needed 20 components to capture 90 percent of the variance
trainPCA <- predict(preprocessPCA, tr4tr[,-53])
valPCA <- predict(preprocessPCA,tr4val[,-53])

The analysis suggests that with 90% variance coverage, only 20 principal components (instead of the original 52 variables) are needed for training.
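
As a quick cross-check (a sketch, outside the caret pipeline), the same number can be recovered with base R's prcomp by looking at the cumulative proportion of variance explained:

# First principal component index at which >= 90% of the variance is captured
pcs <- prcomp(tr4tr[, -53], center = TRUE, scale. = TRUE)
which(cumsum(pcs$sdev^2) / sum(pcs$sdev^2) >= 0.9)[1]   # expected to be 20, matching preProcess above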

Random forest

start_time <- Sys.time()
fit_pca_rf <- train(x = trainPCA, y = tr4tr$classe, method = "rf", verbose = FALSE)
Sys.time() - start_time
## Time difference of 24.07629 mins
confusionMatrix(tr4val$classe,predict(fit_pca_rf, valPCA))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1655    8    5    4    1
##          B   29 1095   14    0    0
##          C    0   17  990   17    2
##          D    3    1   46  908    6
##          E    1   12    5    9 1055
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9694          
##                  95% CI : (0.9647, 0.9737)
##     No Information Rate : 0.2869          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9613          
##  Mcnemar's Test P-Value : 1.704e-06       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9805   0.9665   0.9340   0.9680   0.9915
## Specificity            0.9957   0.9909   0.9925   0.9887   0.9944
## Pos Pred Value         0.9892   0.9622   0.9649   0.9419   0.9750
## Neg Pred Value         0.9922   0.9920   0.9856   0.9939   0.9981
## Prevalence             0.2869   0.1926   0.1802   0.1594   0.1809
## Detection Rate         0.2813   0.1861   0.1683   0.1543   0.1793
## Detection Prevalence   0.2844   0.1934   0.1744   0.1639   0.1839
## Balanced Accuracy      0.9881   0.9787   0.9632   0.9783   0.9930

GBM

start_time <- Sys.time()
fit_pca_gbm <- train(x = trainPCA, y = tr4tr$classe, method = "gbm", verbose = FALSE)
Sys.time() - start_time
## Time difference of 12.14939 mins
confusionMatrix(tr4val$classe,predict(fit_pca_gbm, valPCA))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1458   67   47   84   17
##          B  129  852   92   25   40
##          C   68   77  827   40   14
##          D   58   30  107  741   28
##          E   57   97   60   59  809
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7967          
##                  95% CI : (0.7862, 0.8069)
##     No Information Rate : 0.3009          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7424          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8237   0.7587   0.7299   0.7808   0.8910
## Specificity            0.9477   0.9399   0.9581   0.9548   0.9451
## Pos Pred Value         0.8715   0.7487   0.8060   0.7687   0.7477
## Neg Pred Value         0.9259   0.9429   0.9370   0.9577   0.9794
## Prevalence             0.3009   0.1909   0.1926   0.1613   0.1543
## Detection Rate         0.2478   0.1448   0.1406   0.1260   0.1375
## Detection Prevalence   0.2844   0.1934   0.1744   0.1639   0.1839
## Balanced Accuracy      0.8857   0.8493   0.8440   0.8678   0.9180

Interpretation

The results show that with fewer than half as many features (20 principal components instead of 52 variables), the training time is almost halved: ~24 minutes for Random forest and ~13 minutes for GBM. The validation accuracy drops to 97% and 80%, respectively. With PCA, Random forest performs similarly to GBM without PCA.

In practice, choosing between Random forest with or without PCA depends on the project (sample size, model update requirements, …). For this project, even though I did not perform any deep parameter tuning, I pick Random forest without PCA as my final model.

Prediction on Test set

As the training set is normalized prior to training, the test set needs to be processed in the same way (dropping the same columns and applying the same centering/scaling template):

# Process
ts <- read.csv("pml-testing.csv", na.strings = c("NA",""," "))
ts1 <- ts[, !skipcol]                 # drop the same NA-dominated columns as in training
ts4 <- ts1[,-c(1:7)]                  # drop the index/timestamp/window columns
ts4valp <- predict(preprocess, ts4)   # apply the centering/scaling template built on the training set
# Prediction
predTs <- predict(fit_rf, ts4valp)
predTs
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E