With various wearable devices, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data in this project come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants (http://groupware.les.inf.puc-rio.br/har), who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which they did the exercise; this is the “classe” variable in the training set.
My cross-validation approach is random subsampling into a training set and a validation set. I try 3 different prediction methods, namely Decision tree, Random forest, and Gradient boosting machine (GBM), with and without principal component analysis (PCA). The results show that Random forest is the best method on this data, with a validation accuracy of 99%.
For this project, I found that no figure is necessary for conveying the model information, except possibly the decision tree. However, since that method turns out to perform the worst, I decided in the end not to include any figure for the training process.
set.seed(12345)
library(caret); library(e1071); library(randomForest); library(gbm)
## Loading required package: lattice
## Loading required package: ggplot2
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
# Load training data
tr <- read.csv("pml-training.csv", na.strings = c("NA",""," "))
Now let’s have a look at the data:
dim(tr)
## [1] 19622 160
head(tr[,5:15],3) # the first few rows and columns
## cvtd_timestamp new_window num_window roll_belt pitch_belt yaw_belt
## 1 05/12/2011 11:23 no 11 1.41 8.07 -94.4
## 2 05/12/2011 11:23 no 11 1.41 8.07 -94.4
## 3 05/12/2011 11:23 no 11 1.42 8.07 -94.4
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 1 3 <NA> <NA>
## 2 3 <NA> <NA>
## 3 3 <NA> <NA>
## kurtosis_yaw_belt skewness_roll_belt
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
The data has 19622 observations of 160 variables (the last one is the outcome). Some variables contain many NAs. I also notice that new_window has 406 “yes” values:
summary(tr$new_window)
## no yes
## 19216 406
and this matches the pattern of missing values in many other variables, for example:
summary(tr$kurtosis_yaw_belt)
## #DIV/0! NA's
## 406 19216
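As a quick double-check (a sketch, not part of the original output), one could cross-tabulate the window indicator against the missingness of one of these derived columns; the non-missing values are expected to fall only in the new_window == "yes" rows:
# Sketch: non-missing values of the summary columns should coincide with new_window == "yes"
table(tr$new_window, is.na(tr$kurtosis_yaw_belt))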
Indeed, 19622 - 406 = 19216. Therefore, my first step is to remove these mostly-empty columns:
na_count <- sapply(tr, function(x) sum(is.na(x)))  # number of NAs in each column
skipcol <- na_count == 19216                       # columns with 19216 NAs (empty except in new_window rows)
tr1 <- tr[, !skipcol]                              # drop those columns
The data now has 60 variables. Next, I plot every column to see whether any variable could be factorized. A boxplot is useful for detecting outliers, but a simple index plot shows outliers even more clearly and also indicates which variables could be treated as factors:
par(mfrow = c(6, 10), mar = c(1, 1, 1, 1))  # grid of 60 small panels
for (i in 1:60) {
  plot(tr1[, i], pch = 20, col = "blue")    # index plot of each remaining column
}
We can see some outliers in the data at columns 38, 39, 40, 45, 51, 52, 53. I will remove the observations whose values lie beyond the midpoint between the nearest quartile and the extreme value (i.e., below (min + Q1)/2 or above (Q3 + max)/2), then remove the first 7 variables, which carry no predictive information (row index, user name, timestamps, and window indicators).
# quantile(x)[1], [2], [4], [5] are the minimum, Q1, Q3, and maximum of x, respectively
ol38 <- as.numeric(tr1$gyros_dumbbell_x < (quantile(tr1$gyros_dumbbell_x)[1] + quantile(tr1$gyros_dumbbell_x)[2]) / 2)
ol39 <- as.numeric(tr1$gyros_dumbbell_y > (quantile(tr1$gyros_dumbbell_y)[4] + quantile(tr1$gyros_dumbbell_y)[5]) / 2)
ol40 <- as.numeric(tr1$gyros_dumbbell_z > (quantile(tr1$gyros_dumbbell_z)[4] + quantile(tr1$gyros_dumbbell_z)[5]) / 2)
ol45 <- as.numeric(tr1$magnet_dumbbell_y < (quantile(tr1$magnet_dumbbell_y)[1] + quantile(tr1$magnet_dumbbell_y)[2]) / 2)
ol51 <- as.numeric(tr1$gyros_forearm_x < (quantile(tr1$gyros_forearm_x)[1] + quantile(tr1$gyros_forearm_x)[2]) / 2)
ol52 <- as.numeric(tr1$gyros_forearm_y > (quantile(tr1$gyros_forearm_y)[4] + quantile(tr1$gyros_forearm_y)[5]) / 2)
ol53 <- as.numeric(tr1$gyros_forearm_z > (quantile(tr1$gyros_forearm_z)[4] + quantile(tr1$gyros_forearm_z)[5]) / 2)
skiprow <- ol38 + ol39 + ol40 + ol45 + ol51 + ol52 + ol53  # flag rows containing any outlier
tr2 <- tr1[-which(skiprow > 0), ]                          # drop flagged rows
tr4 <- tr2[, -c(1:7)]                                      # drop the first 7 non-predictive columns
dim(tr4)
## [1] 19620 53
The data now has 19620 observations of 53 variables (the last column is the outcome).
I divide the training data into two parts: a training set and a validation set. For simplicity I use random subsampling instead of k-fold cross-validation (a k-fold sketch is shown right after the split code below). Since the sample size is moderate (about 20000 observations), I split the data 70% for training and 30% for validation.
inTrain <- createDataPartition(y=tr4$classe, p=0.7, list=FALSE)
tr4tr <- tr4[inTrain, ]
tr4val <- tr4[-inTrain, ]
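For reference only, the k-fold alternative mentioned above could be set up through caret's trainControl and passed to train; this is just a sketch and is not used in this project (the object name ctrl_cv is illustrative):
ctrl_cv <- trainControl(method = "cv", number = 5)  # 5-fold cross-validation
# e.g. train(classe ~ ., data = tr4, method = "rf", trControl = ctrl_cv)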
Now I normalize (center and scale) the data:
preprocess <- preProcess(tr4tr, method = c("center", "scale")) # build the preprocess "template"
tr4trp <- predict(preprocess, tr4tr) # apply the template on the training data
tr4valp <- predict(preprocess, tr4val) # apply the template on the validation data
I will train different models using the training set. Three models will be tested:
Decision tree
Random forest
Gradient boosting machine (gbm).
These trained models are then applied to the validation set, and I will pick the best model based on its accuracy on the validation set. Please note that model optimization and tuning are not the focus of this project, so most of the model parameters are kept at their defaults.
Because training is time-consuming, I extend the project by applying PCA before model training. The PC transformation is applied to the training data so that 90% of the variance is preserved; the transformed components are then used for training with the best methods selected before.
At the end, I will compare all methods in terms of accuracy and training time. The best one will be applied to the test set for prediction.
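As noted above, parameter tuning is out of scope; for reference only, explicitly tuning the random forest's mtry parameter in caret could look like the sketch below (the candidate values and object names are illustrative, not taken from this project):
# Illustrative only (not run): replace caret's default small mtry grid with explicit candidates
grid_rf <- expand.grid(mtry = c(2, 7, 14, 27))
# fit_rf_tuned <- train(classe ~ ., data = tr4trp, method = "rf", tuneGrid = grid_rf)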
# Training
start_time <- Sys.time()
fit_dt <- train(classe ~ ., data=tr4trp, method="rpart", trControl=trainControl(method="none"), tuneGrid=data.frame(cp=0.01))
Sys.time() - start_time
## Time difference of 1.003906 secs
# Validating
predTr_dt <- predict(fit_dt, tr4valp)
confusionMatrix(predTr_dt, tr4valp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1498 177 24 53 14
## B 51 671 90 81 91
## C 46 161 798 133 133
## D 22 83 79 639 54
## E 56 46 35 58 790
##
## Overall Statistics
##
## Accuracy : 0.7472
## 95% CI : (0.7359, 0.7583)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6798
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8954 0.5896 0.7778 0.6629 0.7301
## Specificity 0.9363 0.9340 0.9026 0.9516 0.9594
## Pos Pred Value 0.8482 0.6819 0.6279 0.7286 0.8020
## Neg Pred Value 0.9575 0.9047 0.9506 0.9351 0.9404
## Prevalence 0.2844 0.1934 0.1744 0.1639 0.1839
## Detection Rate 0.2546 0.1141 0.1356 0.1086 0.1343
## Detection Prevalence 0.3002 0.1673 0.2160 0.1491 0.1674
## Balanced Accuracy 0.9159 0.7618 0.8402 0.8072 0.8448
# Training
start_time <- Sys.time()
fit_rf <- train(classe ~ ., data = tr4trp, method = "rf")
Sys.time() - start_time
## Time difference of 52.32324 mins
# Validating
predTr_rf <- predict(fit_rf, tr4valp)
confusionMatrix(predTr_rf, tr4valp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1668 16 0 0 0
## B 4 1119 17 0 0
## C 0 3 1006 24 0
## D 1 0 3 940 4
## E 0 0 0 0 1078
##
## Overall Statistics
##
## Accuracy : 0.9878
## 95% CI : (0.9846, 0.9904)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9845
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9833 0.9805 0.9751 0.9963
## Specificity 0.9962 0.9956 0.9944 0.9984 1.0000
## Pos Pred Value 0.9905 0.9816 0.9739 0.9916 1.0000
## Neg Pred Value 0.9988 0.9960 0.9959 0.9951 0.9992
## Prevalence 0.2844 0.1934 0.1744 0.1639 0.1839
## Detection Rate 0.2835 0.1902 0.1710 0.1598 0.1832
## Detection Prevalence 0.2862 0.1938 0.1756 0.1611 0.1832
## Balanced Accuracy 0.9966 0.9894 0.9875 0.9867 0.9982
# Training
start_time <- Sys.time()
fit_gbm <- train(classe ~ ., data = tr4trp, method="gbm", verbose=FALSE)
Sys.time() - start_time
## Time difference of 22.97172 mins
# Validation
predTr_gbm <- predict(fit_gbm, tr4valp)
confusionMatrix(predTr_gbm, tr4valp$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1651 37 0 4 2
## B 12 1060 32 4 11
## C 5 38 976 36 9
## D 4 3 18 916 16
## E 1 0 0 4 1044
##
## Overall Statistics
##
## Accuracy : 0.9599
## 95% CI : (0.9546, 0.9648)
## No Information Rate : 0.2844
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9492
## Mcnemar's Test P-Value : 1.168e-07
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9868 0.9315 0.9513 0.9502 0.9649
## Specificity 0.9898 0.9876 0.9819 0.9917 0.9990
## Pos Pred Value 0.9746 0.9473 0.9173 0.9572 0.9952
## Neg Pred Value 0.9947 0.9836 0.9896 0.9903 0.9921
## Prevalence 0.2844 0.1934 0.1744 0.1639 0.1839
## Detection Rate 0.2806 0.1802 0.1659 0.1557 0.1775
## Detection Prevalence 0.2879 0.1902 0.1809 0.1627 0.1783
## Balanced Accuracy 0.9883 0.9595 0.9666 0.9709 0.9819
Comparing the 3 methods, the validation accuracies of Decision tree, Random forest, and GBM are 75%, 99%, and 96%, respectively. The expected out-of-sample errors are therefore about 25%, 1%, and 4%, respectively.
Based on these results, Random forest is the most accurate method. However, it is also the most time-consuming one (~53 minutes), compared to Decision tree (~1 second) and GBM (~23 minutes).
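For completeness, a quick sketch (not part of the original code) that computes the estimated out-of-sample error as 1 minus the validation accuracy of each model:
acc_dt  <- confusionMatrix(predTr_dt,  tr4valp$classe)$overall[["Accuracy"]]
acc_rf  <- confusionMatrix(predTr_rf,  tr4valp$classe)$overall[["Accuracy"]]
acc_gbm <- confusionMatrix(predTr_gbm, tr4valp$classe)$overall[["Accuracy"]]
round(1 - c(dt = acc_dt, rf = acc_rf, gbm = acc_gbm), 3)  # roughly 0.25, 0.01, 0.04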
The next step of my analysis is to apply PCA with the best 2 methods (Random forest and GBM), then train the models again to see how they perform.
Note: I could show a figure of the decision tree, but since I do not use that model, there is no need to include it.
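For reference only, if one did want to draw the fitted tree, a minimal sketch with base graphics (not included in this report) would be:
plot(fit_dt$finalModel, uniform = TRUE)  # tree structure of the rpart model inside the caret fit
text(fit_dt$finalModel, cex = 0.6)       # add split labels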
preprocessPCA <- preProcess(tr4tr[,-53], method = "pca", thresh = 0.9)
preprocessPCA
## Created from 13737 samples and 52 variables
##
## Pre-processing:
## - centered (52)
## - ignored (0)
## - principal component signal extraction (52)
## - scaled (52)
##
## PCA needed 20 components to capture 90 percent of the variance
trainPCA <- predict(preprocessPCA, tr4tr[,-53])
valPCA <- predict(preprocessPCA,tr4val[,-53])
The analysis shows that, to preserve 90% of the variance, only 20 principal components (instead of 52 variables) are needed for training.
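As a cross-check (a sketch, not part of the original analysis), the cumulative variance explained can also be inspected directly with prcomp:
pc <- prcomp(tr4tr[, -53], center = TRUE, scale. = TRUE)  # same centering/scaling as preProcess
cumvar <- cumsum(pc$sdev^2) / sum(pc$sdev^2)              # cumulative proportion of variance explained
which(cumvar >= 0.9)[1]                                   # number of components needed to reach 90%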
start_time <- Sys.time()
fit_pca_rf <- train(x = trainPCA, y = tr4tr$classe, method = "rf", verbose = FALSE)
Sys.time() - start_time
## Time difference of 24.07629 mins
confusionMatrix(tr4val$classe,predict(fit_pca_rf, valPCA))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1655 8 5 4 1
## B 29 1095 14 0 0
## C 0 17 990 17 2
## D 3 1 46 908 6
## E 1 12 5 9 1055
##
## Overall Statistics
##
## Accuracy : 0.9694
## 95% CI : (0.9647, 0.9737)
## No Information Rate : 0.2869
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9613
## Mcnemar's Test P-Value : 1.704e-06
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9805 0.9665 0.9340 0.9680 0.9915
## Specificity 0.9957 0.9909 0.9925 0.9887 0.9944
## Pos Pred Value 0.9892 0.9622 0.9649 0.9419 0.9750
## Neg Pred Value 0.9922 0.9920 0.9856 0.9939 0.9981
## Prevalence 0.2869 0.1926 0.1802 0.1594 0.1809
## Detection Rate 0.2813 0.1861 0.1683 0.1543 0.1793
## Detection Prevalence 0.2844 0.1934 0.1744 0.1639 0.1839
## Balanced Accuracy 0.9881 0.9787 0.9632 0.9783 0.9930
start_time <- Sys.time()
fit_pca_gbm <- train(x = trainPCA, y = tr4tr$classe, method = "gbm", verbose = FALSE)
Sys.time() - start_time
## Time difference of 12.14939 mins
confusionMatrix(tr4val$classe,predict(fit_pca_gbm, valPCA))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1458 67 47 84 17
## B 129 852 92 25 40
## C 68 77 827 40 14
## D 58 30 107 741 28
## E 57 97 60 59 809
##
## Overall Statistics
##
## Accuracy : 0.7967
## 95% CI : (0.7862, 0.8069)
## No Information Rate : 0.3009
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7424
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8237 0.7587 0.7299 0.7808 0.8910
## Specificity 0.9477 0.9399 0.9581 0.9548 0.9451
## Pos Pred Value 0.8715 0.7487 0.8060 0.7687 0.7477
## Neg Pred Value 0.9259 0.9429 0.9370 0.9577 0.9794
## Prevalence 0.3009 0.1909 0.1926 0.1613 0.1543
## Detection Rate 0.2478 0.1448 0.1406 0.1260 0.1375
## Detection Prevalence 0.2844 0.1934 0.1744 0.1639 0.1839
## Balanced Accuracy 0.8857 0.8493 0.8440 0.8678 0.9180
The results show that with fewer than half of the original features (20 principal components instead of 52 variables), training time is roughly halved: ~24 minutes for Random forest and ~13 minutes for GBM. The validation accuracy drops to 97% and 80%, respectively. With PCA, Random forest performs about as well as GBM without PCA.
In practice, the choice between Random forest with or without PCA depends on the project (sample size, model update requirements, …). For this project, even though I did not perform any deep parameter tuning, I pick Random forest without PCA as my final model.
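For convenience, a small summary of the runs above (a sketch; accuracies and timings are transcribed from the outputs and are approximate):
results <- data.frame(
  model         = c("Decision tree", "Random forest", "GBM", "PCA + Random forest", "PCA + GBM"),
  accuracy      = c(0.747, 0.988, 0.960, 0.969, 0.797),
  train_minutes = c(0.02, 52, 23, 24, 12)
)
results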
As the training set was normalized prior to training, the test set needs to be processed in the same way.
# Process
ts <- read.csv("pml-testing.csv", na.strings = c("NA",""," "))
ts1 <- ts[, !skipcol]  # drop the same mostly-NA columns as in training
ts4 <- ts1[,-c(1:7)]
ts4valp <- predict(preprocess, ts4)
# Prediction
predTs <- predict(fit_rf, ts4valp)
predTs
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E