With various wearable devices, it is now possible to collect a large amount of data about personal activity relatively inexpensively. The data in this project come from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants (http://groupware.les.inf.puc-rio.br/har), who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which they did the exercise; this is the "classe" variable in the training set. The original work was done by Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H., 2013.
The project includes some intuitions about the data, exploratory analysis, the actual prediction, and interpretation.
My cross-validation approach uses random subsampling into training and validation sets. I try 3 different prediction methods, namely Decision tree, Random forest, and Gradient boosting machine (GBM), with and without principal component analysis (PCA). The results show that Random forest is the best method on this data, with a validation accuracy of 99%.
The data has 5 classes:
A: correct movement (exactly according to the specification)
B: throwing the elbows to the front
C: lifting the dumbbell only halfway
D: lowering the dumbbell only halfway
E: throwing the hips to the front
The 4 sensors are attached to: arm (bicep/tricep), forearm (on glove, close to the wrist), belt, and weight (dumbbell).
My thoughts:
since the forearm and weight sensors are close together, their measurements should be similar (except that the weight measurements contain more torque/twist movement)
class E should be separable by the belt (hip) sensor measurements
class B should be separable by the arm measurements (X, Y, Z)
classes A, C, D should be separable by the forearm and weight measurements
candidate classification methods: tree-based models and SVM (k-NN, and perhaps more complex neural networks, later)
Loading libraries and data
set.seed(12345)
library(caret); library(e1071); library(gbm); library(randomForest)
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.5.1
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(parallel); library(doParallel)
## Warning: package 'doParallel' was built under R version 3.5.1
## Loading required package: foreach
## Loading required package: iterators
tr <- read.csv("pml-training.csv", na.strings = c("NA",""," "))
Now let’s have a look at the data:
names(tr)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt"
## [9] "pitch_belt" "yaw_belt"
## [11] "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt"
## [15] "skewness_roll_belt" "skewness_roll_belt.1"
## [17] "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
## [21] "min_roll_belt" "min_pitch_belt"
## [23] "min_yaw_belt" "amplitude_roll_belt"
## [25] "amplitude_pitch_belt" "amplitude_yaw_belt"
## [27] "var_total_accel_belt" "avg_roll_belt"
## [29] "stddev_roll_belt" "var_roll_belt"
## [31] "avg_pitch_belt" "stddev_pitch_belt"
## [33] "var_pitch_belt" "avg_yaw_belt"
## [35] "stddev_yaw_belt" "var_yaw_belt"
## [37] "gyros_belt_x" "gyros_belt_y"
## [39] "gyros_belt_z" "accel_belt_x"
## [41] "accel_belt_y" "accel_belt_z"
## [43] "magnet_belt_x" "magnet_belt_y"
## [45] "magnet_belt_z" "roll_arm"
## [47] "pitch_arm" "yaw_arm"
## [49] "total_accel_arm" "var_accel_arm"
## [51] "avg_roll_arm" "stddev_roll_arm"
## [53] "var_roll_arm" "avg_pitch_arm"
## [55] "stddev_pitch_arm" "var_pitch_arm"
## [57] "avg_yaw_arm" "stddev_yaw_arm"
## [59] "var_yaw_arm" "gyros_arm_x"
## [61] "gyros_arm_y" "gyros_arm_z"
## [63] "accel_arm_x" "accel_arm_y"
## [65] "accel_arm_z" "magnet_arm_x"
## [67] "magnet_arm_y" "magnet_arm_z"
## [69] "kurtosis_roll_arm" "kurtosis_picth_arm"
## [71] "kurtosis_yaw_arm" "skewness_roll_arm"
## [73] "skewness_pitch_arm" "skewness_yaw_arm"
## [75] "max_roll_arm" "max_picth_arm"
## [77] "max_yaw_arm" "min_roll_arm"
## [79] "min_pitch_arm" "min_yaw_arm"
## [81] "amplitude_roll_arm" "amplitude_pitch_arm"
## [83] "amplitude_yaw_arm" "roll_dumbbell"
## [85] "pitch_dumbbell" "yaw_dumbbell"
## [87] "kurtosis_roll_dumbbell" "kurtosis_picth_dumbbell"
## [89] "kurtosis_yaw_dumbbell" "skewness_roll_dumbbell"
## [91] "skewness_pitch_dumbbell" "skewness_yaw_dumbbell"
## [93] "max_roll_dumbbell" "max_picth_dumbbell"
## [95] "max_yaw_dumbbell" "min_roll_dumbbell"
## [97] "min_pitch_dumbbell" "min_yaw_dumbbell"
## [99] "amplitude_roll_dumbbell" "amplitude_pitch_dumbbell"
## [101] "amplitude_yaw_dumbbell" "total_accel_dumbbell"
## [103] "var_accel_dumbbell" "avg_roll_dumbbell"
## [105] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [107] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [109] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [111] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [113] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [115] "gyros_dumbbell_z" "accel_dumbbell_x"
## [117] "accel_dumbbell_y" "accel_dumbbell_z"
## [119] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [121] "magnet_dumbbell_z" "roll_forearm"
## [123] "pitch_forearm" "yaw_forearm"
## [125] "kurtosis_roll_forearm" "kurtosis_picth_forearm"
## [127] "kurtosis_yaw_forearm" "skewness_roll_forearm"
## [129] "skewness_pitch_forearm" "skewness_yaw_forearm"
## [131] "max_roll_forearm" "max_picth_forearm"
## [133] "max_yaw_forearm" "min_roll_forearm"
## [135] "min_pitch_forearm" "min_yaw_forearm"
## [137] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [139] "amplitude_yaw_forearm" "total_accel_forearm"
## [141] "var_accel_forearm" "avg_roll_forearm"
## [143] "stddev_roll_forearm" "var_roll_forearm"
## [145] "avg_pitch_forearm" "stddev_pitch_forearm"
## [147] "var_pitch_forearm" "avg_yaw_forearm"
## [149] "stddev_yaw_forearm" "var_yaw_forearm"
## [151] "gyros_forearm_x" "gyros_forearm_y"
## [153] "gyros_forearm_z" "accel_forearm_x"
## [155] "accel_forearm_y" "accel_forearm_z"
## [157] "magnet_forearm_x" "magnet_forearm_y"
## [159] "magnet_forearm_z" "classe"
dim(tr)
## [1] 19622 160
head(tr[,5:15],3)
## cvtd_timestamp new_window num_window roll_belt pitch_belt yaw_belt
## 1 05/12/2011 11:23 no 11 1.41 8.07 -94.4
## 2 05/12/2011 11:23 no 11 1.41 8.07 -94.4
## 3 05/12/2011 11:23 no 11 1.42 8.07 -94.4
## total_accel_belt kurtosis_roll_belt kurtosis_picth_belt
## 1 3 <NA> <NA>
## 2 3 <NA> <NA>
## 3 3 <NA> <NA>
## kurtosis_yaw_belt skewness_roll_belt
## 1 <NA> <NA>
## 2 <NA> <NA>
## 3 <NA> <NA>
The data has 19622 observations of 160 variables (the last one is the outcome). There are some variables with many NAs. I also notice that new_window has 406 values of yes:
summary(tr$new_window)
## no yes
## 19216 406
and this corresponds to the number of NAs in other variables, for example:
summary(tr$kurtosis_yaw_belt)
## #DIV/0! NA's
## 406 19216
summary(tr$amplitude_yaw_belt)
## #DIV/0! 0.00 0.0000 NA's
## 10 12 384 19216
(19622 - 406 = 19216). Very interesting!
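In fact, a one-line check (base R only, no assumptions beyond the loaded data) confirms the all-or-nothing pattern: every column should show either 0 or 19216 NAs.
table(colSums(is.na(tr)))  # tally the columns by their number of NAs
Now let's have a look at some variables where new_window is 'yes':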
idx <- which(tr$new_window=='yes')[1:5]
tr[idx,c("new_window","kurtosis_roll_belt","kurtosis_yaw_belt","amplitude_yaw_belt","roll_belt","gyros_belt_x")]
## new_window kurtosis_roll_belt kurtosis_yaw_belt amplitude_yaw_belt
## 24 yes 5.587755 #DIV/0! 0.0000
## 52 yes -0.997130 #DIV/0! 0.0000
## 76 yes 7.515290 #DIV/0! 0.0000
## 165 yes -2.121212 #DIV/0! 0.0000
## 210 yes -1.122273 #DIV/0! 0.0000
## roll_belt gyros_belt_x
## 24 1.51 0.02
## 52 1.27 0.00
## 76 1.18 0.02
## 165 1.01 0.03
## 210 129.00 -0.43
and where new_window is 'no':
idx <- which(tr$new_window=='no')[1:5]
tr[idx,c("new_window","kurtosis_roll_belt","kurtosis_yaw_belt","amplitude_yaw_belt","roll_belt","gyros_belt_x")]
## new_window kurtosis_roll_belt kurtosis_yaw_belt amplitude_yaw_belt
## 1 no <NA> <NA> <NA>
## 2 no <NA> <NA> <NA>
## 3 no <NA> <NA> <NA>
## 4 no <NA> <NA> <NA>
## 5 no <NA> <NA> <NA>
## roll_belt gyros_belt_x
## 1 1.41 0.00
## 2 1.41 0.02
## 3 1.42 0.00
## 4 1.48 0.02
## 5 1.48 0.02
Now we have a better understanding of the data structure. This is how it looks: a large group of variables is mostly NA (19216 rows, those with new_window = 'no'). In the 406 rows where these variables are not NA (new_window = 'yes'), some of the values are problematic, for example '#DIV/0!' or zeros. In other words, these columns appear to have measurement problems: normally they record nothing (NA, missing data, or blanks), and otherwise they record problematic numbers.
My strategy to clean the data: remove columns with mostly NAs
na_count <- sapply(tr, function(x) sum(is.na(x)))  # NAs per column
skipcol <- na_count == 19216                       # the mostly-NA columns
tr1 <- tr[, !skipcol]
dim(tr1)
## [1] 19622 60
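As a cross-check on this NA filter, caret's nearZeroVar() can flag any remaining near-constant columns; a minimal sketch (new_window, with its 19216/406 split, is likely the only column flagged):
nzv <- nearZeroVar(tr1, saveMetrics = TRUE)  # frequency-ratio and uniqueness metrics
nzv[nzv$nzv, ]                               # inspect the flagged columns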
The data now has 60 variables. Let’s have a look at some variables:
wtype <- as.factor(tr1[,60]) #last column is the class
plot(tr1$roll_belt, pch=20, col=wtype)
legend("topleft", legend=levels(wtype), pch=16, col=unique(wtype))
The data seems to be sorted by class. I now plot all variables to see whether any of them could be factorized. A boxplot is useful for detecting outliers, but a plain index plot shows the outliers even more clearly and also indicates which variables are factorable:
par(mfrow = c(6,10), mar=c(1,1,1,1))
for (i in 1:60)
{
plot(tr1[,i], pch=20, col="blue")
}
We can see some extreme outliers in variables 38, 39, 40, 45, 51, 52, and 53. I will remove the points lying beyond the midpoint between the 75th percentile and the maximum (or, for low-side outliers, between the minimum and the 25th percentile), then remove the first 7 variables, which have nothing to do with prediction (time stamps and indices).
# Flag low-side outliers (below the midpoint of the minimum and the 25th percentile)
# or high-side outliers (above the midpoint of the 75th percentile and the maximum)
ol38 <- as.numeric(tr1$gyros_dumbbell_x < (quantile(tr1$gyros_dumbbell_x)[1] + quantile(tr1$gyros_dumbbell_x)[2])/2)
ol39 <- as.numeric(tr1$gyros_dumbbell_y > (quantile(tr1$gyros_dumbbell_y)[4] + quantile(tr1$gyros_dumbbell_y)[5])/2)
ol40 <- as.numeric(tr1$gyros_dumbbell_z > (quantile(tr1$gyros_dumbbell_z)[4] + quantile(tr1$gyros_dumbbell_z)[5])/2)
ol45 <- as.numeric(tr1$magnet_dumbbell_y < (quantile(tr1$magnet_dumbbell_y)[1] + quantile(tr1$magnet_dumbbell_y)[2])/2)
ol51 <- as.numeric(tr1$gyros_forearm_x < (quantile(tr1$gyros_forearm_x)[1] + quantile(tr1$gyros_forearm_x)[2])/2)
ol52 <- as.numeric(tr1$gyros_forearm_y > (quantile(tr1$gyros_forearm_y)[4] + quantile(tr1$gyros_forearm_y)[5])/2)
ol53 <- as.numeric(tr1$gyros_forearm_z > (quantile(tr1$gyros_forearm_z)[4] + quantile(tr1$gyros_forearm_z)[5])/2)
skiprow <- ol38 + ol39 + ol40 + ol45 + ol51 + ol52 + ol53
tr2 <- tr1[-which(skiprow > 0),]
dim(tr2)
## [1] 19620 60
The data now has 19620 observations of 60 variables (the last column is the outcome); only 2 rows were removed. Let's have a look at the data again, colored by class:
wtype <- as.factor(tr2$classe)  # recompute: wtype must match tr2's 19620 rows
par(mfrow = c(6,10), mar=c(1,1,1,1))
for (i in 1:60)
{
plot(tr2[,i], pch=20, col=wtype)
}
Now the extreme outliers are gone, and since only 2 rows were dropped we have lost almost nothing.
The data is now clearly sorted by class (see plots 1, 4, and 60). Probably the volunteers were asked to complete one type of movement before moving on to the next. The plots also suggest that the classes are not badly skewed; the quick tabulation below confirms this.
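A quick tabulation of the outcome (no assumptions beyond the cleaned data):
table(tr2$classe)
round(prop.table(table(tr2$classe)), 2)  # class A is the largest (~28%), the rest roughly 16-19%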
Now let me remove the first 7 variables, which hold timing and volunteer-identity information:
tr4 <- tr2[,-c(1:7)]
dim(tr4)
## [1] 19620 53
par(mfrow = c(4,13), mar=c(1,1,1,1))
for (i in 1:52)
{
plot(tr4[,i], pch=20, col=wtype)
}
names(tr4)
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [40] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [52] "magnet_forearm_z" "classe"
The data now has 19620 observations of 53 variables (the last column is the outcome). Each sensor contributes 13 measurements, displayed row by row in the order: belt, arm, dumbbell, and forearm.
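As a sanity check, a name-pattern count verifies the 13-columns-per-sensor structure (a minimal sketch; note that the pattern "_arm" does not match the forearm columns because of the leading underscore):
sapply(c("_belt", "_arm", "_dumbbell", "_forearm"),
       function(p) sum(grepl(p, names(tr4))))  # each should return 13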
I divide the training data into 3 parts by random subsampling: training, validation, and test sets. Since the sample size is moderate (about 20,000 observations), my training/validation/test split is 60%/20%/20%.
inTrain <- createDataPartition(y=tr4$classe, p = 0.6, list=FALSE)
tr4tr <- tr4[inTrain, ]
tr4tmp <- tr4[-inTrain, ]
inVal <- createDataPartition(y=tr4tmp$classe, p = 0.5, list=FALSE)
tr4val <- tr4tmp[inVal, ]
tr4test <- tr4tmp[-inVal, ]
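A quick check that the partition sizes and class proportions came out as intended (a minimal sketch):
sapply(list(train = tr4tr, validation = tr4val, test = tr4test), nrow)
round(prop.table(table(tr4tr$classe)), 3)  # should mirror the overall class proportions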
Now I normalize the data (this is not actually required for tree-based methods, but as a habit I usually normalize data before feeding it to any model).
preprocess <- preProcess(tr4tr, method = c("center", "scale")) # build the preprocess "template"
tr4trp <- predict(preprocess, tr4tr) # apply the template on the training set
tr4valp <- predict(preprocess, tr4val) # apply the template on the validation set
tr4testp <- predict(preprocess, tr4test) # apply the template on the test set
I will train different models on the training set. Three methods will be tested:
Decision tree
Random forest
Gradient boosting machine (GBM).
The Decision tree is included although I suspect it will not perform very well; the other methods should do better but are harder to interpret.
For demonstration, the models are trained without parameter tuning. Keep in mind that the more sophisticated GBM can outperform a plain RF when tuned properly. However, RF can be parallelized while GBM in principle cannot, because each GBM tree depends on the previous ones, whereas RF trees are grown independently.
Notes: resampling is switched from the default bootstrapping to 5-fold cross-validation, and parallelization is enabled to reduce computing time.
These trained models are then applied to the validation set, and I pick the best model based on its validation accuracy. Please note that model optimization and tuning are not the focus of this project, so most model parameters are kept at their defaults.
Because training is time consuming, I extend the project by applying PCA before model training. The PC transformation is fitted on the training data so that 90% of the variance is preserved, and the transformed components are then used to train the best methods selected above.
At the end, I compare all methods in terms of accuracy and training time. The best one is applied to the test set for prediction.
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
conMtxPlot <- function(a){
  a <- prop.table(a, 2)              # normalize each actual-class column
  conf <- as.data.frame(as.table(a))
  plot <- ggplot(conf)
  plot + geom_tile(aes(x = Reference, y = Prediction, fill = Freq)) +
    scale_x_discrete(name = "Actual class") +
    scale_y_discrete(name = "Predicted class") +
    scale_fill_gradient(breaks = seq(from = -.5, to = 4, by = .2)) +
    labs(fill = "Normalized\nFrequency") +
    geom_text(aes(x = Reference, y = Prediction, label = sprintf("%1.2f", Freq)), vjust = 1)
}
# Training
start_time <- Sys.time()
fit_dt <- train(classe ~ ., data=tr4trp, method="rpart", trControl=trainControl(method="none"), tuneGrid=data.frame(cp=0.01))
Sys.time() - start_time
## Time difference of 1.27821 secs
# Validating
confusionMatrix(tr4valp$classe, predict(fit_dt, tr4valp))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1032 26 17 18 23
## B 136 414 76 31 102
## C 44 32 447 124 37
## D 61 16 42 413 111
## E 22 23 20 58 598
##
## Overall Statistics
##
## Accuracy : 0.7402
## 95% CI : (0.7262, 0.7539)
## No Information Rate : 0.3301
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6697
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7969 0.8102 0.7425 0.6413 0.6866
## Specificity 0.9680 0.8989 0.9286 0.9299 0.9597
## Pos Pred Value 0.9247 0.5455 0.6535 0.6423 0.8294
## Neg Pred Value 0.9063 0.9693 0.9521 0.9296 0.9147
## Prevalence 0.3301 0.1303 0.1535 0.1642 0.2220
## Detection Rate 0.2631 0.1055 0.1139 0.1053 0.1524
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.8825 0.8545 0.8356 0.7856 0.8231
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_dt, tr4valp))$table)
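One advantage of the decision tree is interpretability. Assuming the rpart.plot package is installed (it is not loaded above), the fitted tree can be drawn as a quick sketch:
library(rpart.plot)            # assumption: installed separately
rpart.plot(fit_dt$finalModel)  # caret stores the underlying rpart object in finalModel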
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)
# Training
start_time <- Sys.time()
fit_rf_par_rs <- train(classe ~ ., data = tr4trp, method = "rf", trControl = fitControl)
Sys.time() - start_time
## Time difference of 7.690241 mins
stopCluster(cluster)
registerDoSEQ()
# Validating
confusionMatrix(tr4valp$classe, predict(fit_rf_par_rs, tr4valp))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 4 0 0 0
## B 3 751 5 0 0
## C 0 4 676 4 0
## D 0 1 9 632 1
## E 0 0 3 1 717
##
## Overall Statistics
##
## Accuracy : 0.9911
## 95% CI : (0.9876, 0.9938)
## No Information Rate : 0.2842
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9887
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9973 0.9882 0.9755 0.9922 0.9986
## Specificity 0.9986 0.9975 0.9975 0.9967 0.9988
## Pos Pred Value 0.9964 0.9895 0.9883 0.9829 0.9945
## Neg Pred Value 0.9989 0.9972 0.9948 0.9985 0.9997
## Prevalence 0.2842 0.1937 0.1767 0.1624 0.1830
## Detection Rate 0.2835 0.1914 0.1723 0.1611 0.1828
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9979 0.9928 0.9865 0.9944 0.9987
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_rf_par_rs, tr4valp))$table)
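Before moving on, it is worth glancing at which predictors drive the random forest. caret's varImp() wraps the model's importance scores (a minimal sketch):
imp <- varImp(fit_rf_par_rs)  # importance scaled to 0-100 by caret
plot(imp, top = 20)           # the 20 most influential predictors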
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)
# Training
start_time <- Sys.time()
fit_gbm_rs <- train(classe ~ ., data = tr4trp, method="gbm", verbose=FALSE, trControl = fitControl)
Sys.time() - start_time
## Time difference of 3.691285 mins
stopCluster(cluster)
registerDoSEQ()
# Validation
confusionMatrix(tr4valp$classe, predict(fit_gbm_rs, tr4valp))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1101 11 2 2 0
## B 17 721 21 0 0
## C 0 22 656 5 1
## D 1 0 25 611 6
## E 2 7 7 9 696
##
## Overall Statistics
##
## Accuracy : 0.9648
## 95% CI : (0.9586, 0.9704)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9555
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9822 0.9474 0.9226 0.9745 0.9900
## Specificity 0.9946 0.9880 0.9913 0.9903 0.9922
## Pos Pred Value 0.9866 0.9499 0.9591 0.9502 0.9653
## Neg Pred Value 0.9929 0.9874 0.9830 0.9951 0.9978
## Prevalence 0.2858 0.1940 0.1812 0.1598 0.1792
## Detection Rate 0.2807 0.1838 0.1672 0.1557 0.1774
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9884 0.9677 0.9570 0.9824 0.9911
conMtxPlot(confusionMatrix(tr4valp$classe, predict(fit_gbm_rs, tr4valp))$table)
Comparing the 3 methods, the validation accuracies of the Decision tree, Random forest, and GBM are 74%, 99%, and 96%, respectively.
Based on this result, the Random forest is the most accurate method. However, it is also the most time consuming (~8 minutes), compared to the Decision tree (~1 second) and GBM (~4 minutes).
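Since the random forest and GBM were both fit with the same 5-fold cross-validation control, their resampling distributions can also be compared directly with caret's resamples() (a sketch; the decision tree is excluded because it was fit with method = "none"):
resamps <- resamples(list(rf = fit_rf_par_rs, gbm = fit_gbm_rs))
summary(resamps)  # side-by-side CV accuracy and kappa distributions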
The next step of my analysis is to apply PCA with the best 2 methods (Random forest and GBM) and train the models again to see how they perform.
preprocessPCA <- preProcess(tr4tr[,-53], method = "pca", thresh = 0.9)
preprocessPCA
## Created from 11775 samples and 52 variables
##
## Pre-processing:
## - centered (52)
## - ignored (0)
## - principal component signal extraction (52)
## - scaled (52)
##
## PCA needed 20 components to capture 90 percent of the variance
trainPCA <- predict(preprocessPCA, tr4tr[,-53])
valPCA <- predict(preprocessPCA,tr4val[,-53])
testPCA <- predict(preprocessPCA,tr4test[,-53])
The analysis shows that to capture 90% of the variance, only 20 principal components (instead of 52 variables) are needed for training.
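To see how quickly the variance accumulates, a scree-style check with base R's prcomp on the same training predictors is a useful sketch (the 0.9 line should be crossed around component 20):
pca <- prcomp(tr4tr[, -53], center = TRUE, scale. = TRUE)
cumvar <- cumsum(pca$sdev^2) / sum(pca$sdev^2)  # cumulative variance explained
plot(cumvar, type = "b", xlab = "Number of components", ylab = "Cumulative variance")
abline(h = 0.9, lty = 2)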
cluster <- makeCluster(detectCores()-1) # convention to leave 1 core for OS
registerDoParallel(cluster)
start_time <- Sys.time()
fit_pca_rf <- train(x = trainPCA, y = tr4tr$classe, method = "rf", verbose = FALSE, trControl = fitControl)
Sys.time() - start_time
## Time difference of 4.108429 mins
stopCluster(cluster)
registerDoSEQ()
confusionMatrix(tr4val$classe, predict(fit_pca_rf, valPCA))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1101 3 4 4 4
## B 9 740 10 0 0
## C 5 8 664 4 3
## D 3 0 31 609 0
## E 0 4 9 3 705
##
## Overall Statistics
##
## Accuracy : 0.9735
## 95% CI : (0.968, 0.9783)
## No Information Rate : 0.285
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9665
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9848 0.9801 0.9248 0.9823 0.9902
## Specificity 0.9947 0.9940 0.9938 0.9897 0.9950
## Pos Pred Value 0.9866 0.9750 0.9708 0.9471 0.9778
## Neg Pred Value 0.9939 0.9953 0.9833 0.9966 0.9978
## Prevalence 0.2850 0.1925 0.1830 0.1580 0.1815
## Detection Rate 0.2807 0.1886 0.1693 0.1552 0.1797
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9897 0.9871 0.9593 0.9860 0.9926
cluster <- makeCluster(detectCores()-1)
registerDoParallel(cluster)
start_time <- Sys.time()
fit_pca_gbm <- train(x = trainPCA, y = tr4tr$classe, method = "gbm", verbose = FALSE, trControl = fitControl)
Sys.time() - start_time
## Time difference of 2.104061 mins
stopCluster(cluster)
registerDoSEQ()
confusionMatrix(tr4val$classe,predict(fit_pca_gbm, valPCA))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 980 38 33 57 8
## B 71 590 56 22 20
## C 56 58 551 13 6
## D 38 20 82 481 22
## E 44 58 47 30 542
##
## Overall Statistics
##
## Accuracy : 0.8014
## 95% CI : (0.7886, 0.8138)
## No Information Rate : 0.3031
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7482
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8242 0.7723 0.7165 0.7977 0.9064
## Specificity 0.9503 0.9465 0.9578 0.9512 0.9462
## Pos Pred Value 0.8781 0.7773 0.8056 0.7481 0.7517
## Neg Pred Value 0.9255 0.9450 0.9327 0.9628 0.9825
## Prevalence 0.3031 0.1947 0.1960 0.1537 0.1524
## Detection Rate 0.2498 0.1504 0.1405 0.1226 0.1382
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.8872 0.8594 0.8372 0.8744 0.9263
The results show that with fewer than half of the original features (20 principal components instead of 52 variables), running time is reduced, but the validation accuracy also drops, to 97% and 80% respectively.
In reality, the choice between Random forest with or without PCA depends on the project (sample size, model update requirements…). For this project, even though I did not perform any deep parameter tuning, I pick the Random forest without PCA as my final model.
# Apply on test set
confusionMatrix(tr4testp$classe, predict(fit_rf_par_rs, tr4testp))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1112 3 0 0 0
## B 5 753 1 0 0
## C 0 3 677 4 0
## D 0 0 9 633 1
## E 0 1 3 1 716
##
## Overall Statistics
##
## Accuracy : 0.9921
## 95% CI : (0.9888, 0.9946)
## No Information Rate : 0.2848
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.99
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9955 0.9908 0.9812 0.9922 0.9986
## Specificity 0.9989 0.9981 0.9978 0.9970 0.9984
## Pos Pred Value 0.9973 0.9921 0.9898 0.9844 0.9931
## Neg Pred Value 0.9982 0.9978 0.9960 0.9985 0.9997
## Prevalence 0.2848 0.1938 0.1759 0.1627 0.1828
## Detection Rate 0.2835 0.1920 0.1726 0.1614 0.1826
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9972 0.9944 0.9895 0.9946 0.9985
conMtxPlot(confusionMatrix(tr4testp$classe, predict(fit_rf_par_rs, tr4testp))$table)
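Finally, the same pipeline can be applied to the separate 20-case quiz file that accompanies this dataset. A minimal sketch, assuming pml-testing.csv sits in the working directory next to pml-training.csv; the placeholder classe column is added only so the preProcess object sees the same columns it was built with:
te <- read.csv("pml-testing.csv", na.strings = c("NA", "", " "))
te <- te[, names(tr4)[names(tr4) != "classe"]]        # keep the same 52 predictors
te$classe <- factor(NA, levels = levels(tr4$classe))  # placeholder column (assumption)
tep <- predict(preprocess, te)                        # identical centering/scaling
predict(fit_rf_par_rs, tep[, names(tep) != "classe"]) # predicted classes for the 20 cases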