# Load required Libraries
library(caret)
Can data collected by wearable devices like the Jawbone Up, Nike FuelBand, and Fitbit be used to build a predictive model that could teach us the right way to do an exercise? As this analysis will show, it is quite possible. The data comes from http://groupware.les.inf.puc-rio.br/har. The machine learning models trained on this data set are shown to reach nearly 100% accuracy in classifying whether a bicep curl is done correctly.
The training data comes from https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv and contains 19622 observations of 159 bio-kinetic variables collected by the wearable devices. The goal of this analysis is to see if we can correctly classify each observation into one of 5 classes A, B, C, D, E, where A is the ‘correct’ way of doing the exercise and the others are common ‘incorrect’ ways.
A quick summary of the data shows that 60 variables have near-zero variance, so they are removed to reduce noise in the data set. A further look at the reduced data set through the summary function reveals that 41 more variables are NA for about 98% of the observations. These variables aren’t informative enough to serve as features, so they are removed as well. Finally, the user identifier variables are also dropped, leaving a final data set of 57 columns (56 predictors plus the classe outcome), down from the original 160 columns.
pmlTrain <- read.csv("pml-training.csv")
# remove the near-zero variance variables
nzv1 <- nearZeroVar(pmlTrain)
pmlTrain1 <- pmlTrain[,-nzv1]
# remove the row index and user identifier columns
pmlTrain1 <- pmlTrain1[,-c(1,2)]
# remove variables that are almost entirely NA (19216 of the 19622 observations, ~98%) and thus not useful as predictors
pmlTrain2 <- pmlTrain1[,colSums(is.na(pmlTrain1)) != 19216]
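The hard-coded count 19216 works because those sparse columns are NA in exactly 19216 of the 19622 rows (about 98%); expressing the filter as a proportion makes the intent clearer. A minimal equivalent sketch:
# equivalent filter: keep columns that are less than 90% NA
# (the sparse columns here are ~98% NA, so a 90% cutoff separates them cleanly)
pmlTrain2 <- pmlTrain1[, colMeans(is.na(pmlTrain1)) < 0.9]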
The goal of this analysis is classification: the predictors are used to assign each observation to one of 5 groups, so regression models aren’t good candidates and aren’t tried. The simplest algorithm trained on the data set is a decision tree. The ‘rpart’ model reaches an accuracy of only about 50%. That isn’t good enough, but it does show that tree-based classifiers can pick up the structure in the data, so bagging and boosting ensembles should yield much better accuracy.
A random forest with the default trainControl (bootstrapping with 25 resamples and accuracy as the metric) is trained on the training data set to start with. The resulting resampling accuracy is a staggering 99.81%, i.e. an estimated out-of-sample error of roughly 0.2%. The final model is an aggregation of 500 trees with 38 randomly selected variables considered at each split. This suggests the model is nearly perfect on this data (the out-of-bag error figure printed after the model summary comes with a caveat, noted below).
A boosting algorithm (caret method = ‘gbm’) is tried next to see if it can reach the accuracy of the random forest. At interaction.depth = 3 and n.trees = 150, the model reaches a resampling accuracy of 99.60%, i.e. an estimated error of about 0.4%. This is also quite good and almost on par with the random forest.
When these models are challenged with the testing partition, which holds 30% of the original data (5885 rows), they predict quite well: ‘rf’ reaches a prediction accuracy of 99.93% and ‘gbm’ 99.69%.
# creating training and testing partitions
inTrain <- createDataPartition(y = pmlTrain2$classe, p = 0.7, list = FALSE)
training <- pmlTrain2[inTrain, ]
testing <- pmlTrain2[-inTrain,]
# This is a classification problem, so the first obvious choice is a single classification tree ('rpart')
modfitRPart <- train(classe ~ ., data = training, method = "rpart")
# Accuracy of this classification tree is too low. Time for some bagging
modfitRPart$results
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.03972129 0.4975520 0.34168326 0.09936221 0.15363205
## 2 0.04726545 0.4230479 0.22125903 0.07067933 0.12060648
## 3 0.11667175 0.3274525 0.06428058 0.04225594 0.06312011
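To see why the single tree underperforms, the fitted tree itself can be inspected; a minimal sketch, assuming the rpart.plot package is installed (output not shown):
# plot the final rpart tree picked by caret
library(rpart.plot)
rpart.plot(modfitRPart$finalModel)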
# Random Forest
modfitRF <- train(classe ~ ., data = training, method = "rf")
# 99.81% resampling accuracy, i.e. ~0.19% estimated error, using default trainControl params, i.e. bootstrap with 25 resamples
modfitRF
## Random Forest
##
## 13737 samples
## 56 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9904095 0.9878616
## 38 0.9980841 0.9975752
## 74 0.9968662 0.9960341
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 38.
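The 25-resample bootstrap used above is expensive for a forest of this size. caret's trainControl can swap in k-fold cross-validation for a cheaper, comparable error estimate; a minimal sketch (5-fold CV is an assumption, not what produced the results shown):
# hypothetical alternative: 5-fold cross-validation instead of 25 bootstrap resamples
ctrl <- trainControl(method = "cv", number = 5)
modfitRFcv <- train(classe ~ ., data = training, method = "rf", trControl = ctrl)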
sprintf("Avg out of Sample Error Rate %f",round(mean(modfitRF$finalModel$err.rate[1]),2))
## [1] "Avg out of Sample Error Rate 0.020000"
pRF <- predict(modfitRF, testing)
# 99.93% accuracy in predicting the testing data
confusionMatrix(pRF, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1022 0 0
## D 0 0 4 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9993
## 95% CI : (0.9983, 0.9998)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9991
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 0.9961 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 0.9992 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 0.9959 1.0000
## Neg Pred Value 1.0000 1.0000 0.9992 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1935 0.1737 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1737 0.1645 0.1839
## Balanced Accuracy 1.0000 1.0000 0.9981 0.9996 1.0000
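It can also be useful to see which sensor readings drive the classification; caret's varImp wrapper ranks the predictors. A quick sketch (output omitted):
# rank predictors by random forest importance and plot the top 20
varImp(modfitRF)
plot(varImp(modfitRF), top = 20)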
# Boosting
modfitBoost <- train(classe ~ ., data = training, method = "gbm")
# 99.60% resampling accuracy for the selected model (n.trees = 150, interaction.depth = 3)
modfitBoost
## Stochastic Gradient Boosting
##
## 13737 samples
## 56 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Bootstrapped (25 reps)
## Summary of sample sizes: 13737, 13737, 13737, 13737, 13737, 13737, ...
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8378679 0.7941710
## 1 100 0.8964783 0.8688805
## 1 150 0.9252828 0.9053503
## 2 50 0.9527936 0.9402308
## 2 100 0.9851184 0.9811745
## 2 150 0.9915084 0.9892593
## 3 50 0.9820227 0.9772531
## 3 100 0.9932479 0.9914595
## 3 150 0.9959911 0.9949287
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
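caret selected the largest values in its default grid (n.trees = 150, interaction.depth = 3), which hints that a wider grid might squeeze out slightly more accuracy. A sketch of an expanded search, where the specific grid values are assumptions:
# hypothetical expanded tuning grid for gbm
gbmGrid <- expand.grid(interaction.depth = c(3, 5, 7),
                       n.trees = c(150, 250, 350),
                       shrinkage = 0.1,
                       n.minobsinnode = 10)
modfitBoost2 <- train(classe ~ ., data = training, method = "gbm",
                      tuneGrid = gbmGrid, verbose = FALSE)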
pBoost <- predict(modfitBoost, testing)
# 99.69% accuracy in predicting the testing data
confusionMatrix(pBoost, testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 1 0 0 0
## B 0 1135 1 0 0
## C 0 3 1017 0 0
## D 0 0 8 962 3
## E 0 0 0 2 1079
##
## Overall Statistics
##
## Accuracy : 0.9969
## 95% CI : (0.9952, 0.9982)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9961
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9965 0.9912 0.9979 0.9972
## Specificity 0.9998 0.9998 0.9994 0.9978 0.9996
## Pos Pred Value 0.9994 0.9991 0.9971 0.9887 0.9981
## Neg Pred Value 1.0000 0.9992 0.9982 0.9996 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1929 0.1728 0.1635 0.1833
## Detection Prevalence 0.2846 0.1930 0.1733 0.1653 0.1837
## Balanced Accuracy 0.9999 0.9981 0.9953 0.9978 0.9984
The ‘rf’ and ‘gbm’ models have each achieved more than 99.5% accuracy. Model stacking is now attempted to see if their accuracy can be pushed even higher. The ‘rf’ and ‘gbm’ predictions on the testing set are combined with the classe variable they are predicting in a data frame, and this data frame is used as the training data for the stacked model. The stacked model is then used to predict on a separate validation set of 5885 observations (30% of the data), where it reaches a prediction accuracy of 99.97%.
# Building and Training Stacked Model
DFpredComboBagBoost <- data.frame(pRF, pBoost, classe = testing$classe)
modfitComboBagBoost <- train(classe ~., data = DFpredComboBagBoost, method = "gbm")
# create a validation set (note: a fresh random partition, so it overlaps the original training rows)
inTrain2 <- createDataPartition(y = pmlTrain2$classe, p = 0.7, list = FALSE)
validation <- pmlTrain2[-inTrain2,]
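Because inTrain2 is a fresh random partition, this validation set overlaps rows that the base models (and, via the testing set, the stacked model) have already seen, so the accuracy below is likely optimistic. A leakage-free alternative would carve both the stacking data and the validation data out of the original held-out 30%; a hypothetical sketch:
# hypothetical leakage-free design: split the original held-out rows in half
holdout <- pmlTrain2[-inTrain, ]
inStack <- createDataPartition(y = holdout$classe, p = 0.5, list = FALSE)
stackTrain <- holdout[inStack, ]   # base-model predictions here would train the stack
stackVal <- holdout[-inStack, ]    # the stack would be scored here, on unseen rows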
# Prediction on Validation Set
pRFV <- predict(modfitRF, validation)
pBoostV <- predict(modfitBoost, validation)
DFpredComboBagBoostV <- data.frame(pRF = pRFV, pBoost = pBoostV)
pcomboBagBoostV <- predict(modfitComboBagBoost, DFpredComboBagBoostV)
# Prediction accuracy of 99.97% on the validation set
confusionMatrix(pcomboBagBoostV, validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1024 0 0
## D 0 0 2 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9997
## 95% CI : (0.9988, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9996
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 0.9981 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 0.9996 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 0.9979 1.0000
## Neg Pred Value 1.0000 1.0000 0.9996 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1935 0.1740 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1740 0.1641 0.1839
## Balanced Accuracy 1.0000 1.0000 0.9990 0.9998 1.0000