In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The data consist of a training set (19,622 observations) and a testing set (20 observations).
The goal of this project is to predict the manner in which the participants did the exercise, which is recorded in the “classe” variable of the training set. The dataset was cleaned, and the remaining variables were used to fit 3 prediction models. The model with the best accuracy on a held-out validation set was then applied to the 20 test cases in the testing data.
Note: The dataset used in this project is courtesy of “Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”.
rm(list=ls()) # free up memory for the download of the data sets
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
# set the URL for the download of Training and Testing Dataset
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the Training and Testing datasets
training <- read.csv(url(urlTrain))
testing <- read.csv(url(urlTest))
dim(training)
## [1] 19622 160
dim(testing)
## [1] 20 160
# hold out 30% of the training data as a validation set (split stratified on classe)
in_train <- createDataPartition(training$classe, p=0.7, list=FALSE)
train_data <- training[in_train, ]
valid_data <- training[-in_train, ]
dim(train_data)
## [1] 13737 160
dim(valid_data)
## [1] 5885 160
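The split above relies on createDataPartition, which samples within each level of the outcome so that both subsets keep roughly the same class mix. A minimal base-R sketch of that idea follows. The toy `classe` vector, the class probabilities, and the seed are illustrative assumptions, not the project data.

```r
# Toy outcome vector standing in for training$classe (hypothetical data).
set.seed(12345)
classe <- factor(sample(c("A", "B", "C", "D", "E"), size = 1000, replace = TRUE,
                        prob = c(0.28, 0.19, 0.17, 0.16, 0.20)))

# Sample 70% of the row indices within each class, then pool them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.7 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the full data and in the 70% subset should be close.
round(prop.table(table(classe)), 3)
round(prop.table(table(classe[in_train])), 3)
```

Calling set.seed before the split, as in this sketch, makes the partition reproducible from run to run.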
# remove the first 7 columns (row index, user name, timestamps, window markers),
# which identify the observation rather than describe the movement
train_data <- train_data[, -c(1:7)]
valid_data <- valid_data[, -c(1:7)]
dim(train_data)
## [1] 13737 153
dim(valid_data)
## [1] 5885 153
# remove variables with near-zero variance (identified on the training split)
NZV <- nearZeroVar(train_data)
train_data <- train_data[, -NZV]
valid_data <- valid_data[, -NZV]
dim(train_data)
## [1] 13737 100
dim(valid_data)
## [1] 5885 100
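nearZeroVar flags predictors that are dominated by a single value. With caret's default settings (freqCut = 95/5 and uniqueCut = 10), a column is flagged when its most common value is more than 19 times as frequent as the second most common and its distinct values make up about 10% or less of the rows. The base-R sketch below approximates that rule. The helper `near_zero_var_sketch` and the toy data frame are hypothetical, and the exact comparisons inside caret may differ slightly.

```r
# Hypothetical helper approximating caret's near-zero-variance rule.
near_zero_var_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)  # constant column: zero variance
  freq_ratio <- as.numeric(counts[1] / counts[2])     # most vs. second most common
  percent_unique <- 100 * length(counts) / length(x)  # share of distinct values
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Toy predictors (made-up data, 100 rows each).
toy <- data.frame(
  mostly_zero = c(rep(0, 99), 1),   # 99:1 frequency ratio, 2 distinct values
  balanced    = rep(c(0, 1), 50),   # 1:1 frequency ratio
  continuous  = seq_len(100) / 10   # every value distinct
)
flags <- vapply(toy, near_zero_var_sketch, logical(1))
flags
```

Only `mostly_zero` is flagged here; the other two columns vary enough to be kept.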
# Remove variables containing missing values (columns chosen on the training split
# and applied to both splits so their columns stay aligned)
na_free <- colSums(is.na(train_data)) == 0
train_data <- train_data[, na_free]
valid_data <- valid_data[, na_free]
dim(train_data)
## [1] 13737 53
dim(valid_data)
## [1] 5885 53
# Plot correlation between variables to explore relationships
cor_matrix <- cor(train_data[, -53])
corrplot(cor_matrix, order = "FPC", method = "color", type = "lower",
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))
# Identify highly correlated variables (absolute pairwise correlation above 0.7)
highly_correlated <- findCorrelation(cor_matrix, cutoff = 0.7)
colnames(cor_matrix)[highly_correlated]
## [1] "accel_belt_z" "roll_belt" "accel_arm_y"
## [4] "accel_belt_y" "total_accel_belt" "yaw_belt"
## [7] "accel_dumbbell_z" "accel_belt_x" "pitch_belt"
## [10] "magnet_dumbbell_x" "accel_dumbbell_y" "magnet_dumbbell_y"
## [13] "accel_dumbbell_x" "accel_arm_x" "accel_arm_z"
## [16] "magnet_arm_y" "magnet_belt_y" "accel_forearm_y"
## [19] "gyros_forearm_y" "gyros_arm_x"
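findCorrelation only nominates columns that take part in strongly correlated pairs; the models in this report are still fit on all 52 predictors, so the list above is exploratory. To see which pairs drive such a list, the pairs above the cutoff can be enumerated directly. The sketch below does this in base R on made-up data (the variables `x1`, `x2`, `x3` and the seed are assumptions for illustration).

```r
# Toy predictors: x2 is nearly a copy of x1, x3 is independent (made-up data).
set.seed(12345)
n <- 200
x1 <- rnorm(n)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(n, sd = 0.1),
  x3 = rnorm(n)
)
toy_cor <- cor(toy)

# List each pair once (upper triangle) whose absolute correlation exceeds 0.7.
high <- which(abs(toy_cor) > 0.7 & upper.tri(toy_cor), arr.ind = TRUE)
high_pairs <- data.frame(
  var1 = rownames(toy_cor)[high[, "row"]],
  var2 = colnames(toy_cor)[high[, "col"]],
  r    = round(toy_cor[high], 3)
)
high_pairs
```

Applied to `cor_matrix`, the same indexing would show, for example, which belt measurements are correlated with each other.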
Three models are fit to the training split: a decision tree, a random forest, and a generalized boosted model (GBM). Each model is evaluated on the validation split, and the one with the highest accuracy is applied to the testing dataset for the final predictions.
# model fit: decision tree (rpart)
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=train_data, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Validation dataset
predictDecTree <- predict(modFitDecTree, newdata=valid_data, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, as.factor(valid_data$classe))
confMatDecTree
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1557  248   16  107   45
##          B   30  602   89   38   70
##          C   48  170  832   83   79
##          D   23   69   68  658   75
##          E   16   50   21   78  813
##
## Overall Statistics
##
##                Accuracy : 0.7582
##                  95% CI : (0.747, 0.7691)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.6924
##
##  Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9301   0.5285   0.8109   0.6826   0.7514
## Specificity            0.9012   0.9522   0.9218   0.9522   0.9656
## Pos Pred Value         0.7892   0.7262   0.6865   0.7368   0.8313
## Neg Pred Value         0.9701   0.8938   0.9585   0.9387   0.9452
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2646   0.1023   0.1414   0.1118   0.1381
## Detection Prevalence   0.3353   0.1409   0.2059   0.1517   0.1662
## Balanced Accuracy      0.9157   0.7404   0.8664   0.8174   0.8585
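The summary statistics printed above can be recomputed from the confusion matrix itself, which helps when reading the later model outputs. The sketch below re-enters the decision tree counts by hand (rows are predictions, columns are reference classes) and derives accuracy, Cohen's kappa, and per-class sensitivity. The variable names are illustrative choices, not part of caret.

```r
# Decision tree confusion matrix copied from the output above.
cm <- matrix(
  c(1557, 248,  16, 107,  45,
      30, 602,  89,  38,  70,
      48, 170, 832,  83,  79,
      23,  69,  68, 658,  75,
      16,  50,  21,  78, 813),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(cm)
accuracy <- sum(diag(cm)) / n                    # share of correct predictions
expected <- sum(rowSums(cm) * colSums(cm)) / n^2 # agreement expected by chance
kappa_stat <- (accuracy - expected) / (1 - expected)
sensitivity <- diag(cm) / colSums(cm)            # per-class recall

round(c(accuracy = accuracy, kappa = kappa_stat), 4)
round(sensitivity, 4)
```

The low sensitivity for class B (about 0.53) reflects the 248 class B cases that the tree labels as class A.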
# plot matrix results
plot(confMatDecTree$table, col = confMatDecTree$byClass,
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDecTree$overall['Accuracy'], 4)))
# model fit: random forest, tuned with 3-fold cross-validation
set.seed(12345)
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=train_data, method="rf",
                          trControl=controlRF)
modFitRandForest$finalModel
##
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry)
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
##
##         OOB estimate of  error rate: 0.68%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    2    0    0    0 0.0005120328
## B   16 2638    4    0    0 0.0075244545
## C    0   18 2373    5    0 0.0095993322
## D    0    0   40 2208    4 0.0195381883
## E    0    0    0    5 2520 0.0019801980
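The out-of-bag (OOB) error rate and the class.error column follow directly from the OOB confusion matrix, in which rows are the true classes. The sketch below re-enters the counts from the output above and recomputes both; the object names are illustrative.

```r
# OOB confusion matrix copied from the random forest output above
# (rows = true class, columns = predicted class).
oob_cm <- matrix(
  c(3904,    2,    0,    0,    0,
      16, 2638,    4,    0,    0,
       0,   18, 2373,    5,    0,
       0,    0,   40, 2208,    4,
       0,    0,    0,    5, 2520),
  nrow = 5, byrow = TRUE,
  dimnames = list(True = LETTERS[1:5], Predicted = LETTERS[1:5])
)

# Per-class error: misclassified cases in a row divided by the row total.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)
# Overall OOB error: all misclassified cases divided by all cases.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

round(class_error, 4)
round(100 * oob_error, 2)  # percentage
```

That is 94 misclassified cases out of 13,737 training rows, which matches the reported 0.68%.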
# prediction on validation dataset
predictRandForest <- predict(modFitRandForest, newdata=valid_data)
confMatRandForest <- confusionMatrix(predictRandForest, as.factor(valid_data$classe))
confMatRandForest
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    0 1131    7    0    0
##          C    0    1 1019   13    0
##          D    0    0    0  951    2
##          E    1    0    0    0 1080
##
## Overall Statistics
##
##                Accuracy : 0.9947
##                  95% CI : (0.9925, 0.9964)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9933
##
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9930   0.9932   0.9865   0.9982
## Specificity            0.9983   0.9985   0.9971   0.9996   0.9998
## Pos Pred Value         0.9958   0.9938   0.9864   0.9979   0.9991
## Neg Pred Value         0.9998   0.9983   0.9986   0.9974   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1922   0.1732   0.1616   0.1835
## Detection Prevalence   0.2855   0.1934   0.1755   0.1619   0.1837
## Balanced Accuracy      0.9989   0.9958   0.9951   0.9931   0.9990
# plot matrix results
plot(confMatRandForest$table, col = confMatRandForest$byClass,
     main = paste("Random Forest - Accuracy =",
                  round(confMatRandForest$overall['Accuracy'], 4)))
# model fit: generalized boosted model (GBM), tuned with 5-fold cross-validation
set.seed(12345)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM <- train(classe ~ ., data=train_data, method = "gbm",
                   trControl = controlGBM, verbose = FALSE)
modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
# prediction on validation dataset
predictGBM <- predict(modFitGBM, newdata=valid_data)
confMatGBM <- confusionMatrix(predictGBM, as.factor(valid_data$classe))
confMatGBM
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1648   37    0    0    1
##          B   20 1082   27    7    9
##          C    4   18  988   26    6
##          D    2    2    7  924   15
##          E    0    0    4    7 1051
##
## Overall Statistics
##
##                Accuracy : 0.9674
##                  95% CI : (0.9625, 0.9718)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9587
##
##  Mcnemar's Test P-Value : 1.767e-05
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9845   0.9500   0.9630   0.9585   0.9713
## Specificity            0.9910   0.9867   0.9889   0.9947   0.9977
## Pos Pred Value         0.9775   0.9450   0.9482   0.9726   0.9896
## Neg Pred Value         0.9938   0.9880   0.9922   0.9919   0.9936
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2800   0.1839   0.1679   0.1570   0.1786
## Detection Prevalence   0.2865   0.1946   0.1771   0.1614   0.1805
## Balanced Accuracy      0.9877   0.9683   0.9759   0.9766   0.9845
# plot matrix results
plot(confMatGBM$table, col = confMatGBM$byClass,
     main = paste("GBM - Accuracy =", round(confMatGBM$overall['Accuracy'], 4)))
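To compare the three models at a glance, the validation accuracies reported in the outputs above can be collected in one table, together with the estimated out-of-sample error (1 minus accuracy). The values in this sketch are typed in from those outputs rather than read from the fitted objects, and the object names are illustrative.

```r
# Validation accuracies copied from the three confusionMatrix outputs above.
validation_accuracy <- c(
  "Decision Tree" = 0.7582,
  "Random Forest" = 0.9947,
  "GBM"           = 0.9674
)

comparison <- data.frame(
  model               = names(validation_accuracy),
  accuracy            = unname(validation_accuracy),
  out_of_sample_error = 1 - unname(validation_accuracy)
)

# Sort from most to least accurate and pick the best model.
comparison[order(-comparison$accuracy), ]
best_model <- names(which.max(validation_accuracy))
best_model
```

In a live session the same numbers could be pulled from each object with `$overall['Accuracy']`, as the plot titles above already do.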
The validation results show that the random forest model has the highest accuracy (0.9947, versus 0.9674 for the GBM and 0.7582 for the decision tree), which corresponds to an estimated out-of-sample error of about 0.53%. Hence, the random forest model is applied to predict the 20 quiz results using the testing dataset, as shown below.
predictTEST <- predict(modFitRandForest, newdata=testing)
predictTEST
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E