In this document, we predict how six participants perform various types of exercises, as described in the Background section, by modelling the “classe” variable in the training dataset. The resulting machine learning algorithm is then applied to the test dataset, and the predictions are submitted to the online Course Project Prediction Quiz.
From the dataset’s authors’ website we learn how the data was gathered. An excerpt reads:
“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg)."
Full source:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Advances in Artificial Intelligence. Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin/Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
The environment is first cleared of any previously assigned variables, the appropriate libraries are loaded into RStudio, and a seed is set for reproducibility.
rm(list=ls())  # clear the workspace of any previous objects
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(gbm)
set.seed(291)  # fix the random seed for reproducibility
The datasets are then downloaded, and the training set is divided with a 70/30 split into a training subset and a testing subset. The downloaded test set is left untouched, reserved only for the predictions for the aforementioned quiz.
UrlTrain<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
UrlTest<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- read.csv(url(UrlTrain))
test <- read.csv(url(UrlTest))
InTrain <- createDataPartition(train$classe, p=0.7, list=FALSE)  # stratified 70/30 split on classe
TrainSet <- train[InTrain, ]
TestSet <- train[-InTrain, ]
dim(TrainSet)
## [1] 13737 160
dim(TestSet)
## [1] 5885 160
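As a quick sanity check (our own addition), createDataPartition() performs a stratified split, so the class proportions should be nearly identical in the two subsets:
round(prop.table(table(TrainSet$classe)), 3)  # class proportions in the training subset
round(prop.table(table(TestSet$classe)), 3)   # should closely match the line above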
Given the dimensions of the datasets (160 variables), we decide to clean them up by removing variables with:
1. Near Zero Variance
2. Mostly NA values
3. Identification values only (the first five columns)
nzv <- nearZeroVar(TrainSet)  # indices of near-zero-variance predictors
TrainSet <- TrainSet[, -nzv]
TestSet <- TestSet[, -nzv]
dim(TrainSet)
## [1] 13737 102
dim(TestSet)
## [1] 5885 102
NAs <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95  # flag columns that are more than 95% NA
TrainSet <- TrainSet[, !NAs]
TestSet <- TestSet[, !NAs]
dim(TrainSet)
## [1] 13737 59
dim(TestSet)
## [1] 5885 59
TrainSet <- TrainSet[, -(1:5)]  # drop the identification columns (row index, user name, timestamps)
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
## [1] 13737 54
dim(TestSet)
## [1] 5885 54
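A final check confirms that no missing values survive the clean-up (for this dataset the remaining columns happen to be complete):
sum(is.na(TrainSet))  # expected to be 0 after dropping the mostly-NA columns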
With more manageable datasets, we now perform a correlation analysis between the variables to see whether any strong relationships stand out.
CorAn <- cor(TrainSet[, -54])  # correlation matrix of the 53 predictors (classe excluded)
corrplot(CorAn, order = "FPC", method = "color", type = "lower",
tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The more correlated two variables are, the more saturated the colour in the matrix. If we exclude the trivial correlations (e.g. accel_belt_z with itself, along the diagonal), there aren’t many other correlations of note, so we keep all the remaining predictors.
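To quantify this rather than eyeball the plot, caret’s findCorrelation() can list the predictors above a chosen cutoff; a minimal sketch, where the 0.75 cutoff is an arbitrary choice of ours:
HighCor <- findCorrelation(CorAn, cutoff = 0.75)  # column indices with |r| above the cutoff
names(TrainSet)[HighCor]                          # names of the highly correlated predictors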
For this assignment, we will build models with three different methods: Random Forest, Decision Tree, and Generalised Boosted Model (GBM). The three models will then be run on the testing subset, and the one with the highest accuracy will be used on the test dataset for the quiz. We will also include a confusion matrix at the end of each model to help visualise its accuracy.
First, we fit the Random Forest model, using 3-fold cross-validation to choose the tuning parameter.
set.seed(291)
ControlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
ModFitRF <- train(classe ~ ., data=TrainSet, method="rf",
trControl=ControlRF)
ModFitRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.24%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 7 2648 3 0 0 0.0037622272
## C 0 7 2389 0 0 0.0029215359
## D 0 0 9 2243 0 0.0039964476
## E 0 1 0 5 2519 0.0023762376
Then, we run the model on the testing subset.
PredRF <- predict(ModFitRF, newdata=TestSet)
ConfMatRF <- confusionMatrix(PredRF, TestSet$classe)
ConfMatRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 1 0 0 0
## B 0 1137 1 0 0
## C 0 0 1025 6 0
## D 0 1 0 958 8
## E 0 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9954, 0.9983)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9963
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9990 0.9938 0.9926
## Specificity 0.9998 0.9998 0.9988 0.9982 1.0000
## Pos Pred Value 0.9994 0.9991 0.9942 0.9907 1.0000
## Neg Pred Value 1.0000 0.9996 0.9998 0.9988 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1742 0.1628 0.1825
## Detection Prevalence 0.2846 0.1934 0.1752 0.1643 0.1825
## Balanced Accuracy 0.9999 0.9990 0.9989 0.9960 0.9963
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatRF$table, col = ConfMatRF$byClass,
main = paste("Random Forest - Accuracy =",
round(ConfMatRF$overall['Accuracy'], 4)))
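To see which sensor readings drive the Random Forest’s predictions, caret’s varImp() ranks the predictors; a minimal sketch (showing the top 20 is our arbitrary choice):
ImpRF <- varImp(ModFitRF)  # variable importance from the fitted caret model
plot(ImpRF, top = 20)      # plot the 20 most influential predictors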
Next, we fit the Decision Tree model.
set.seed(291)
ModFitDT <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(ModFitDT)
Then, we run the model on the testing subset.
PredDT <- predict(ModFitDT, newdata=TestSet, type="class")
ConfMatDT <- confusionMatrix(PredDT, TestSet$classe)
ConfMatDT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1499 253 39 72 21
## B 32 633 32 17 17
## C 13 122 888 99 20
## D 86 121 57 686 149
## E 44 10 10 90 875
##
## Overall Statistics
##
## Accuracy : 0.7784
## 95% CI : (0.7676, 0.789)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7189
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8955 0.5558 0.8655 0.7116 0.8087
## Specificity 0.9086 0.9794 0.9477 0.9161 0.9679
## Pos Pred Value 0.7956 0.8659 0.7776 0.6242 0.8503
## Neg Pred Value 0.9563 0.9018 0.9709 0.9419 0.9574
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2547 0.1076 0.1509 0.1166 0.1487
## Detection Prevalence 0.3201 0.1242 0.1941 0.1867 0.1749
## Balanced Accuracy 0.9020 0.7676 0.9066 0.8138 0.8883
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatDT$table, col = ConfMatDT$byClass,
main = paste("Decision Tree - Accuracy =",
round(ConfMatDT$overall['Accuracy'], 4)))
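The tree’s accuracy might improve slightly with pruning at the complexity parameter that minimises the cross-validated error; a minimal sketch using rpart’s own cp table (ModFitDTPruned is a hypothetical name of ours):
printcp(ModFitDT)  # cross-validated error for each complexity parameter (cp)
BestCP <- ModFitDT$cptable[which.min(ModFitDT$cptable[, "xerror"]), "CP"]
ModFitDTPruned <- prune(ModFitDT, cp = BestCP)  # prune at the cp minimising xerror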
Lastly, we fit the Generalised Boosted Model, using 5-fold cross-validation.
set.seed(291)
ControlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
ModFitGBM <- train(classe ~ ., data=TrainSet, method = "gbm",
trControl = ControlGBM, verbose = FALSE)
ModFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
Then, we run the model on the testing subset.
PredGBM <- predict(ModFitGBM, newdata=TestSet)
ConfMatGBM <- confusionMatrix(PredGBM, TestSet$classe)
ConfMatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1669 9 0 1 0
## B 5 1123 7 2 0
## C 0 5 1016 15 1
## D 0 2 2 946 20
## E 0 0 1 0 1061
##
## Overall Statistics
##
## Accuracy : 0.9881
## 95% CI : (0.985, 0.9907)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9860 0.9903 0.9813 0.9806
## Specificity 0.9976 0.9971 0.9957 0.9951 0.9998
## Pos Pred Value 0.9940 0.9877 0.9797 0.9753 0.9991
## Neg Pred Value 0.9988 0.9966 0.9979 0.9963 0.9956
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1908 0.1726 0.1607 0.1803
## Detection Prevalence 0.2853 0.1932 0.1762 0.1648 0.1805
## Balanced Accuracy 0.9973 0.9915 0.9930 0.9882 0.9902
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatGBM$table, col = ConfMatGBM$byClass,
main = paste("GBM - Accuracy =", round(ConfMatGBM$overall['Accuracy'], 4)))
The accuracies of the three models on the testing subset are:
1. Random Forest: 0.9971
2. Decision Tree: 0.7784
3. Generalised Boosted Model: 0.9881
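The same comparison can be pulled directly from the confusion matrix objects; a minimal sketch:
Accuracies <- c(RF  = ConfMatRF$overall[["Accuracy"]],
                DT  = ConfMatDT$overall[["Accuracy"]],
                GBM = ConfMatGBM$overall[["Accuracy"]])
round(sort(Accuracies, decreasing = TRUE), 4)  # best model first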
Since the Random Forest model is the most accurate, with an expected out-of-sample error of roughly 1 - 0.9971 = 0.0029 (0.29%), we apply it to the test dataset to predict the answers needed for the aforementioned quiz.
PredictTest <- predict(ModFitRF, newdata=test)
PredictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
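For submission, the predictions can be written to individual text files, one per quiz question; a minimal sketch with a hypothetical helper (WritePredictions and the one-file-per-question format are our own assumptions, not part of the course tooling):
WritePredictions <- function(preds) {
  # write each prediction to its own file, e.g. problem_id_1.txt
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
WritePredictions(PredictTest)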