I. Overview
This document is the final report of the Peer Assessment project from Coursera’s Practical Machine Learning course, part of the Data Science Specialization. It was built in RStudio using knitr and is meant to be published in HTML format. The analysis is the basis for the course quiz and a prediction assignment writeup. The main goal of the project is to predict the manner in which 6 participants performed an exercise, as described below. This is the “classe” variable in the training set. The machine learning algorithm described here is applied to the 20 test cases available in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from http://groupware.les.inf.puc-rio.br/har. Full source:
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. “Qualitative Activity Recognition of Weight Lifting Exercises”. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.
My special thanks to the above-mentioned authors for generously allowing their data to be used for this kind of assignment.
A short description of the dataset’s content from the authors’ website:
“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg).”
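# load the packages used throughout the analysis
library(caret)     # createDataPartition, nearZeroVar, train, confusionMatrix
library(rpart)     # decision tree
library(rattle)    # fancyRpartPlot
library(corrplot)  # correlation plot
# set the URL for the download
UrlTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
UrlTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
# (stringsAsFactors = TRUE keeps classe a factor in R >= 4.0, as it was by default in older R)
training <- read.csv(url(UrlTrain), stringsAsFactors = TRUE)
testing <- read.csv(url(UrlTest), stringsAsFactors = TRUE)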
# create a partition with the training dataset
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
TrainSet <- training[inTrain, ]
TestSet <- training[-inTrain, ]
dim(TrainSet)
## [1] 13737 160
dim(TestSet)
## [1] 5885 160
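For a factor outcome, createDataPartition samples within each class, so both parts keep roughly the same class proportions. The idea can be sketched in base R as below. This is only an illustration on hypothetical toy labels (the names `labels`, `idx_by_class` and `in_train` are not from the report), not caret's actual implementation.

```r
# Toy labels standing in for training$classe (hypothetical data)
set.seed(1)
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Group the row indices by class, then sample 70% of the indices in each group
idx_by_class <- split(seq_along(labels), labels)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_labels <- labels[in_train]
test_labels  <- labels[-in_train]

# Class proportions are the same in both parts (0.5, 0.3, 0.2)
prop.table(table(train_labels))
prop.table(table(test_labels))
```

Sampling within each class matters most when some classes are rare, since a plain random split could leave too few of them in the validation part.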
Both datasets created by the partition have 160 variables. Many of those variables are mostly NA and can be removed with the cleaning procedures below. The Near Zero Variance (NZV) variables and the ID variables are removed as well.
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(TrainSet)
TrainSet <- TrainSet[, -NZV]
TestSet <- TestSet[, -NZV]
dim(TrainSet)
## [1] 13737 106
dim(TestSet)
## [1] 5885 106
# remove variables that are mostly NA
AllNA <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95
TrainSet <- TrainSet[, AllNA==FALSE]
TestSet <- TestSet[, AllNA==FALSE]
dim(TrainSet)
## [1] 13737 59
dim(TestSet)
## [1] 5885 59
# remove identification only variables (columns 1 to 5)
TrainSet <- TrainSet[, -(1:5)]
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
## [1] 13737 54
dim(TestSet)
## [1] 5885 54
With the cleaning process above, the number of variables for the analysis has been reduced to 54: 53 predictors plus the classe outcome.
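The mostly-NA filter used above can be seen in isolation on a small example. The sketch below uses a hypothetical toy data frame (`toy`, with made-up columns `a`, `b`, `c`) and the same 95% threshold as the report. The point to notice is that the filter is computed once and then reused, which is why the report derives it from TrainSet and applies the same selection to TestSet.

```r
# Toy data frame (hypothetical): column b is 96% NA, column c is 50% NA
toy <- data.frame(
  a = 1:100,
  b = c(rep(NA, 96), 1:4),
  c = c(rep(NA, 50), 1:50)
)

# Fraction of missing values in each column
na_frac <- sapply(toy, function(x) mean(is.na(x)))

# Drop columns that are more than 95% NA, keep the rest
mostly_na <- na_frac > 0.95
toy_clean <- toy[, mostly_na == FALSE, drop = FALSE]

names(toy_clean)  # "a" "c"
```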
corMatrix <- cor(TrainSet[, -54])
corrplot(corMatrix, order = "FPC", method = "color", type = "lower", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The highly correlated variables are shown in dark colors in the graph above. To make the analysis even more compact, a PCA (Principal Components Analysis) could be performed as a pre-processing step on the datasets. Nevertheless, as only a few variables are strongly correlated, this step will not be applied for this assignment.
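Besides reading the plot, the strongly correlated pairs can be listed directly from a correlation matrix. The following is a minimal base R sketch on simulated data. The toy variables `x1`, `x2`, `x3` and the 0.8 cutoff are assumptions chosen for illustration, not values from the report.

```r
# Simulated predictors (hypothetical): x2 is nearly a copy of x1, x3 is independent
set.seed(42)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.1),
  x3 = rnorm(200)
)

cm <- cor(toy)

# Restrict to the upper triangle so each pair is reported only once
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  var1 = rownames(cm)[high[, "row"]],
  var2 = colnames(cm)[high[, "col"]],
  r    = cm[high]
)
pairs
```

The same lines could be run on corMatrix from the report to count how many pairs exceed a chosen cutoff before deciding whether PCA is worth the extra step.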
# model fit
set.seed(12345)
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=TrainSet, method="rf",
                          trControl=controlRF)
# prediction on Test dataset
predictRandForest <- predict(modFitRandForest, newdata=TestSet)
confMatRandForest <- confusionMatrix(predictRandForest, TestSet$classe)
confMatRandForest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 5 0 0 0
## B 0 1133 4 0 0
## C 0 1 1022 8 0
## D 0 0 0 956 3
## E 0 0 0 0 1079
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9946, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9947 0.9961 0.9917 0.9972
## Specificity 0.9988 0.9992 0.9981 0.9994 1.0000
## Pos Pred Value 0.9970 0.9965 0.9913 0.9969 1.0000
## Neg Pred Value 1.0000 0.9987 0.9992 0.9984 0.9994
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1925 0.1737 0.1624 0.1833
## Detection Prevalence 0.2853 0.1932 0.1752 0.1630 0.1833
## Balanced Accuracy 0.9994 0.9969 0.9971 0.9955 0.9986
# plot matrix results
plot(confMatRandForest$table, col = confMatRandForest$byClass, main = paste("Random Forest - Accuracy =",round(confMatRandForest$overall['Accuracy'], 4)))
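The accuracy and sensitivity figures reported by confusionMatrix can be cross-checked by hand from the table of counts. The sketch below re-enters the Random Forest counts printed above into a plain matrix (the object names `cm`, `accuracy`, `oos_error` and `sensitivity` are chosen here for illustration) and recomputes the statistics in base R.

```r
# Counts copied from the Random Forest confusion matrix printed above
# (rows = prediction, columns = reference)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(1674,    5,    0,   0,    0,
       0, 1133,    4,   0,    0,
       0,    1, 1022,   8,    0,
       0,    0,    0, 956,    3,
       0,    0,    0,   0, 1079),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
oos_error   <- 1 - accuracy             # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)   # per class: correct / actual members

round(accuracy, 4)     # 0.9964
round(sensitivity, 4)  # 1.0000 0.9947 0.9961 0.9917 0.9972
```

Here 5864 of the 5885 validation cases lie on the diagonal, which gives an estimated out-of-sample error of about 0.36%.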
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(modFitDecTree)
# prediction on Test dataset
predictDecTree <- predict(modFitDecTree, newdata=TestSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, TestSet$classe)
# plot matrix results
plot(confMatDecTree$table, col = confMatDecTree$byClass, main = paste("Decision Tree - Accuracy =",round(confMatDecTree$overall['Accuracy'], 4)))
# model fit
set.seed(12345)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM <- train(classe ~ ., data=TrainSet, method = "gbm",
                   trControl = controlGBM, verbose = FALSE)
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
## Loading required package: plyr
modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 43 had non-zero influence.
# prediction on Test dataset
predictGBM <- predict(modFitGBM, newdata=TestSet)
confMatGBM <- confusionMatrix(predictGBM, TestSet$classe)
confMatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 9 0 3 0
## B 2 1117 20 2 1
## C 0 11 1004 14 3
## D 1 2 2 944 11
## E 1 0 0 1 1067
##
## Overall Statistics
##
## Accuracy : 0.9859
## 95% CI : (0.9825, 0.9888)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9822
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9807 0.9786 0.9793 0.9861
## Specificity 0.9972 0.9947 0.9942 0.9967 0.9996
## Pos Pred Value 0.9929 0.9781 0.9729 0.9833 0.9981
## Neg Pred Value 0.9990 0.9954 0.9955 0.9959 0.9969
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1898 0.1706 0.1604 0.1813
## Detection Prevalence 0.2858 0.1941 0.1754 0.1631 0.1816
## Balanced Accuracy 0.9974 0.9877 0.9864 0.9880 0.9929
# plot matrix results
plot(confMatGBM$table, col = confMatGBM$byClass, main = paste("GBM - Accuracy =", round(confMatGBM$overall['Accuracy'], 4)))
V. Applying the Selected Model to the Test Data
The accuracies of the 3 classification methods above, as reported on the validation set, are:
Random Forest : 0.9964
Decision Tree : 0.7368
GBM : 0.9859
Since Random Forest is the most accurate of the three, it will be applied to predict the 20 quiz results (testing dataset) as shown below.
predictTEST <- predict(modFitRandForest, newdata=testing)
predictTEST
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E