This document is the final report for the Peer Assessment project of the Practical Machine Learning course, part of the Johns Hopkins University Data Science Specialization on Coursera. It was written and coded in RStudio using knitr and published in HTML and Markdown format. The goal of this project is to predict the manner in which six participants performed the exercises. A machine learning model trained on the classe variable in the training set is then applied to the 20 test cases available in the test data.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, my goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they did the exercise.
This report describes how a model was built, how cross validation was used, what the expected out of sample error might be, and the rationale behind the choices made. The prediction model will also be used to predict 20 different test cases.
Ref: Coursera
The following packages are required to reproduce the results.
set.seed(2018)
# Load packages
library(caret)
library(randomForest)
library(rpart)
library(rpart.plot)
library(knitr)
library(rattle)
library(RColorBrewer)
library(lattice)
library(gbm)
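If any of these packages are missing, they can be installed first. A minimal sketch:
# Install any packages that are not already present (one-time setup)
pkgs <- c("caret", "randomForest", "rpart", "rpart.plot", "knitr",
          "rattle", "RColorBrewer", "lattice", "gbm")
missing <- pkgs[!pkgs %in% installed.packages()[, "Package"]]
if (length(missing) > 0) install.packages(missing)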
Source: http://groupware.les.inf.puc-rio.br/har
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Load the files, create a 70/30 data partition, and check for NAs. While reading the data, the strings "NA" and "#DIV/0!" are mapped to missing values (NA).
set.seed(123)
# Getting and Cleaning Data
# Preparing for download
Trainurl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Testurl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# download the datasets
if(!dir.exists("./data")) dir.create("./data")
if(!file.exists("./data/training.csv")){
  download.file(Trainurl, destfile = "./data/training.csv", method = "curl")
}
if(!file.exists("./data/testing.csv")){
  download.file(Testurl, destfile = "./data/testing.csv", method = "curl")
}
# Read data
training <- read.csv("./data/training.csv", na.strings = c("NA", "#DIV/0!"))
testing <- read.csv("./data/testing.csv", na.strings = c("NA", "#DIV/0!"))
In building our model, we split the training data into a true training set and a validation (test) set so that performance can be assessed on data not used for fitting; this is our cross-validation strategy. The data is partitioned into 70% for training and 30% for testing, stratified on the classe variable.
inTrain <- createDataPartition(training$classe, p=0.7, list=FALSE)
Train <- training[inTrain, ]
Test <- training[-inTrain, ]
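# Remove variables with near-zero variance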
NZV <- nearZeroVar(Train)
Train <- Train[, -NZV]
Test <- Test[, -NZV]
# Remove the first five identification columns (row index, user name, timestamps)
Train <- Train[, -(1:5)]
Test <- Test[, -(1:5)]
After these steps the training set has 13737 observations of 126 variables and the testing set has 5885 observations of 126 variables. Many of these variables (columns) still contain a large share of NAs, and the first five columns, removed above, served only to identify the observations and added little value for prediction.
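These dimensions can be verified directly:
# Check the dimensions of the partitioned data sets
dim(Train)
dim(Test)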
# Remove variables in the training set that are more than 90% NA
ToomanyNA <- sapply(Train, function(x) mean(is.na(x))) > 0.90
Train <- Train[, !ToomanyNA]
Test <- Test[, !ToomanyNA]
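As a sanity check, we can confirm that no heavily missing columns remain:
# The largest remaining fraction of NAs per column should be well below 0.90
max(sapply(Train, function(x) mean(is.na(x))))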
## Prediction with Random Forests
# Model Fit
set.seed(12345)
controlrf <- trainControl(method="repeatedcv", number=5, verboseIter=FALSE, repeats=2)
modFitrf <- train(classe ~ ., data = Train, method = "rf", trControl = controlrf)
modFitrf$finalModel
### The Random Forest model is applied to the validation set (Test) to estimate its out-of-sample performance before it is used on the 20 test cases from the original testing dataset.
# Prediction on Test
predictrf <- predict(modFitrf, newdata=Test)
confMatrf <- confusionMatrix(predictrf, factor(Test$classe))
confMatrf
# Plot
plot(confMatrf$table, col = confMatrf$byClass,
main = paste("Random Forest Accuracy =",
round(confMatrf$overall['Accuracy'], 4)))
### The predictive accuracy of the Random Forest model is excellent, at 99.8%.
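The expected out-of-sample error follows directly from the validation-set accuracy. A minimal sketch:
# Expected out-of-sample error = 1 - validation-set accuracy
oose_rf <- 1 - as.numeric(confMatrf$overall['Accuracy'])
oose_rf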
## Prediction with Decision Trees
# Model Fit
set.seed(2222)
modFitdt <- rpart(classe ~ ., data=Train, method="class")
fancyRpartPlot(modFitdt)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Predictions of the decision tree model on Test
predictdf <- predict(modFitdt, newdata=Test, type="class")
confMatdf <- confusionMatrix(predictdf, factor(Test$classe))
confMatdf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1459 86 0 13 2
## B 104 855 61 75 51
## C 0 57 856 37 3
## D 92 81 99 759 86
## E 19 60 10 80 940
##
## Overall Statistics
##
## Accuracy : 0.8274
## 95% CI : (0.8175, 0.8369)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7823
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8716 0.7507 0.8343 0.7873 0.8688
## Specificity 0.9760 0.9387 0.9800 0.9273 0.9648
## Pos Pred Value 0.9353 0.7461 0.8982 0.6795 0.8476
## Neg Pred Value 0.9503 0.9401 0.9655 0.9570 0.9703
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2479 0.1453 0.1455 0.1290 0.1597
## Detection Prevalence 0.2651 0.1947 0.1619 0.1898 0.1884
## Balanced Accuracy 0.9238 0.8447 0.9072 0.8573 0.9168
# Plot the predictive accuracy of the decision tree model.
plot(confMatdf$table, col = confMatdf$byClass,
main = paste("Decision Tree Accuracy =",
round(confMatdf$overall['Accuracy'], 4)))
### The predictive accuracy of the decision tree model is relatively low at 82.7%.
## Prediction with Generalized Boosted Regression
# Model Fit
controlgbm <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitgbm <- train(classe ~ ., data=Train, method = "gbm",
trControl = controlgbm, verbose = FALSE)
modFitgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
# Prediction on Test
predictgbm <- predict(modFitgbm, newdata=Test)
confMatgbm <- confusionMatrix(predictgbm, factor(Test$classe))
confMatgbm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 12 0 1 2
## B 4 1112 6 4 1
## C 0 13 1017 7 4
## D 0 2 3 952 8
## E 0 0 0 0 1067
##
## Overall Statistics
##
## Accuracy : 0.9886
## 95% CI : (0.9856, 0.9912)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9856
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9763 0.9912 0.9876 0.9861
## Specificity 0.9964 0.9968 0.9951 0.9974 1.0000
## Pos Pred Value 0.9911 0.9867 0.9769 0.9865 1.0000
## Neg Pred Value 0.9990 0.9943 0.9981 0.9976 0.9969
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1890 0.1728 0.1618 0.1813
## Detection Prevalence 0.2863 0.1915 0.1769 0.1640 0.1813
## Balanced Accuracy 0.9970 0.9866 0.9931 0.9925 0.9931
# Plot
plot(confMatgbm$table, col = confMatgbm$byClass,
main = paste("Generalized Boosted Regression Accuracy =", round(confMatgbm$overall['Accuracy'], 4)))
### The predictive accuracy of the generalized boosted model is relatively high, at 98.9%.
The predictive accuracies of the three models are:

- Decision Tree Model: 82.7%
- Generalized Boosted Model: 98.9%
- Random Forest Model: 99.8%

The Random Forest model has the highest accuracy and is therefore selected for the final predictions.
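As a final step, the selected Random Forest model is applied to the 20 test cases from the original testing data. A minimal sketch (predictFinal is an illustrative name):
# Predict classe for the 20 test cases in the original testing data
predictFinal <- predict(modFitrf, newdata = testing)
predictFinal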