Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, we will use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. The goal of this project is to predict the manner in which they did the exercise. We will also use our prediction model to predict 20 different test cases.
Our data consists of a training dataset and a test dataset, courtesy of “Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements.”
# Loading of Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(rpart)
library(rpart.plot)
library(RColorBrewer)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:rattle':
##
## importance
## The following object is masked from 'package:ggplot2':
##
## margin
library(corrplot)
## corrplot 0.84 loaded
library(gbm)
## Loaded gbm 2.1.5
# Loading the Datasets
train_in <- read.csv('./pml-training.csv', header=T)
test_in <- read.csv('./pml-testing.csv', header=T)
dim(train_in)
## [1] 19622 160
dim(test_in)
## [1] 20 160
# Cleaning the Datasets
trainData <- train_in[, colSums(is.na(train_in)) == 0]
testData <- test_in[, colSums(is.na(test_in)) == 0]
trainData <- trainData[, -c(1:7)]
testData <- testData[, -c(1:7)]
dim(trainData)
## [1] 19622 86
dim(testData)
## [1] 20 53
We split the training dataset (trainData) into 70% for training and 30% for cross validation (validData). This will also help us to determine out-of-sample errors. Thereafter, we will use our prediction model to predict 20 different test cases using our test dataset (testData).
# Splitting the Training Dataset
set.seed(1234)
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
trainData <- trainData[inTrain, ]
validData <- trainData[-inTrain, ]
dim(trainData)
## [1] 13737 86
dim(validData)
## [1] 4123 86
# Removing Variables that have near zero variances
NZV <- nearZeroVar(trainData)
trainData <- trainData[, -NZV]
validData <- validData[, -NZV]
dim(trainData)
## [1] 13737 53
dim(validData)
## [1] 4123 53
# Plotting a Correlation Plot for Training Data
cor_mat <- cor(trainData[, -53])
corrplot(cor_mat, order = "FPC", method = "color",
type = "upper", tl.cex = 0.8, tl.col = rgb(0, 0, 0))
In the correlation plot shown above, the variables that are highly correlated are highlighted at the dark blue intersections. We use a threshold value of 0.75 to determine these highly correlated variables.
# Finding Variables that are Highly Correlated in Training Data
highlyCorrelated = findCorrelation(cor_mat, cutoff=0.75)
names(trainData)[highlyCorrelated]
## [1] "accel_belt_z" "roll_belt" "accel_belt_y"
## [4] "total_accel_belt" "accel_dumbbell_z" "accel_belt_x"
## [7] "pitch_belt" "magnet_dumbbell_x" "accel_dumbbell_y"
## [10] "magnet_dumbbell_y" "accel_dumbbell_x" "accel_arm_x"
## [13] "accel_arm_z" "magnet_arm_y" "magnet_belt_z"
## [16] "accel_forearm_y" "gyros_forearm_y" "gyros_dumbbell_x"
## [19] "gyros_dumbbell_z" "gyros_arm_x"
For this project, we will use 3 different algorithms for prediction of the outcome. The algorithms are as follows:
# Plotting the Classification Tree Prediction Model with Training Data
set.seed(12345)
decisionTreeMod1 <- rpart(classe ~ ., data=trainData, method="class")
fancyRpartPlot(decisionTreeMod1)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
Next, we cross validate our classification tree prediction model with our validation dataset (validData), to determine the accuracy of this prediction model.
# Cross Validating the Classification Tree Prediction Model with Validation Data
predictTreeMod1 <- predict(decisionTreeMod1, validData, type = "class")
cmtree <- confusionMatrix(predictTreeMod1, validData$classe)
cmtree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1067 105 9 24 9
## B 40 502 59 63 77
## C 28 90 611 116 86
## D 11 49 41 423 41
## E 19 41 18 46 548
##
## Overall Statistics
##
## Accuracy : 0.7642
## 95% CI : (0.751, 0.7771)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7015
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9159 0.6379 0.8279 0.6295 0.7201
## Specificity 0.9503 0.9284 0.9055 0.9589 0.9631
## Pos Pred Value 0.8789 0.6775 0.6563 0.7487 0.8155
## Neg Pred Value 0.9663 0.9157 0.9602 0.9300 0.9383
## Prevalence 0.2826 0.1909 0.1790 0.1630 0.1846
## Detection Rate 0.2588 0.1218 0.1482 0.1026 0.1329
## Detection Prevalence 0.2944 0.1797 0.2258 0.1370 0.1630
## Balanced Accuracy 0.9331 0.7831 0.8667 0.7942 0.8416
# Plotting the Matrix Results
plot(cmtree$table, col = cmtree$byClass, main = paste("Decision Tree: Accuracy =", round(cmtree$overall['Accuracy'], 4)))
From the decision tree shown above, the accuracy of our classification tree prediction model is 0.7642 and therefore, its out-of-sample error is 0.2358.
# Plotting the Random Decision Forests Prediction Model with Training Data
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modRF1 <- train(classe ~ ., data=trainData, method="rf", trControl=controlRF)
modRF1$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.7%
## Confusion matrix:
## A B C D E class.error
## A 3902 3 0 0 1 0.001024066
## B 19 2634 5 0 0 0.009029345
## C 0 17 2369 10 0 0.011268781
## D 0 1 26 2224 1 0.012433393
## E 0 2 5 6 2512 0.005148515
plot(modRF1)
Next, we cross validate our random decision forests prediction model with our validation dataset (validData), to determine the accuracy of this prediction model.
# Cross Validating the Random Decision Forests Prediction Model with Validation Data
predictRF1 <- predict(modRF1, newdata=validData)
cmrf <- confusionMatrix(predictRF1, validData$classe)
cmrf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1165 0 0 0 0
## B 0 787 0 0 0
## C 0 0 738 0 0
## D 0 0 0 672 0
## E 0 0 0 0 761
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.000 1.000 1.0000
## Specificity 1.0000 1.0000 1.000 1.000 1.0000
## Pos Pred Value 1.0000 1.0000 1.000 1.000 1.0000
## Neg Pred Value 1.0000 1.0000 1.000 1.000 1.0000
## Prevalence 0.2826 0.1909 0.179 0.163 0.1846
## Detection Rate 0.2826 0.1909 0.179 0.163 0.1846
## Detection Prevalence 0.2826 0.1909 0.179 0.163 0.1846
## Balanced Accuracy 1.0000 1.0000 1.000 1.000 1.0000
# Plotting the Matrix Results
plot(cmrf$table, col = cmrf$byClass, main = paste("Random Decision Forests Confusion Matrix: Accuracy =", round(cmrf$overall['Accuracy'], 4)))
From the random decision forests confusion matrix shown above, the accuracy of our random decision forests prediction model is 1 and therefore, its out-of-sample error is 0.
# Setting Up the Generalized Boosted Models Prediction Model with Training Data
set.seed(12345)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modGBM <- train(classe ~ ., data=trainData, method = "gbm", trControl = controlGBM, verbose = FALSE)
modGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
print(modGBM)
## Stochastic Gradient Boosting
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times)
## Summary of sample sizes: 10990, 10990, 10989, 10991, 10988
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.7521285 0.6858434
## 1 100 0.8227397 0.7756753
## 1 150 0.8522224 0.8130469
## 2 50 0.8564452 0.8181267
## 2 100 0.9059465 0.8809760
## 2 150 0.9301168 0.9115592
## 3 50 0.8969931 0.8695557
## 3 100 0.9392159 0.9230740
## 3 150 0.9587251 0.9477728
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
## 3, shrinkage = 0.1 and n.minobsinnode = 10.
# Cross Validating the Generalized Boosted Models Prediction Model with Validation Data
predictGBM <- predict(modGBM, newdata=validData)
cmGBM <- confusionMatrix(predictGBM, validData$classe)
cmGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1155 20 0 0 1
## B 9 754 17 5 6
## C 1 12 713 16 3
## D 0 1 6 647 8
## E 0 0 2 4 743
##
## Overall Statistics
##
## Accuracy : 0.9731
## 95% CI : (0.9677, 0.9778)
## No Information Rate : 0.2826
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.966
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9914 0.9581 0.9661 0.9628 0.9763
## Specificity 0.9929 0.9889 0.9905 0.9957 0.9982
## Pos Pred Value 0.9821 0.9532 0.9570 0.9773 0.9920
## Neg Pred Value 0.9966 0.9901 0.9926 0.9928 0.9947
## Prevalence 0.2826 0.1909 0.1790 0.1630 0.1846
## Detection Rate 0.2801 0.1829 0.1729 0.1569 0.1802
## Detection Prevalence 0.2852 0.1919 0.1807 0.1606 0.1817
## Balanced Accuracy 0.9922 0.9735 0.9783 0.9792 0.9873
From the generalized boosted models confusion matrix shown above, the accuracy of our generalized boosted models prediction model is 0.9731 and therefore, its out-of-sample error is 0.0269.
The accuracy values of the 3 prediction models are as follows:
From this comparison, we concluded that the Random Decision Forests Prediction Model is the best for our analysis.
# Using Random Decision Forests Prediction Model on Test Data
Results <- predict(modRF1, newdata=testData)
Results
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The generated output, “Results” will be used to answer the “Course Project Prediction Quiz”.