This document is the final report of the Peer Assessment project from Coursera’s course Practical Machine Learning, part of the Specialization in Data Science. It was built in RStudio using knitr and is meant to be published in HTML format.
This analysis is the basis for the course quiz and a prediction assignment write-up. The main goal of the project is to predict the manner in which 6 participants performed an exercise, as described below (this is the classe variable in the training set). The machine learning algorithm described here is applied to the 20 test cases in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.
In this project, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts in five different ways: exactly according to specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E). Class A corresponds to the specified execution of the exercise while the other 4 classes correspond to common mistakes.
More information is available from the website here: http://groupware.les.inf.puc-rio.br/har
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Here, we download the training and testing datasets from the URLs above and read them into R. The required libraries are loaded as well.
training.Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testing.Url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#download.file(training.Url, "./pml-training.csv")
#download.file(testing.Url, "./pml-testing.csv")
training.df <- read.csv("./pml-training.csv")
testing.df <- read.csv("./pml-testing.csv")
library(knitr)
library(caret)        # also attaches ggplot2 and lattice
library(rpart)
library(rpart.plot)
library(rattle)       # version 5.4.0; also attaches tibble and bitops
library(randomForest) # version 4.6-14; masks importance() from rattle and margin() from ggplot2
library(corrplot)     # version 0.90
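The download lines above are commented out, so the script assumes both CSV files already exist locally. A minimal sketch of a guard that downloads only when a file is missing is shown below; `fetch_if_missing` is a hypothetical helper, not part of the original analysis. The second part uses an inline toy CSV (not the project data) to show how the `na.strings` argument of `read.csv` controls which cells are read as NA. Whether the raw files encode missing values as empty strings or spreadsheet error markers is an assumption that is not checked in this report.

```r
# Hypothetical helper: download a file only if it is not already on disk.
fetch_if_missing <- function(url, dest) {
  if (!file.exists(dest)) {
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Toy CSV (not the project data): one empty cell and one "#DIV/0!" marker.
toy_csv <- "a,b\n1,x\n,#DIV/0!\n3,z"

default_read <- read.csv(text = toy_csv)
strict_read  <- read.csv(text = toy_csv, na.strings = c("NA", "", "#DIV/0!"))

sum(is.na(default_read))  # only the empty numeric cell counts as NA
sum(is.na(strict_read))   # the "#DIV/0!" marker is now NA as well
```

With a guard like this, the two `download.file` calls would not need to be commented out after the first run.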
The training data contains a large number of missing (NA) values, which can affect our model. We count them and then remove the columns that consist mostly of NAs.
# Total number of NA values in the dataset
sum(is.na(training.df))
## [1] 1287472
# Keep only the columns in which fewer than 90% of the values are NA
training.df <- training.df[, colMeans(is.na(training.df)) < 0.9]
# Remove the first seven columns (row index, user name, timestamps, window markers), which are metadata rather than sensor readings
training.df <- training.df[, -c(1:7)]
dim(training.df)
## [1] 19622 86
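The column filter above relies on `colMeans(is.na(...))` returning the fraction of missing values per column. The following sketch shows the same idea on a small made-up data frame; the column names and values are purely illustrative.

```r
# Toy data frame (not the project data): "mostly_na" is 95% missing.
set.seed(1)
toy <- data.frame(
  complete  = rnorm(20),
  some_na   = c(NA, rnorm(19)),
  mostly_na = c(rep(NA, 19), 1)
)

na_share <- colMeans(is.na(toy))  # fraction of NA values per column
na_share

# Keep columns with fewer than 90% missing values;
# drop = FALSE keeps the result a data frame even if one column survives.
toy_clean <- toy[, na_share < 0.9, drop = FALSE]
names(toy_clean)
```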
Next, we remove the variables with near-zero variance, since they carry almost no information for distinguishing the classes.
nvz <- nearZeroVar(training.df)
training.df <- training.df[,-nvz]
dim(training.df)
## [1] 19622 53
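To make the near-zero-variance step less of a black box, here is a simplified base R sketch of the two metrics that, as far as I understand caret's documentation, `nearZeroVar` combines: the frequency ratio of the two most common values and the percentage of unique values. The thresholds assume caret's documented defaults (`freqCut = 95/5`, `uniqueCut = 10`); `is_nzv` is a hypothetical helper and not caret's actual implementation.

```r
# Hypothetical helper, assuming caret's default thresholds.
is_nzv <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by the count of the runner-up
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pct_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

# Toy data (not the project data)
toy <- data.frame(
  varied   = seq_len(100),         # every value distinct
  constant = rep(0, 100),          # zero variance
  skewed   = c(rep(0, 98), 1, 2)   # dominated by a single value
)

flags <- sapply(toy, is_nzv)
flags

# Guard against an empty result: toy[, -integer(0)] would drop every column.
drop_idx <- which(flags)
toy_kept <- if (length(drop_idx) > 0) toy[, -drop_idx, drop = FALSE] else toy
names(toy_kept)
```

The guard at the end matters for the pattern `df[, -nvz]` in general: if no column is flagged, negative indexing with an empty vector selects nothing.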
We partition the cleaned training data into two parts: a training set (70% of the rows) for the modeling process and a test set (the remaining 30%) for validation.
set.seed(12345)
# Partitioning based on the response column
inTrain <- createDataPartition(training.df$classe, p=0.7, list=FALSE)
trainSet <- training.df[inTrain,]
testSet <- training.df[-inTrain,]
dim(trainSet)
## [1] 13737 53
dim(testSet)
## [1] 5885 53
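`createDataPartition` splits on the response so that the class proportions are roughly preserved in both parts. A minimal base R sketch of that idea, sampling 70% of the rows within each class, could look as follows. `stratified_split` is a hypothetical helper and the labels are made up; caret's own procedure may differ in detail.

```r
# Hypothetical helper: sample a fraction p of the row indices within each class.
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(12345)
# Toy labels (not the project data) with unequal class sizes
y <- factor(rep(c("A", "B", "C"), times = c(100, 50, 50)))
in_train <- stratified_split(y, p = 0.7)

prop.table(table(y))            # class shares in the full data
prop.table(table(y[in_train]))  # class shares in the 70% split
```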
Before proceeding to the modeling procedures, we run a correlation analysis among the predictors, excluding the response variable (classe).
corMatrix <- cor(trainSet[, -53]) # Excluding the classe variable
corrplot(corMatrix, method = "color", type = "lower", order = "FPC",
tl.cex = 0.8, tl.col = rgb(0,0,0))
The highly correlated values are shown in dark colours in the graph above.
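A plot with 52 predictors is hard to read pair by pair, so it can help to list the strongly correlated pairs explicitly. The sketch below does this in base R on made-up data; `high_cor_pairs` and the 0.9 cutoff are illustrative assumptions, not part of the original analysis. caret also provides `findCorrelation` for a related purpose.

```r
# Hypothetical helper: list variable pairs whose absolute correlation exceeds a cutoff.
high_cor_pairs <- function(cor_mat, cutoff = 0.9) {
  # Look only above the diagonal so that each pair appears once
  hits <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
  data.frame(
    var1 = rownames(cor_mat)[hits[, "row"]],
    var2 = colnames(cor_mat)[hits[, "col"]],
    r    = cor_mat[hits],
    stringsAsFactors = FALSE
  )
}

set.seed(1)
# Toy data (not the project data): x2 is almost a copy of x1, x3 is independent
x1  <- rnorm(200)
toy <- data.frame(x1 = x1, x2 = x1 + rnorm(200, sd = 0.1), x3 = rnorm(200))

high_cor_pairs(cor(toy), cutoff = 0.9)
```

Applied to the report's data, the call would be `high_cor_pairs(corMatrix)`, assuming `corMatrix` has been computed as above.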
Three classification methods are applied to the training set, and the best one (the one with the highest accuracy on the held-out test set) is used for the quiz predictions. The methods are Random Forest, Decision Tree and Generalized Boosted Model (GBM). A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of each model.
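The Random Forest and GBM fits below are tuned with k-fold cross-validation via `trainControl` (3 folds and 5 folds respectively). As a rough illustration of what a k-fold scheme does, the following base R sketch assigns rows to folds and shows which rows would be used for fitting and for scoring in each round. It is a simplified sketch with a hypothetical `make_folds` helper and plain random assignment; caret builds its folds internally and may balance them by class.

```r
# Hypothetical helper: assign each of n rows to one of k folds at random.
make_folds <- function(n, k) {
  sample(rep(seq_len(k), length.out = n))
}

set.seed(12345)
n <- 30  # toy size, not the 13737 training rows
k <- 3
fold_id <- make_folds(n, k)

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)  # rows used to score the model
  fit_rows <- which(fold_id != fold)  # rows used to fit the model
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```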
set.seed(12345)
# model fit
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=trainSet, method="rf",
trControl=controlRF)
modFitRandForest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = min(param$mtry, ncol(x)))
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.68%
## Confusion matrix:
## A B C D E class.error
## A 3899 5 0 0 2 0.001792115
## B 19 2630 9 0 0 0.010534236
## C 0 15 2373 8 0 0.009599332
## D 0 1 21 2227 3 0.011101243
## E 0 3 4 3 2515 0.003960396
predictRandForest <- predict(modFitRandForest, newdata = testSet)
confMatRandForest <- confusionMatrix(predictRandForest, as.factor(testSet$classe))
confMatRandForest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 7 0 0 0
## B 1 1129 4 0 0
## C 1 3 1019 7 1
## D 0 0 3 956 1
## E 0 0 0 1 1080
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9929, 0.9967)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9912 0.9932 0.9917 0.9982
## Specificity 0.9983 0.9989 0.9975 0.9992 0.9998
## Pos Pred Value 0.9958 0.9956 0.9884 0.9958 0.9991
## Neg Pred Value 0.9995 0.9979 0.9986 0.9984 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1918 0.1732 0.1624 0.1835
## Detection Prevalence 0.2853 0.1927 0.1752 0.1631 0.1837
## Balanced Accuracy 0.9986 0.9951 0.9954 0.9954 0.9990
# plot matrix results
plot(confMatRandForest$table, col=confMatRandForest$byClass,
main = paste("Random Forest - Accuracy =",
round(confMatRandForest$overall["Accuracy"], 4)))
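The headline statistics reported by `confusionMatrix` can be recomputed by hand from the table, which is a useful sanity check. The sketch below copies the Random Forest confusion matrix from the output above and derives the accuracy, the estimated out-of-sample error, the per-class sensitivity and the no-information rate with base R only.

```r
# Confusion matrix copied from the Random Forest output above
# (rows = predictions, columns = reference classes)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1672,    7,    0,   0,    0,
                  1, 1129,    4,   0,    0,
                  1,    3, 1019,   7,    1,
                  0,    0,    3, 956,    1,
                  0,    0,    0,   1, 1080),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)     # share of correct predictions
oos_error   <- 1 - accuracy                # estimated out-of-sample error
sensitivity <- diag(cm) / colSums(cm)      # per-class recall
nir         <- max(colSums(cm)) / sum(cm)  # no-information rate

round(accuracy, 4)     # 0.9951
round(oos_error, 4)    # 0.0049
round(sensitivity, 4)  # matches the "Sensitivity" row above
round(nir, 4)          # 0.2845
```

The estimated out-of-sample error of the Random Forest model is therefore about 0.49%.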
# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modFitDecTree)
predictDecTree <- predict(modFitDecTree, newdata=testSet, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, as.factor(testSet$classe))
confMatDecTree
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1532 176 28 48 41
## B 54 585 57 64 76
## C 35 154 819 134 126
## D 25 76 58 631 56
## E 28 148 64 87 783
##
## Overall Statistics
##
## Accuracy : 0.7392
## 95% CI : (0.7277, 0.7503)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6692
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9152 0.51361 0.7982 0.6546 0.7237
## Specificity 0.9304 0.94711 0.9076 0.9563 0.9319
## Pos Pred Value 0.8395 0.69976 0.6459 0.7459 0.7054
## Neg Pred Value 0.9650 0.89028 0.9552 0.9339 0.9374
## Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Detection Rate 0.2603 0.09941 0.1392 0.1072 0.1331
## Detection Prevalence 0.3101 0.14206 0.2155 0.1438 0.1886
## Balanced Accuracy 0.9228 0.73036 0.8529 0.8054 0.8278
plot(confMatDecTree$table, col=confMatDecTree$byClass,
main = paste("Decision Tree - Accuracy =",
round(confMatDecTree$overall["Accuracy"], 4)))
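The Kappa statistic in the output above corrects the accuracy for the agreement expected by chance. As a worked example, the following sketch recomputes it from the Decision Tree confusion matrix using the standard Cohen's kappa formula, (observed - expected) / (1 - expected); the variable names are illustrative.

```r
# Confusion matrix copied from the Decision Tree output above
# (rows = predictions, columns = reference classes)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1532, 176,  28,  48,  41,
                 54, 585,  57,  64,  76,
                 35, 154, 819, 134, 126,
                 25,  76,  58, 631,  56,
                 28, 148,  64,  87, 783),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

n         <- sum(cm)
observed  <- sum(diag(cm)) / n                      # observed agreement (accuracy)
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2   # agreement expected by chance
kappa_hat <- (observed - expected) / (1 - expected)

round(observed, 4)   # 0.7392
round(kappa_hat, 4)  # 0.6692
```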
set.seed(12345)
controlGBM <- trainControl(method="repeatedcv", number=5, repeats=1)
modFitGBM <- train(classe ~ ., data=trainSet, method="gbm",
trControl=controlGBM, verbose=FALSE)
modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 51 had non-zero influence.
predictGBM <- predict(modFitGBM, newdata = testSet)
confMatGBM <- confusionMatrix(predictGBM, as.factor(testSet$classe))
confMatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1647 39 0 1 1
## B 19 1066 38 4 14
## C 4 33 979 38 6
## D 4 0 8 915 8
## E 0 1 1 6 1053
##
## Overall Statistics
##
## Accuracy : 0.9618
## 95% CI : (0.9565, 0.9665)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : 8.329e-08
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9839 0.9359 0.9542 0.9492 0.9732
## Specificity 0.9903 0.9842 0.9833 0.9959 0.9983
## Pos Pred Value 0.9757 0.9343 0.9236 0.9786 0.9925
## Neg Pred Value 0.9936 0.9846 0.9903 0.9901 0.9940
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2799 0.1811 0.1664 0.1555 0.1789
## Detection Prevalence 0.2868 0.1939 0.1801 0.1589 0.1803
## Balanced Accuracy 0.9871 0.9601 0.9688 0.9726 0.9858
plot(confMatGBM$table, col=confMatGBM$byClass,
main = paste("GBM - Accuracy =",
round(confMatGBM$overall["Accuracy"], 4)))
The accuracies of the three classification models on the held-out test set are:
Random Forest: 0.9951
Decision Tree: 0.7392
GBM: 0.9618
Random Forest has the highest accuracy, so it is the model applied to the 20 quiz test cases (testing.df).
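The model choice can also be made programmatically. The short sketch below takes the accuracies reported in the three confusion-matrix outputs above, ranks them and picks the best model; the vector is typed in by hand here, whereas in the actual session the values could be read from `confMatRandForest$overall["Accuracy"]` and its counterparts.

```r
# Accuracies copied from the three confusion-matrix outputs above
accuracy <- c("Random Forest" = 0.9951, "Decision Tree" = 0.7392, "GBM" = 0.9618)

ranking <- sort(accuracy, decreasing = TRUE)
data.frame(accuracy = ranking, oos_error = round(1 - ranking, 4))

best_model <- names(ranking)[1]
best_model  # "Random Forest"
```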
predictTest <- predict(modFitRandForest, newdata = testing.df)
predictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
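To keep a record of the quiz answers, the predictions can be stored together with a case number. The sketch below copies the 20 predictions from the output above into a data frame and writes it to a CSV file. The `problem_id` numbering (1 to 20 in output order) and the file name are assumptions for illustration, not a format required by the course.

```r
# Predictions copied from the output above, in test-case order
predictions <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
                 "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")

submission <- data.frame(
  problem_id = seq_along(predictions),  # assumed numbering: 1..20 in output order
  prediction = predictions,
  stringsAsFactors = FALSE
)

out_file <- file.path(tempdir(), "quiz_predictions.csv")  # hypothetical file name
write.csv(submission, out_file, row.names = FALSE)

table(submission$prediction)  # how often each class was predicted
```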