Background
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, to predict the manner in which each lift was performed (the classe variable). More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Data
The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
#load libraries necessary for modeling
library(lattice);library(ggplot2);library(caret);library(randomForest);library(rpart);library(rpart.plot)
## Warning: package 'caret' was built under R version 3.4.4
## Warning: package 'randomForest' was built under R version 3.4.4
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
## Warning: package 'rpart.plot' was built under R version 3.4.4
#get and read data
datrain<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
datest<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#read files, treating "NA", "#DIV/0!" and blank entries as missing values
train<-read.csv(datrain,na.strings=c("NA","#DIV/0!", ""))
test<-read.csv(datest,na.strings=c("NA","#DIV/0!", ""))
#dim(train);dim(test);summary(train);summary(test)
#remove non-essential variables, i.e. username, timestamp, window, etc. cols 1-7
train<-train[,-c(1:7)]
test<-test[,-c(1:7)]
#remove columns that contain any NA values
train<-train[,colSums(is.na(train)) == 0]
test <-test[,colSums(is.na(test)) == 0]
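The cleaning steps above can be illustrated on a small example. The sketch below uses a made-up three-row CSV string (the values are hypothetical, not taken from the real data) to show how `na.strings` turns "#DIV/0!" and blank cells into NA, and how `colSums(is.na(...)) == 0` then keeps only the fully populated columns.

```r
# Hypothetical miniature of the sensor file: one complete predictor,
# one summary column that is entirely missing/invalid, and the outcome.
raw <- "roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
1.42,,B
1.48,NA,A"

# Same na.strings as in the report: all three markers become NA on read
toy <- read.csv(text = raw, na.strings = c("NA", "#DIV/0!", ""))

# Count missing values per column and keep columns with none
complete_cols <- colSums(is.na(toy)) == 0
toy_clean <- toy[, complete_cols]

names(toy_clean)
```

Only `roll_belt` and `classe` survive here, because every entry of the toy `kurtosis_roll_belt` column is read as NA.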
Split the training data into a training set (75%) and a validation set (25%) using caret's createDataPartition function.
trainPart<-createDataPartition(y=train$classe,p=.75,list=FALSE)
trainSet<-train[trainPart,]
validSet<-train[-trainPart,]
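Note that in the code above the seed is only set later, before model fitting, so the partition itself will differ from run to run. The following is a base R sketch of the idea behind a stratified 75/25 split with the seed set first. It is an illustration under assumptions, not caret's implementation: the `classe` vector below is a hypothetical stand-in for `train$classe`.

```r
# Set the seed BEFORE sampling so the split can be reproduced
set.seed(33134)

# Hypothetical outcome labels standing in for train$classe
classe <- factor(rep(c("A", "B", "C", "D", "E"), times = c(40, 20, 20, 10, 10)))

# Within each class, sample 75% of that class's row indices
rows_by_class <- split(seq_along(classe), classe)
train_rows <- sort(unlist(lapply(
  rows_by_class,
  function(rows) sample(rows, size = floor(0.75 * length(rows)))
)))

# Remaining rows form the validation set
valid_rows <- setdiff(seq_along(classe), train_rows)

# Class counts in the training portion keep the original proportions
table(classe[train_rows])
```

Sampling within each class keeps the class balance roughly equal in both subsets, which is what makes the validation accuracy comparable to the training distribution.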
Plot the frequency of each classe level in the training set.
#plot the classe variable
plot(trainSet$classe,col="pink",main="Classe within Train Set",xlab="Classe",ylab="Frequency")
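We will fit three different models (random forest, classification tree, and linear discriminant analysis) with the seed set to 33134 before training, and evaluate each on the validation set.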
set.seed(33134)
rfModel<-train(classe~.,data=trainSet,method="rf",verbose=FALSE)
rfPred<-predict(rfModel,validSet)
confusionMatrix(rfPred,validSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 3 0 0 0
## B 0 945 7 0 0
## C 0 1 847 23 0
## D 0 0 1 780 2
## E 1 0 0 1 899
##
## Overall Statistics
##
## Accuracy : 0.992
## 95% CI : (0.9891, 0.9943)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9899
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9958 0.9906 0.9701 0.9978
## Specificity 0.9991 0.9982 0.9941 0.9993 0.9995
## Pos Pred Value 0.9979 0.9926 0.9724 0.9962 0.9978
## Neg Pred Value 0.9997 0.9990 0.9980 0.9942 0.9995
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1927 0.1727 0.1591 0.1833
## Detection Prevalence 0.2849 0.1941 0.1776 0.1597 0.1837
## Balanced Accuracy 0.9992 0.9970 0.9924 0.9847 0.9986
The random forest model shows an accuracy of 99.2% on the validation set, with a 95% confidence interval of (0.9891, 0.9943).
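To see where the accuracy and its interval come from, the confusion matrix printed above can be re-entered by hand. The sketch below assumes the interval is an exact binomial one, which is what the caret documentation describes for `confusionMatrix` (it uses `binom.test`), so it should closely reproduce the reported values.

```r
# Random forest confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(1394,   3,   0,   0,   0,
       0, 945,   7,   0,   0,
       0,   1, 847,  23,   0,
       0,   0,   1, 780,   2,
       1,   0,   0,   1, 899),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

correct  <- sum(diag(cm))   # correctly classified validation rows
total    <- sum(cm)         # all validation rows
accuracy <- correct / total

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int

round(accuracy, 4)
round(ci, 4)
```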
dtModel<-train(classe~.,data=trainSet,method="rpart")
dtPred<-predict(dtModel,validSet)
confusionMatrix(dtPred,validSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1272 395 398 360 113
## B 14 326 28 148 111
## C 104 228 429 296 257
## D 0 0 0 0 0
## E 5 0 0 0 420
##
## Overall Statistics
##
## Accuracy : 0.499
## 95% CI : (0.4849, 0.5131)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3454
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9118 0.34352 0.50175 0.0000 0.46615
## Specificity 0.6392 0.92389 0.78143 1.0000 0.99875
## Pos Pred Value 0.5012 0.51994 0.32648 NaN 0.98824
## Neg Pred Value 0.9480 0.85434 0.88134 0.8361 0.89261
## Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Detection Rate 0.2594 0.06648 0.08748 0.0000 0.08564
## Detection Prevalence 0.5175 0.12785 0.26794 0.0000 0.08666
## Balanced Accuracy 0.7755 0.63371 0.64159 0.5000 0.73245
The classification tree shows an accuracy of 50%, with a 95% confidence interval of (0.4849, 0.5131). It never predicts class D.
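The per-class statistics in the tree output follow directly from the confusion matrix. The sketch below re-enters the matrix from the output above and recomputes sensitivity and positive predictive value; it also shows why the positive predictive value for class D is reported as NaN (the tree makes no D predictions, so the calculation is 0/0).

```r
# Classification tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(
  c(1272, 395, 398, 360, 113,
      14, 326,  28, 148, 111,
     104, 228, 429, 296, 257,
       0,   0,   0,   0,   0,
       5,   0,   0,   0, 420),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Sensitivity: correct predictions divided by actual cases of each class
sensitivity <- diag(cm) / colSums(cm)

# Positive predictive value: correct predictions divided by all
# predictions of each class (0/0 = NaN when a class is never predicted)
ppv <- diag(cm) / rowSums(cm)

round(sensitivity, 4)
round(ppv, 4)
```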
ldaModel<-train(classe~.,data=trainSet,method="lda")
ldaPred<-predict(ldaModel,validSet)
confusionMatrix(ldaPred,validSet$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1138 135 94 48 32
## B 29 614 85 34 160
## C 98 112 554 96 87
## D 125 40 97 592 85
## E 5 48 25 34 537
##
## Overall Statistics
##
## Accuracy : 0.7004
## 95% CI : (0.6874, 0.7132)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.621
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8158 0.6470 0.6480 0.7363 0.5960
## Specificity 0.9119 0.9221 0.9029 0.9154 0.9720
## Pos Pred Value 0.7865 0.6659 0.5850 0.6305 0.8274
## Neg Pred Value 0.9257 0.9159 0.9239 0.9465 0.9145
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2321 0.1252 0.1130 0.1207 0.1095
## Detection Prevalence 0.2951 0.1880 0.1931 0.1915 0.1323
## Balanced Accuracy 0.8639 0.7846 0.7754 0.8258 0.7840
The linear discriminant analysis model shows an accuracy of 70%, with a 95% confidence interval of (0.6874, 0.7132).
Comparing the three models on validation accuracy, the random forest is clearly the best fit (99.2%, versus 70.0% for linear discriminant analysis and 49.9% for the classification tree).
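The comparison can be summarized in one small table. The sketch below simply collects the accuracy and kappa values printed in the three outputs above and sorts them; the data frame and its column names are illustrative, not part of the original analysis.

```r
# Accuracy and kappa copied from the three confusionMatrix outputs above
results <- data.frame(
  model    = c("rf", "rpart", "lda"),
  accuracy = c(0.9920, 0.4990, 0.7004),
  kappa    = c(0.9899, 0.3454, 0.6210),
  stringsAsFactors = FALSE
)

# Sort from best to worst validation accuracy
results <- results[order(-results$accuracy), ]
results
```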
Using the random forest model, I expect about 99% accuracy on new data, i.e. an out-of-sample error of roughly 0.8%, based on the validation set. Sensitivity ranged from 97.0% to 99.9% across the classes, and specificity was at least 99.4% for every class. Below are the predicted classes for each of the 20 test observations.
finalPred <- predict(rfModel,test)
finalPred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
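As a rough sanity check on these 20 predictions, the validation accuracy of 0.992 from the random forest output can be turned into an expected error rate and an expected number of correct test cases. This is a back-of-the-envelope sketch that assumes the test cases are independent and that the validation accuracy carries over to them.

```r
# Validation accuracy from the random forest confusionMatrix output
accuracy <- 0.992
n_test   <- 20

# Estimated out-of-sample error rate
oos_error <- 1 - accuracy

# Expected number of correct predictions among the 20 test cases
expected_correct <- n_test * accuracy

# Probability that all 20 are correct, assuming independent cases
p_all_correct <- accuracy^n_test

c(oos_error = oos_error,
  expected_correct = expected_correct,
  p_all_correct = p_all_correct)
```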