Practical Machine Learning Exercise

Introduction

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how each lift was performed (the classe variable). The participants were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data

The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

Initial Exploration & Processing

#load libraries necessary for modeling
library(lattice);library(ggplot2);library(caret);library(randomForest);library(rpart);library(rpart.plot)

## Warning: package 'caret' was built under R version 3.4.4

## Warning: package 'randomForest' was built under R version 3.4.4

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## Warning: package 'rpart.plot' was built under R version 3.4.4

#get and read data
datrain<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
datest<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
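#read files, treating "NA", "#DIV/0!" and blank entries as missing values
#stringsAsFactors=TRUE keeps classe a factor on R >= 4.0 (it was the default in R 3.x)
train<-read.csv(datrain,na.strings=c("NA","#DIV/0!", ""),stringsAsFactors=TRUE)
test<-read.csv(datest,na.strings=c("NA","#DIV/0!", ""),stringsAsFactors=TRUE)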
#dim(train);dim(test);summary(train);summary(test)
#remove non-essential variables, i.e. username, timestamp, window, etc.  cols 1-7
train<-train[,-c(1:7)]
test<-test[,-c(1:7)]
#remove columns with NAs
train<-train[,colSums(is.na(train)) == 0]
test <-test[,colSums(is.na(test)) == 0]
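
The NA filter above can be illustrated on a toy data frame. This is only a sketch: the column values are made up, and selecting the test columns by name (rather than filtering the test set independently) is a suggested safeguard, not what the code above does. It ensures the test set keeps exactly the predictors the model was trained on.

```r
# Toy data; the values are hypothetical, not taken from the real dataset.
toy_train <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.6),
  kurtosis_x = c(NA, NA, 0.2),   # a mostly missing summary statistic
  classe     = factor(c("A", "B", "A"))
)
toy_test <- data.frame(
  roll_belt  = c(1.1, 1.2),
  kurtosis_x = c(NA, NA),
  problem_id = 1:2
)

# colSums(is.na(df)) counts the missing values in each column;
# keep only the columns where that count is zero.
keep <- colSums(is.na(toy_train)) == 0
toy_train <- toy_train[, keep]

# Pick the test columns by name from the surviving training predictors.
predictors <- setdiff(names(toy_train), "classe")
toy_test <- toy_test[, predictors, drop = FALSE]

names(toy_train)
names(toy_test)
```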

Split the training data into a training set (75%) and a validation set (25%) using caret's createDataPartition function.

trainPart<-createDataPartition(y=train$classe,p=.75,list=FALSE)
trainSet<-train[trainPart,]
validSet<-train[-trainPart,]
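
createDataPartition samples within each level of the outcome so that class proportions are preserved. The base R sketch below approximates that idea on a made-up outcome vector; it is not caret's implementation. It also sets the seed before sampling, which is what makes a split repeatable (in the report the seed is set only after the partition).

```r
set.seed(33134)  # seed set before sampling so the split can be reproduced

# Hypothetical outcome with unbalanced classes.
classe <- factor(rep(c("A", "B", "C"), times = c(40, 20, 20)))

# Sample 75% of the row indices within each class.
idx_by_class <- split(seq_along(classe), classe)
train_idx <- sort(unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = round(0.75 * length(i)))]
}), use.names = FALSE))

train_y <- classe[train_idx]
valid_y <- classe[-train_idx]

# Class proportions match in both parts.
prop.table(table(train_y))
prop.table(table(valid_y))
```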

Plot the frequency of each classe level in the training set.

#plot the classe variable
plot(trainSet$classe,col="pink",main="Classe within Train Set",xlab="Classe",ylab="Frequency")

Models

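We fit three different models after setting the seed to 33134. Because the seed is set here, after the partition above, the train/validation split itself will differ from run to run.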

set.seed(33134)

Random Forest

rfModel<-train(classe~.,data=trainSet,method="rf",verbose=FALSE)
rfPred<-predict(rfModel,validSet)
confusionMatrix(rfPred,validSet$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    3    0    0    0
##          B    0  945    7    0    0
##          C    0    1  847   23    0
##          D    0    0    1  780    2
##          E    1    0    0    1  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.992           
##                  95% CI : (0.9891, 0.9943)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9899          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9958   0.9906   0.9701   0.9978
## Specificity            0.9991   0.9982   0.9941   0.9993   0.9995
## Pos Pred Value         0.9979   0.9926   0.9724   0.9962   0.9978
## Neg Pred Value         0.9997   0.9990   0.9980   0.9942   0.9995
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1927   0.1727   0.1591   0.1833
## Detection Prevalence   0.2849   0.1941   0.1776   0.1597   0.1837
## Balanced Accuracy      0.9992   0.9970   0.9924   0.9847   0.9986

The random forest prediction shows an accuracy of 99.2% on the validation set, with a 95% confidence interval of (.989, .994).
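
The accuracy and its interval can be checked by hand from the confusion matrix printed above. The sketch below copies the matrix from that output and assumes the interval is an exact binomial (Clopper-Pearson) interval as returned by base R's binom.test, which is my understanding of how confusionMatrix computes it.

```r
# Confusion matrix copied from the output above
# (rows = predicted class, columns = reference class).
rf_cm <- matrix(
  c(1394,   3,   0,   0,   0,
       0, 945,   7,   0,   0,
       0,   1, 847,  23,   0,
       0,   0,   1, 780,   2,
       1,   0,   0,   1, 899),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

correct  <- sum(diag(rf_cm))  # correctly classified observations
total    <- sum(rf_cm)        # size of the validation set
accuracy <- correct / total

# Exact binomial 95% confidence interval for the accuracy.
ci <- binom.test(correct, total)$conf.int

round(accuracy, 4)
round(ci, 4)
```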

Classification Tree

dtModel<-train(classe~.,data=trainSet,method="rpart")
dtPred<-predict(dtModel,validSet)
confusionMatrix(dtPred,validSet$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1272  395  398  360  113
##          B   14  326   28  148  111
##          C  104  228  429  296  257
##          D    0    0    0    0    0
##          E    5    0    0    0  420
## 
## Overall Statistics
##                                           
##                Accuracy : 0.499           
##                  95% CI : (0.4849, 0.5131)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3454          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9118  0.34352  0.50175   0.0000  0.46615
## Specificity            0.6392  0.92389  0.78143   1.0000  0.99875
## Pos Pred Value         0.5012  0.51994  0.32648      NaN  0.98824
## Neg Pred Value         0.9480  0.85434  0.88134   0.8361  0.89261
## Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
## Detection Rate         0.2594  0.06648  0.08748   0.0000  0.08564
## Detection Prevalence   0.5175  0.12785  0.26794   0.0000  0.08666
## Balanced Accuracy      0.7755  0.63371  0.64159   0.5000  0.73245

The classification tree shows an accuracy of 50%, with a 95% confidence interval of (.485, .513). It never predicts class D, so the sensitivity for that class is 0.
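
The per-class statistics follow directly from the confusion matrix above. As a sketch, sensitivity is the diagonal divided by the column (reference) totals and positive predictive value is the diagonal divided by the row (prediction) totals. Because the tree makes no class D predictions, the D row sums to zero and its positive predictive value is 0/0, which R reports as NaN.

```r
# Confusion matrix copied from the classification tree output above
# (rows = predicted class, columns = reference class).
dt_cm <- matrix(
  c(1272, 395, 398, 360, 113,
      14, 326,  28, 148, 111,
     104, 228, 429, 296, 257,
       0,   0,   0,   0,   0,
       5,   0,   0,   0, 420),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Sensitivity: share of each reference class that was predicted correctly.
sensitivity <- diag(dt_cm) / colSums(dt_cm)

# Positive predictive value: share of each predicted class that was correct.
ppv <- diag(dt_cm) / rowSums(dt_cm)

round(sensitivity, 4)
round(ppv, 4)
```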

Linear Discriminant Analysis

ldaModel<-train(classe~.,data=trainSet,method="lda")
ldaPred<-predict(ldaModel,validSet)
confusionMatrix(ldaPred,validSet$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1138  135   94   48   32
##          B   29  614   85   34  160
##          C   98  112  554   96   87
##          D  125   40   97  592   85
##          E    5   48   25   34  537
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7004          
##                  95% CI : (0.6874, 0.7132)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.621           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8158   0.6470   0.6480   0.7363   0.5960
## Specificity            0.9119   0.9221   0.9029   0.9154   0.9720
## Pos Pred Value         0.7865   0.6659   0.5850   0.6305   0.8274
## Neg Pred Value         0.9257   0.9159   0.9239   0.9465   0.9145
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2321   0.1252   0.1130   0.1207   0.1095
## Detection Prevalence   0.2951   0.1880   0.1931   0.1915   0.1323
## Balanced Accuracy      0.8639   0.7846   0.7754   0.8258   0.7840

The linear discriminant analysis (LDA) model shows an accuracy of 70%, with a 95% confidence interval of (.687, .713).
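
The Kappa value in the output above corrects the observed accuracy for the agreement expected by chance. A minimal sketch of that calculation, using the standard Cohen's kappa formula and the matrix copied from the output:

```r
# Confusion matrix copied from the LDA output above
# (rows = predicted class, columns = reference class).
lda_cm <- matrix(
  c(1138, 135,  94,  48,  32,
      29, 614,  85,  34, 160,
      98, 112, 554,  96,  87,
     125,  40,  97, 592,  85,
       5,  48,  25,  34, 537),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n <- sum(lda_cm)

# Observed agreement: the overall accuracy.
observed <- sum(diag(lda_cm)) / n

# Chance agreement: product of the marginal totals for each class.
expected <- sum(rowSums(lda_cm) * colSums(lda_cm)) / n^2

# Cohen's kappa: agreement beyond chance, scaled to a maximum of 1.
kappa_hat <- (observed - expected) / (1 - expected)

round(observed, 4)
round(kappa_hat, 4)
```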

Comparing the three models on validation accuracy, the random forest is the clear best fit: 99.2%, versus 70.0% for LDA and 49.9% for the classification tree.
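
The comparison can be laid out side by side, together with the estimated out-of-sample error (one minus validation accuracy). The counts below are the diagonal sums of the three confusion matrices printed above; the object and row names are chosen here for illustration.

```r
# Size of the validation set and correct predictions per model,
# taken from the diagonals of the confusion matrices above.
n_valid <- 4904
correct <- c(random_forest = 4865, classification_tree = 2447, lda = 3435)

comparison <- data.frame(
  accuracy  = correct / n_valid,
  oos_error = 1 - correct / n_valid,   # estimated out-of-sample error
  row.names = names(correct)
)

# Sort from most to least accurate.
comparison <- comparison[order(-comparison$accuracy), ]
round(comparison, 4)
```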

Final Prediction

Using the random forest model, I expect about 99% accuracy on these data points, i.e. an out-of-sample error of roughly 0.8%, based on the validation set. Per-class sensitivity ranged from 97.0% to 99.9%, and specificity was above 99% for every class. Below are the predicted classes for each of the 20 test observations.

finalPred <- predict(rfModel,test)
finalPred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E