Overview

Devices such as Jawbone Up, Nike FuelBand, and Fitbit make it possible to collect a large amount of data about personal activity relatively inexpensively. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants.

The goal of this project is to predict the manner in which the participants performed the exercise (the classe variable) and to use the selected model to predict 20 different test cases.

Part 1. Model Selection

1.0 Data and Library Load

library(caret)

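# stringsAsFactors = TRUE keeps classe a factor on R >= 4.0, where read.csv no longer converts strings by default
train <- read.csv("./Data/pml-training.csv", stringsAsFactors = TRUE)
test <- read.csv("./Data/pml-testing.csv", stringsAsFactors = TRUE)
# the training data is large, so split it into training and validation sets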

set.seed(255)
inTrain <- createDataPartition(y=train$classe, p=0.618, list=FALSE)
trainSet <- train[inTrain, ]
trainVald <- train[-inTrain, ]
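
For a factor outcome, createDataPartition samples within each class, so the training and validation subsets keep roughly the same class mix. The idea can be sketched in base R on made-up labels; the vector `labels` and the helper `stratified_idx` are hypothetical illustrations, not part of caret or of this analysis.

```r
# Toy outcome vector; stands in for train$classe (hypothetical data).
set.seed(255)
labels <- factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE))

# Hypothetical helper: draw a fraction p of the row indices within each class.
stratified_idx <- function(y, p) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

in_train <- stratified_idx(labels, p = 0.618)

# Class proportions in the two subsets should be close to each other.
round(prop.table(table(labels[in_train])), 3)
round(prop.table(table(labels[-in_train])), 3)
```

Comparing the two proportion tables is a quick way to confirm that no class is under-represented in the validation subset.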

1.1 Data Preparation for modeling

# identify near-zero-variance (nzv) predictors and remove them from the data
nzv <- nearZeroVar(trainSet)
#nzv
trainSet <- trainSet[, -nzv]
trainVald <- trainVald[, -nzv]
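
With its default settings, nearZeroVar is documented to flag a column when the ratio of its most common value to its second most common value is large (above 95/5) and the share of distinct values is small (at most 10%). The base-R sketch below computes those two statistics for toy columns; `nzv_stats`, `flag_col`, and `noisy_col` are hypothetical names used only for illustration, and the helper is not caret's implementation.

```r
# Hypothetical helper mirroring the two statistics nearZeroVar reports.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value to the second most frequent one.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of rows.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

flag_col  <- c(rep("no", 98), rep("yes", 2))   # almost constant toy column
noisy_col <- seq_len(100)                      # every value distinct

nzv_stats(flag_col)
nzv_stats(noisy_col)
```

A column like `flag_col` carries almost no information for the classifier, which is why such columns are dropped before modeling.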

# remove columns that contain NA values; the mask is computed on trainSet so both sets keep the same columns
noNA <- colSums(is.na(trainSet)) == 0
trainSet <- trainSet[, noNA]
trainVald <- trainVald[, noNA]

# remove identifier and timestamp variables that are not meaningful predictors (X, user_name, raw_timestamp_part_1, raw_timestamp_part_2, cvtd_timestamp); these are the first five remaining columns
trainSet <- trainSet[, -(1:5)]
trainVald <- trainVald[, -(1:5)]
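
One way to make sure every data set ends up with identical predictors is to decide which columns to keep once, on the training subset, and then select those columns by name everywhere else. A minimal sketch on toy data frames follows; `toy_train`, `toy_vald`, and `keep_cols` are made-up stand-ins, not objects from this analysis.

```r
# Toy data: the NA values sit in different columns in the two sets.
toy_train <- data.frame(a = c(1, 2, 3), b = c(NA, 5, 6), c = c(7, 8, 9))
toy_vald  <- data.frame(a = c(4, NA),   b = c(1, 2),     c = c(3, 4))

# Decide on the training subset only which columns are free of NA values.
keep_cols <- names(toy_train)[colSums(is.na(toy_train)) == 0]

# Select the same columns, by name, in every data set.
toy_train <- toy_train[, keep_cols, drop = FALSE]
toy_vald  <- toy_vald[, keep_cols, drop = FALSE]

identical(names(toy_train), names(toy_vald))
```

Filtering each toy set on its own NA pattern would keep different columns in each, which is the situation this approach avoids.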

1.2 Modeling - rpart, gbm, rf

# model building
# use 3-fold cross-validation when training each model
control <- trainControl(method="cv", number=3, verboseIter=FALSE)

# fit model rpart on trainSet
fitrpart <- train(classe~., data = trainSet, method = "rpart", trControl=control)

#fit model gbm on trainSet
fitgbm <- train(classe ~ ., data=trainSet, method="gbm", trControl=control)

#fit model random forest on trainSet
fitrf <- train(classe ~ ., data=trainSet, method="rf", trControl=control)
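
The `control` object asks caret to estimate accuracy by 3-fold cross-validation: the training rows are divided into three folds, and each fold is held out once while the model is fit on the other two. The bookkeeping can be sketched roughly in base R as below; the row count `n` is a toy value, and the plain random fold assignment is a simplification (caret builds its own folds and may balance them by class).

```r
set.seed(255)
n <- 12   # toy number of rows (hypothetical)
k <- 3    # number of folds, as in number = 3

# Randomly assign each row to one of k folds of (nearly) equal size.
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)   # rows used to score the model
  fit_rows <- which(fold_id != fold)   # rows used to fit the model
  cat("fold", fold, ": fit on", length(fit_rows),
      "rows, score on", length(held_out), "rows\n")
}
```

Every row is scored exactly once, so the averaged fold accuracy uses all of the training data without scoring a model on rows it was fit on.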

1.3 Model Comparison - rpart, gbm, rf

#predict the validation data and check accuracy on model - rpart
predrpartVald <- predict(fitrpart, newdata=trainVald) 
confusionMatrix(predrpartVald, trainVald$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1934  612  637  503  138
##          B   28  504   39  218   87
##          C  165  334  631  441  306
##          D    0    0    0    0    0
##          E    4    0    0   66  846
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5225          
##                  95% CI : (0.5111, 0.5339)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3767          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9076  0.34759  0.48279   0.0000   0.6144
## Specificity            0.6475  0.93844  0.79858   1.0000   0.9886
## Pos Pred Value         0.5058  0.57534  0.33617      NaN   0.9236
## Neg Pred Value         0.9463  0.85703  0.87963   0.8361   0.9193
## Prevalence             0.2844  0.19351  0.17443   0.1639   0.1838
## Detection Rate         0.2581  0.06726  0.08421   0.0000   0.1129
## Detection Prevalence   0.5103  0.11691  0.25050   0.0000   0.1222
## Balanced Accuracy      0.7775  0.64301  0.64068   0.5000   0.8015
#fitrpart$results

#predict the validation data and check accuracy on model - gbm
predgbmVald <- predict(fitgbm, newdata=trainVald) 
confusionMatrix(predgbmVald, trainVald$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2129   12    0    0    0
##          B    2 1420    9   10    5
##          C    0   18 1294   17    6
##          D    0    0    3 1200    3
##          E    0    0    1    1 1363
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9884          
##                  95% CI : (0.9857, 0.9907)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9853          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9991   0.9793   0.9901   0.9772   0.9898
## Specificity            0.9978   0.9957   0.9934   0.9990   0.9997
## Pos Pred Value         0.9944   0.9820   0.9693   0.9950   0.9985
## Neg Pred Value         0.9996   0.9950   0.9979   0.9955   0.9977
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2841   0.1895   0.1727   0.1601   0.1819
## Detection Prevalence   0.2857   0.1930   0.1782   0.1610   0.1822
## Balanced Accuracy      0.9984   0.9875   0.9917   0.9881   0.9948

#predict the validation data and check accuracy on model - random forest
predrfVald <- predict(fitrf, newdata=trainVald)
confusionMatrix(predrfVald, trainVald$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2130    6    0    0    0
##          B    0 1444    2    0    0
##          C    0    0 1302   10    0
##          D    0    0    3 1218    1
##          E    1    0    0    0 1376
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9969          
##                  95% CI : (0.9954, 0.9981)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9961          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9995   0.9959   0.9962   0.9919   0.9993
## Specificity            0.9989   0.9997   0.9984   0.9994   0.9998
## Pos Pred Value         0.9972   0.9986   0.9924   0.9967   0.9993
## Neg Pred Value         0.9998   0.9990   0.9992   0.9984   0.9998
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1927   0.1738   0.1626   0.1836
## Detection Prevalence   0.2851   0.1930   0.1751   0.1631   0.1838
## Balanced Accuracy      0.9992   0.9978   0.9973   0.9956   0.9996

Conclusion:

On the validation set, the random forest model has the highest accuracy (0.9969), ahead of gbm (0.9884) and rpart (0.5225). The random forest model is therefore chosen to predict the test dataset.
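
The overall accuracy reported by confusionMatrix is the share of validation cases on the diagonal of the table, and one minus that value gives an estimate of the out-of-sample error. As a small check, the sketch below re-enters the random forest table from the output above by hand; `cm_rf`, `accuracy`, `oos_error`, and `sensitivity` are names introduced here for illustration.

```r
# Random forest confusion matrix copied from the validation output
# (rows = prediction, columns = reference).
cm_rf <- matrix(
  c(2130,    6,    0,    0,    0,
       0, 1444,    2,    0,    0,
       0,    0, 1302,   10,    0,
       0,    0,    3, 1218,    1,
       1,    0,    0,    0, 1376),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy  <- sum(diag(cm_rf)) / sum(cm_rf)   # correct predictions / all predictions
oos_error <- 1 - accuracy                    # estimated out-of-sample error

# Per-class sensitivity: correct predictions divided by the reference (column) totals.
sensitivity <- diag(cm_rf) / colSums(cm_rf)

round(c(accuracy = accuracy, oos_error = oos_error), 4)
round(sensitivity, 4)
```

The computed accuracy agrees with the reported 0.9969, which corresponds to an estimated out-of-sample error of about 0.3% for the random forest model.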

Part 2. Prediction

2.1 Prediction with the selected model - rf

# prepare the test data with the same steps applied to the training data
test <- test[,-nzv]
test <- test[, colSums(is.na(test)) == 0] 
test <- test[,-(1:5)]
predTest <- predict(fitrf, newdata = test)
predTest
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
