HARProject

Preprocessing

The data contain many variables with too many missing values. Because more than 90% of the rows are missing for these variables, it is better to remove them than to attempt any imputation method. The first 7 variables are label variables and can therefore be removed as well.

#read in the data
harTrain <- read.csv("pml-training.csv", na.strings=c("NA",""))
ncol(harTrain)

## [1] 160

#remove variables with >90% missing values
harTrainSub <- harTrain[,colSums(is.na(harTrain)) <= .9*nrow(harTrain)]
#removing the first 7 columns 
harTrainSub <- harTrainSub[,-(1:7)]
ncol(harTrainSub)

## [1] 53

# checking for any remaining missing values: count the columns that still contain an NA
sum(colSums(is.na(harTrainSub)) > 0)

## [1] 0
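
The filtering rule above can be checked by looking at the share of missing values per column before dropping anything. The following is a minimal sketch on a made-up data frame; `toy`, its column names, and its values are hypothetical and only stand in for the real training data.

```r
# Toy data frame (hypothetical values) standing in for the raw training data.
toy <- data.frame(
  id       = 1:20,
  mostlyNA = c(1.5, rep(NA, 19)),          # 95% missing
  halfNA   = c(rep(2.0, 10), rep(NA, 10)), # 50% missing
  full     = seq(0.1, 2, by = 0.1)         # no missing values
)

# Share of missing values in each column.
naShare <- colMeans(is.na(toy))
print(naShare)

# Keep only the columns with at most 90% missing values.
toySub <- toy[, naShare <= 0.9, drop = FALSE]
names(toySub)
```

Using `colMeans(is.na(...))` expresses the threshold directly as a proportion, which is equivalent to comparing `colSums` against `.9*nrow(...)` as done above.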

Splitting the data into training and testing subsets

library(caret)
set.seed(100)
splt <- createDataPartition(harTrainSub$classe, p = 0.7, list = FALSE)
training <- harTrainSub[splt,]
testing <- harTrainSub[-splt,]
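
createDataPartition samples within each level of the outcome, so the class proportions are roughly preserved in both subsets. The sketch below imitates that idea in base R on made-up labels; it is an illustration of stratified sampling under that assumption, not caret's actual implementation, and the names `classe` (a toy vector here), `idxByClass`, and `trainIdx` are hypothetical.

```r
set.seed(100)

# Hypothetical class labels with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then combine them.
idxByClass <- split(seq_along(classe), classe)
trainIdx <- sort(unlist(lapply(idxByClass, function(i) {
  sample(i, size = round(0.7 * length(i)))
}), use.names = FALSE))

# Class proportions in the full data and in the 70% subset.
prop.table(table(classe))
prop.table(table(classe[trainIdx]))
```

On the real data, the same comparison can be made with `prop.table(table(training$classe))` and `prop.table(table(testing$classe))`.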

RPART (Recursive Partitioning and Regression Trees) Model

An rpart model is attempted first. K-fold cross-validation is done using the trainControl function in the caret package, with the number of folds set to 10.
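
As a rough illustration of what k-fold cross-validation does, the sketch below assigns rows to folds and shows which rows are held out in each round. It uses base R only, a made-up row count, and 5 folds instead of 10 to keep the printout short; caret builds its folds internally and its procedure differs in detail, so this is not the code that trainControl runs.

```r
set.seed(100)
n <- 25   # hypothetical number of training rows
k <- 5    # number of folds in this sketch

# Randomly assign each row to one of k folds of equal size.
foldId <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  holdout <- which(foldId == fold)          # rows used for validation in this round
  fitRows <- setdiff(seq_len(n), holdout)   # rows used for fitting in this round
  cat("Fold", fold, ": fit on", length(fitRows),
      "rows, validate on", length(holdout), "rows\n")
}
```

Every row is held out exactly once, so the averaged validation accuracy uses all of the training data without ever scoring a row with a model that was fitted on it.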

mdlTree <- train(classe ~ ., data = training, method = "rpart", trControl = trainControl(method = "cv", number = 10))
library(rattle)
library(rpart)
library(rpart.plot)
fancyRpartPlot(mdlTree$finalModel)

#Applying the model to the testing set and checking the accuracy
predTree <- predict(mdlTree, newdata = testing)
confusionMatrix(predTree, testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1493  506  504  438  152
##          B   33  369   35  170  141
##          C  118  264  487  356  287
##          D    0    0    0    0    0
##          E   30    0    0    0  502
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4845          
##                  95% CI : (0.4716, 0.4973)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3256          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8919   0.3240  0.47466   0.0000   0.4640
## Specificity            0.6200   0.9201  0.78905   1.0000   0.9938
## Pos Pred Value         0.4827   0.4933  0.32209      NaN   0.9436
## Neg Pred Value         0.9352   0.8501  0.87674   0.8362   0.8916
## Prevalence             0.2845   0.1935  0.17434   0.1638   0.1839
## Detection Rate         0.2537   0.0627  0.08275   0.0000   0.0853
## Detection Prevalence   0.5256   0.1271  0.25692   0.0000   0.0904
## Balanced Accuracy      0.7560   0.6221  0.63186   0.5000   0.7289

The accuracy of the model is only about 48%, which is better than the baseline accuracy of 28% (predicting the most frequent class, A, for every case). The tree also never predicts class D. We can therefore attempt to improve on this using other ML algorithms. Changing the value of K did not yield any substantial improvement in the accuracy of the model.
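
The accuracy and the baseline quoted above can be recomputed by hand from the confusion matrix. The sketch below copies the counts from the rpart output; the object name `cmTree` is introduced here only for illustration.

```r
# Confusion matrix of the rpart model, copied from the output above
# (rows = predictions, columns = reference classes).
cmTree <- matrix(
  c(1493, 506, 504, 438, 152,
      33, 369,  35, 170, 141,
     118, 264, 487, 356, 287,
       0,   0,   0,   0,   0,
      30,   0,   0,   0, 502),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correct predictions (the diagonal) over all cases.
accuracy <- sum(diag(cmTree)) / sum(cmTree)

# Baseline: share of the most frequent reference class (always predicting A).
baseline <- max(colSums(cmTree)) / sum(cmTree)

round(c(accuracy = accuracy, baseline = baseline), 4)
```

The two values correspond to the Accuracy and No Information Rate lines of the confusionMatrix output.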

Random Forest

The train function in the caret package, with method = "rf", is used to build a random forest model, again using the trainControl function with 10-fold cross-validation. K was chosen to be 10 because that is a number of folds the random forest algorithm could handle given the RAM limitation and run time on my machine.

set.seed(100)
mdlRF <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(method = "cv", number = 10))
save(mdlRF,file = "harRF.RData")
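
Because training is slow, the saved file can be loaded in a later session instead of refitting the model. The sketch below shows the save and load round trip on a small stand-in model; `toyFit` and the temporary file name are hypothetical, and for the real model the call would presumably be `load("harRF.RData")`.

```r
# Toy model standing in for the (much slower to train) random forest.
toyFit <- lm(dist ~ speed, data = cars)

# Save the fitted object, remove it, and load it back under the same name.
modelFile <- file.path(tempdir(), "toyFit.RData")  # hypothetical file name
save(toyFit, file = modelFile)
rm(toyFit)
load(modelFile)

# The restored object can be used for prediction without refitting.
predict(toyFit, newdata = data.frame(speed = 10))
```

Note that `load` restores the object under the name it was saved with, so no assignment is needed.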
 
#Plotting the variables in order of importance
plot(varImp(mdlRF))

#Applying the model to the testing set and checking the accuracy
predRF <- predict(mdlRF, newdata = testing)
confusionMatrix(predRF, testing$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    8    0    0    0
##          B    1 1131    4    0    0
##          C    0    0 1019   16    0
##          D    0    0    3  947    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9942         
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9930   0.9932   0.9824   0.9991
## Specificity            0.9981   0.9989   0.9967   0.9992   0.9998
## Pos Pred Value         0.9952   0.9956   0.9845   0.9958   0.9991
## Neg Pred Value         0.9998   0.9983   0.9986   0.9966   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1922   0.1732   0.1609   0.1837
## Detection Prevalence   0.2856   0.1930   0.1759   0.1616   0.1839
## Balanced Accuracy      0.9988   0.9960   0.9949   0.9908   0.9994

The random forest model predicts with a high accuracy of 99.4% when applied to the testing subset. Because that subset was not used to train the model, a similar level of accuracy can be expected on new data, and the expected out-of-sample error is about 0.6%.
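
The 0.6% figure follows directly from the confusion matrix above. The sketch below recomputes it and adds an exact binomial interval from base R's binom.test, which should be close to the 95% CI that confusionMatrix reports; the object names are introduced here only for illustration.

```r
# Confusion matrix of the random forest model, copied from the output above
# (rows = predictions, columns = reference classes).
cmRF <- matrix(
  c(1673,    8,    0,   0,    0,
       1, 1131,    4,   0,    0,
       0,    0, 1019,  16,    0,
       0,    0,    3, 947,    1,
       0,    0,    0,   1, 1081),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

correct <- sum(diag(cmRF))   # correctly classified test cases
total   <- sum(cmRF)         # all test cases

# Estimated out-of-sample error: share of misclassified test cases.
errorRate <- 1 - correct / total
round(errorRate, 4)

# Exact 95% binomial interval for the accuracy, converted to an error interval.
accCI <- binom.test(correct, total)$conf.int
errCI <- 1 - rev(accCI)
round(errCI, 4)
```

This estimate rests on a single 70/30 split, so the interval reflects sampling variation in the testing subset only.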

Conclusion

The model was applied to the data set of 20 test cases, and the resulting predictions were found to be 100% accurate according to the submission results. So, as expected, the random forest model had a high degree of accuracy.

Bipin Karunakaran

July 25, 2015
