Synopsis:

This report presents the analysis of the Human Activity Recognition data, submitted as part of the Practical Machine Learning course from Johns Hopkins University offered through Coursera.

The objective is to analyze the data on exercise activity and fit a model that predicts which of 5 classes of exercise was performed, based on data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The participants performed barbell lifts correctly and incorrectly in 5 different ways. Class A corresponds to the correct technique, and classes B to E correspond to various incorrect techniques.

The data set downloaded from http://groupware.les.inf.puc-rio.br/har is used to fit the model, with ‘classe’ as the target variable. Two methods, RPART and Random Forest, were used with cross validation, and the random forest model was applied to the 20 test cases given.

Preprocessing

The data had many variables with a large number of missing values. Since more than 90% of the values in these variables were missing, it is better to remove them than to attempt any imputation. Also, the first 7 variables are label variables and can therefore be removed.

# read in the data, treating "NA" and empty strings as missing
harTrain <- read.csv("pml-training.csv", na.strings = c("NA", ""))
ncol(harTrain)
## [1] 160
# remove variables with more than 90% missing values
harTrainSub <- harTrain[, colSums(is.na(harTrain)) <= 0.9 * nrow(harTrain)]
# remove the first 7 columns
harTrainSub <- harTrainSub[, -(1:7)]
ncol(harTrainSub)
## [1] 53
# count the remaining columns that still contain missing values
sum(colSums(is.na(harTrainSub)) > 0)
## [1] 0
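
To make the missing-value filter easier to follow, here is a minimal sketch on a toy data frame. The column names and values are hypothetical and are not taken from the course data; only the thresholding logic mirrors the step above.

```r
# Toy data frame (hypothetical columns) standing in for the raw training data.
toy <- data.frame(
  roll_belt  = c(1.4, 1.5, 1.6, 1.7, 1.8, 1.9, 2.0, 2.1, 2.2, 2.3),
  pitch_belt = c(8.1, NA, 8.3, 8.4, NA, 8.6, 8.7, 8.8, 8.9, 9.0),  # 20% missing
  max_roll   = rep(NA_real_, 10),                                  # 100% missing
  classe     = c("A", "A", "B", "B", "C", "C", "D", "D", "E", "E")
)

# Fraction of missing values in each column.
na_frac <- colMeans(is.na(toy))
na_frac

# Keep only the columns whose missing fraction does not exceed 0.9.
kept <- toy[, na_frac <= 0.9, drop = FALSE]
names(kept)
```

Note that a column with a moderate share of missing values (pitch_belt here) survives this filter, which is why the last check in the preprocessing code is still worth running.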

Splitting the data into training and testing subsets

library(caret)
set.seed(100)
splt <- createDataPartition(harTrainSub$classe, p = 0.7, list = FALSE)
training <- harTrainSub[splt,]
testing <- harTrainSub[-splt,]
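
The split above is stratified on classe, so both subsets should have nearly the same class proportions. The sketch below illustrates that idea in base R on simulated labels. It is an approximation written for illustration, not caret's own implementation, and the toy class probabilities are assumptions.

```r
# Toy stand-in for harTrainSub: only the class label matters for this check.
set.seed(100)
toy <- data.frame(
  classe = factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE,
                         prob = c(0.28, 0.19, 0.17, 0.16, 0.20)))
)

# Stratified 70/30 split: draw 70% of the row indices within each class.
idx_by_class <- split(seq_len(nrow(toy)), toy$classe)
train_idx <- unlist(
  lapply(idx_by_class, function(i) i[sample.int(length(i), round(0.7 * length(i)))]),
  use.names = FALSE
)
toy_train <- toy[train_idx, , drop = FALSE]
toy_test  <- toy[-train_idx, , drop = FALSE]

# Class proportions should be nearly identical in the two subsets.
round(rbind(train = prop.table(table(toy_train$classe)),
            test  = prop.table(table(toy_test$classe))), 3)
```

The same prop.table(table(...)) check can be run on training$classe and testing$classe to confirm the real split.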

RPART (Recursive Partitioning and Regression Trees) Model

An rpart model is attempted first. K-fold cross validation is done using the trainControl function in the caret package, with the number of folds set to 10.
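
As an aside on what K-fold cross validation does with the rows, the following base-R sketch assigns a hypothetical 23 rows to 10 folds and reports how many rows are held out and how many are used for fitting in each round. It uses plain random folds; caret builds its own folds internally and also tries to balance the classes, which this sketch does not attempt.

```r
set.seed(100)
n <- 23   # hypothetical number of training rows
k <- 10   # number of folds, as in the trainControl call

# Assign each row to one of k folds at random; fold sizes differ by at most one row.
fold_id <- sample(rep(seq_len(k), length.out = n))

# In round f, fold f is held out and the model is fitted on the remaining rows.
# Only the sizes are reported here; no model is fitted.
sizes <- t(sapply(seq_len(k), function(f) {
  c(fold = f, held_out = sum(fold_id == f), fitted_on = sum(fold_id != f))
}))
sizes
```

Every row is held out exactly once, so the cross-validated accuracy is an average over predictions made on rows the corresponding fit never saw.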

mdlTree <- train(classe ~ ., data = training, method = "rpart", trControl = trainControl(method = "cv", number = 10))
library(rattle)
library(rpart)
library(rpart.plot)
fancyRpartPlot(mdlTree$finalModel)

#Applying the model to the testing set and checking the accuracy
predTree <- predict(mdlTree, newdata = testing)
confusionMatrix(predTree, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1493  506  504  438  152
##          B   33  369   35  170  141
##          C  118  264  487  356  287
##          D    0    0    0    0    0
##          E   30    0    0    0  502
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4845          
##                  95% CI : (0.4716, 0.4973)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3256          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8919   0.3240  0.47466   0.0000   0.4640
## Specificity            0.6200   0.9201  0.78905   1.0000   0.9938
## Pos Pred Value         0.4827   0.4933  0.32209      NaN   0.9436
## Neg Pred Value         0.9352   0.8501  0.87674   0.8362   0.8916
## Prevalence             0.2845   0.1935  0.17434   0.1638   0.1839
## Detection Rate         0.2537   0.0627  0.08275   0.0000   0.0853
## Detection Prevalence   0.5256   0.1271  0.25692   0.0000   0.0904
## Balanced Accuracy      0.7560   0.6221  0.63186   0.5000   0.7289

The accuracy of the model is about 48%, which is better than the baseline accuracy of 28% (predicting the most frequent class, A, for all cases) but still poor. In particular, the tree never predicts class D. We can attempt to improve this using other ML algorithms. Changing the value of K did not yield any substantial improvement in the accuracy of the model.
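
The accuracy and the baseline quoted here can be recomputed directly from the confusion matrix. The sketch below re-enters the counts from the rpart output by hand; the object name cm_tree is introduced only for this illustration.

```r
# Confusion matrix for the rpart model, copied from the output above
# (rows = predicted class, columns = reference class).
cm_tree <- matrix(
  c(1493, 506, 504, 438, 152,
      33, 369,  35, 170, 141,
     118, 264, 487, 356, 287,
       0,   0,   0,   0,   0,
      30,   0,   0,   0, 502),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy <- sum(diag(cm_tree)) / sum(cm_tree)     # share of correct predictions
baseline <- max(colSums(cm_tree)) / sum(cm_tree)  # always predict the most frequent class
round(c(accuracy = accuracy, baseline = baseline), 4)

# Number of times each class was predicted; D is never predicted.
rowSums(cm_tree)
```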

Random Forest

The train function in the caret package with method = “rf” is used to build a random forest model, again with 10-fold cross validation specified through trainControl. K = 10 was chosen as a value that the random forest algorithm could handle within the RAM and run-time limits of my machine.

set.seed(100)
mdlRF <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(method = "cv", number = 10))
save(mdlRF,file = "harRF.RData")
 
#Plotting the variables in order of importance
plot(varImp(mdlRF))

#Applying the model to the testing set and checking the accuracy
predRF <- predict(mdlRF, newdata = testing)
confusionMatrix(predRF, testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    8    0    0    0
##          B    1 1131    4    0    0
##          C    0    0 1019   16    0
##          D    0    0    3  947    1
##          E    0    0    0    1 1081
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9942         
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9927         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9930   0.9932   0.9824   0.9991
## Specificity            0.9981   0.9989   0.9967   0.9992   0.9998
## Pos Pred Value         0.9952   0.9956   0.9845   0.9958   0.9991
## Neg Pred Value         0.9998   0.9983   0.9986   0.9966   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1922   0.1732   0.1609   0.1837
## Detection Prevalence   0.2856   0.1930   0.1759   0.1616   0.1839
## Balanced Accuracy      0.9988   0.9960   0.9949   0.9908   0.9994

The random forest model achieves an accuracy of 99.4% on the held-out testing subset. Because this subset was not used to fit the model, that accuracy is an estimate of out-of-sample performance, and the expected out-of-sample error is about 0.6%.
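
The error estimate and its uncertainty can be worked out from the confusion matrix counts. The sketch below re-enters the random forest counts by hand and uses base R's exact binomial test for the interval. I am assuming this matches the method behind the 95% CI printed by confusionMatrix, so the interval should come out close to the one reported above.

```r
# Confusion matrix for the random forest model, copied from the output above
# (rows = predicted class, columns = reference class).
cm_rf <- matrix(
  c(1673,    8,    0,   0,    0,
       1, 1131,    4,   0,    0,
       0,    0, 1019,  16,    0,
       0,    0,    3, 947,    1,
       0,    0,    0,   1, 1081),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

correct <- sum(diag(cm_rf))
total   <- sum(cm_rf)

oos_error <- 1 - correct / total                    # estimated out-of-sample error
ci_acc    <- binom.test(correct, total)$conf.int    # exact 95% interval for accuracy
ci_err    <- rev(1 - ci_acc)                        # same interval as an error rate

# Error rate and its interval, in percent.
round(100 * c(error = oos_error, lower = ci_err[1], upper = ci_err[2]), 2)
```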

Conclusion

The model was applied to the data set with 20 test cases, and all 20 predictions were correct according to the submission results. So, as expected from the held-out estimate, the random forest model had a high degree of accuracy.
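
As a rough consistency check, the held-out accuracy can be turned into the chance of getting all 20 cases right. The sketch assumes the 20 cases are independent and drawn from the same distribution as the held-out subset, which is a simplification.

```r
acc     <- 0.9942   # held-out accuracy reported above
n_cases <- 20       # number of test cases in the submission

p_all_correct   <- acc^n_cases          # chance that every case is predicted correctly
expected_errors <- n_cases * (1 - acc)  # expected number of wrong predictions

round(c(p_all_correct = p_all_correct, expected_errors = expected_errors), 3)
```

Under these assumptions a perfect score on 20 cases is the most likely outcome, which agrees with the submission result.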