This report describes the analysis of the Human Activity Recognition data, submitted as part of the Practical Machine Learning course from Johns Hopkins University offered through Coursera.
The objective is to analyze the exercise activity data and fit a model that predicts which of 5 classes of exercise was performed, based on data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The participants performed barbell lifts correctly and incorrectly in 5 different ways. Class A corresponds to the correct technique, and classes B to E correspond to various incorrect techniques.
The data set downloaded from http://groupware.les.inf.puc-rio.br/har is used to fit the model, with ‘classe’ as the target variable. Two methods, a classification tree (rpart) and a random forest, were fitted with cross-validation, and the random forest model was applied to the 20 test cases given.
The data contains many variables that consist almost entirely of missing values. Because more than 90% of the rows are missing for these variables, it is better to remove them than to attempt any imputation. The first 7 variables are label variables and are removed as well.
#read in the data
harTrain <- read.csv("pml-training.csv", na.strings=c("NA",""))
ncol(harTrain)
FALSE [1] 160
#remove variables with >90% missing values
harTrainSub <- harTrain[,colSums(is.na(harTrain)) <= .9*nrow(harTrain)]
#removing the first 7 columns
harTrainSub <- harTrainSub[,-(1:7)]
ncol(harTrainSub)
FALSE [1] 53
# count the columns that still contain missing values
sum(colSums(is.na(harTrainSub)) > 0)
FALSE [1] 0
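The column filter above can be checked on a small example. The sketch below is only an illustration: the data frame `toy` and its column names are made up and are not part of the HAR data. It applies the same `colSums(is.na(...))` rule and shows which columns survive.

```r
# Hypothetical toy data: 20 rows, with one column that is 95% missing
toy <- data.frame(
  id       = 1:20,
  mostlyNA = c(3.2, rep(NA, 19)),          # 19 of 20 missing -> dropped
  halfNA   = rep(c(1.0, NA), times = 10),  # 10 of 20 missing -> kept
  dense    = seq(0.5, 10, by = 0.5)        # no missing values -> kept
)

# Same rule as in the report: keep columns with at most 90% missing rows
toySub <- toy[, colSums(is.na(toy)) <= 0.9 * nrow(toy), drop = FALSE]

names(toySub)
```

Note that the rule only removes the nearly empty columns. A column such as `halfNA` would still contain missing values afterwards, which is why the report checks for remaining NAs as a separate step.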
Splitting the data into training and testing subsets
library(caret)
set.seed(100)
splt <- createDataPartition(harTrainSub$classe, p = 0.7, list = FALSE)
training <- harTrainSub[splt,]
testing <- harTrainSub[-splt,]
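createDataPartition samples within each level of the outcome, so both subsets keep roughly the same class mix. The base R sketch below imitates that idea by hand. It is a simplified stand-in, not caret's implementation, and the simulated `classe` vector is a made-up substitute for harTrainSub$classe.

```r
set.seed(100)

# Simulated labels (assumption: stands in for harTrainSub$classe)
classe <- factor(sample(c("A", "B", "C", "D", "E"), 1000, replace = TRUE,
                        prob = c(0.28, 0.19, 0.17, 0.16, 0.20)))

# Draw 70% of the row indices separately within each class
trainIdx <- unlist(lapply(split(seq_along(classe), classe), function(idx) {
  idx[sample.int(length(idx), size = ceiling(0.7 * length(idx)))]
}), use.names = FALSE)

# Class proportions in the two subsets should be nearly identical
round(prop.table(table(classe[trainIdx])), 3)
round(prop.table(table(classe[-trainIdx])), 3)
```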
An rpart classification tree is attempted first. K-fold cross-validation is performed using the trainControl function in the caret package, with the number of folds set to 10.
mdlTree <- train(classe~., data = training, method = "rpart", trControl = trainControl(method = "cv", number = 10) )
library(rattle)
library(rpart)
library(rpart.plot)
fancyRpartPlot(mdlTree$finalModel)
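To make the cross-validation step more concrete, the sketch below runs 10-fold cross-validation by hand in base R. It is an illustration only: it uses the built-in iris data and a very simple nearest-centroid classifier as stand-ins for the HAR data and rpart, and the helper names `fitCentroids` and `predictCentroids` are made up for this example. caret's train performs the fold handling internally.

```r
set.seed(100)
k <- 10
n <- nrow(iris)

# Randomly assign each row to one of k folds of equal size
foldId <- sample(rep(seq_len(k), length.out = n))

# Nearest-centroid classifier (a deliberately simple stand-in for rpart)
fitCentroids <- function(x, y) {
  sapply(split(as.data.frame(x), y), colMeans)  # one column of means per class
}
predictCentroids <- function(centroids, x) {
  x <- as.matrix(x)
  # squared distance from every row to every class centroid
  d <- sapply(seq_len(ncol(centroids)), function(j) {
    rowSums(sweep(x, 2, centroids[, j])^2)
  })
  colnames(centroids)[max.col(-d, ties.method = "first")]
}

# Fit on k - 1 folds, score on the held-out fold, repeat for every fold
foldAcc <- sapply(seq_len(k), function(f) {
  held <- foldId == f
  cen  <- fitCentroids(iris[!held, 1:4], iris$Species[!held])
  pred <- predictCentroids(cen, iris[held, 1:4])
  mean(pred == iris$Species[held])
})

mean(foldAcc)  # cross-validated accuracy estimate
```

Each row is held out exactly once, so the mean of the fold accuracies uses every observation for validation without ever scoring a row with a model that was fitted on it.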
#Applying the model to the testing set and checking the accuracy
predTree <- predict(mdlTree, newdata = testing)
confusionMatrix(predTree, testing$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1493  506  504  438  152
##          B   33  369   35  170  141
##          C  118  264  487  356  287
##          D    0    0    0    0    0
##          E   30    0    0    0  502
##
## Overall Statistics
##
##                Accuracy : 0.4845
##                  95% CI : (0.4716, 0.4973)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.3256
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8919   0.3240  0.47466   0.0000   0.4640
## Specificity            0.6200   0.9201  0.78905   1.0000   0.9938
## Pos Pred Value         0.4827   0.4933  0.32209      NaN   0.9436
## Neg Pred Value         0.9352   0.8501  0.87674   0.8362   0.8916
## Prevalence             0.2845   0.1935  0.17434   0.1638   0.1839
## Detection Rate         0.2537   0.0627  0.08275   0.0000   0.0853
## Detection Prevalence   0.5256   0.1271  0.25692   0.0000   0.0904
## Balanced Accuracy      0.7560   0.6221  0.63186   0.5000   0.7289
The accuracy of the model is only about 48%, which is still better than the baseline accuracy of 28% (predicting the most frequent class, A, for every case). The tree also never predicts class D. We can attempt to improve on this using other ML algorithms. Changing the value of K did not yield any substantial improvement in the accuracy of the model.
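The accuracy and baseline figures can be recomputed directly from the counts in the confusion matrix. The sketch below retypes the rpart matrix by hand (the object name `cmTree` is made up for this example) and derives the overall accuracy, the no-information rate, and the per-class sensitivity.

```r
# Counts copied from the rpart confusion matrix
# (rows = predicted class, columns = reference class)
cmTree <- matrix(c(1493, 506, 504, 438, 152,
                     33, 369,  35, 170, 141,
                    118, 264, 487, 356, 287,
                      0,   0,   0,   0,   0,
                     30,   0,   0,   0, 502),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Prediction = LETTERS[1:5],
                                 Reference  = LETTERS[1:5]))

accuracy    <- sum(diag(cmTree)) / sum(cmTree)     # share of correct predictions
nir         <- max(colSums(cmTree)) / sum(cmTree)  # always guessing the largest class
sensitivity <- diag(cmTree) / colSums(cmTree)      # correct share within each true class

round(c(accuracy = accuracy, nir = nir), 4)  # 0.4845 and 0.2845
round(sensitivity, 4)
```

The all-zero row for D is what drives its sensitivity to 0: none of the 964 true class D cases are recovered.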
The train function in the caret package with method = “rf” is used to build a random forest model, again passing trainControl with 10-fold cross-validation. K = 10 was chosen as a value that the random forest algorithm could handle given the RAM limitation and run time on my machine.
set.seed(100)
mdlRF <- train(classe ~ ., data = training, method = "rf", trControl = trainControl(method = "cv", number = 10))
save(mdlRF,file = "harRF.RData")
#Plotting the variables in order of importance
plot(varImp(mdlRF))
#Applying the model to the testing set and checking the accuracy
predRF <- predict(mdlRF, newdata = testing)
confusionMatrix(predRF, testing$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    8    0    0    0
##          B    1 1131    4    0    0
##          C    0    0 1019   16    0
##          D    0    0    3  947    1
##          E    0    0    0    1 1081
##
## Overall Statistics
##
##                Accuracy : 0.9942
##                  95% CI : (0.9919, 0.996)
##     No Information Rate : 0.2845
##     P-Value [Acc > NIR] : < 2.2e-16
##
##                   Kappa : 0.9927
##  Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9930   0.9932   0.9824   0.9991
## Specificity            0.9981   0.9989   0.9967   0.9992   0.9998
## Pos Pred Value         0.9952   0.9956   0.9845   0.9958   0.9991
## Neg Pred Value         0.9998   0.9983   0.9986   0.9966   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1922   0.1732   0.1609   0.1837
## Detection Prevalence   0.2856   0.1930   0.1759   0.1616   0.1839
## Balanced Accuracy      0.9988   0.9960   0.9949   0.9908   0.9994
The random forest model predicts with an accuracy of 99.4% on the held-out testing subset. Because this subset was not used to fit the model, that accuracy serves as an estimate of out-of-sample performance, so the expected out-of-sample error is about 0.6%.
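The out-of-sample error estimate follows from the misclassified count in the random forest confusion matrix. The sketch below retypes that matrix (the object name `cmRF` is made up for this example) and computes the error rate. It also calls binom.test for an exact binomial interval on the accuracy; as far as I know this is the method confusionMatrix uses for its 95% CI, so the result should be close to the interval reported above.

```r
# Counts copied from the random forest confusion matrix
# (rows = predicted class, columns = reference class)
cmRF <- matrix(c(1673,    8,    0,   0,    0,
                    1, 1131,    4,   0,    0,
                    0,    0, 1019,  16,    0,
                    0,    0,    3, 947,    1,
                    0,    0,    0,   1, 1081),
               nrow = 5, byrow = TRUE,
               dimnames = list(Prediction = LETTERS[1:5],
                               Reference  = LETTERS[1:5]))

correct  <- sum(diag(cmRF))  # correctly classified cases
total    <- sum(cmRF)        # all cases in the testing subset
oosError <- 1 - correct / total

round(100 * oosError, 2)     # estimated out-of-sample error in percent: 0.58

# Exact binomial 95% interval for the accuracy
ci <- binom.test(correct, total)$conf.int
ci
```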
The model was applied to the data set with 20 test cases, and all 20 predictions were correct according to the submission results. This is consistent with the high accuracy estimated on the testing subset.