Summary

This report describes the construction of a model that predicts how a simple dumbbell lifting exercise was executed, out of five possible types of execution, based on data collected by wearable devices. All the data were made available by the authors of the study “Qualitative Activity Recognition of Weight Lifting Exercises”, cited at the bottom of this document.
The first step was the selection of a few relevant predictors, for which a simple CART model was built. As expected, this model did not provide acceptable accuracy, but it proved a very efficient tool for selecting predictors of interest.
In a second step these predictors were used to build a more sophisticated Random Forest model with a low risk of overfitting. Cross-validation was conducted by subdividing the original training dataset into training and validation partitions. This second model was then used to predict the testing dataset, with 100% accurate results on the prediction quiz provided by the course instructor.

Selection of predictors to prevent risk of overfitting

Build a simple CART model to select predictors, starting with the preparation of the data. The libraries needed for this analysis are dplyr, caret, and rattle.
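
For completeness, the corresponding library calls are shown below; rattle provides fancyRpartPlot, which is used further down.

library(dplyr)
library(caret)
library(rattle)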

as_training<-read.csv("pml-training.csv")
as_testing<-read.csv("pml-testing.csv")

Define two functions to easily remove variables containing NA values and factor variables. If improving model accuracy later required it, these variables could be added back. Add the response variable back, and also remove the first few variables, which contain non-relevant information.

not_any_na <- function(x) all(!is.na(x))      ## TRUE if a column contains no NA values
as_tr <- select_if(as_training, not_any_na)   ## keep only the complete columns
not_factor <- function(x) !is.factor(x)       ## TRUE if a column is not a factor
as_tr_c <- select_if(as_tr, not_factor)       ## drop the factor columns
as_tr_c$classe <- as_training$classe          ## add the response variable back
as_tr_c_r <- as_tr_c[, -(1:4)]                ## drop the first four non-measurement columns

Data partition for cross-validation.
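
For reproducibility it helps to fix the random seed before creating the partition; the original analysis does not show a seed, so the value below is only an illustrative choice.

set.seed(12345)  ## arbitrary seed, not part of the original analysis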

trainIn <- createDataPartition(as_tr_c_r$classe, p = 2/3, list = FALSE)  ## 2/3 training, 1/3 validation
tr_in <- as_tr_c_r[trainIn, ]
test_in <- as_tr_c_r[-trainIn, ]

And now build the simple CART model, just for the purpose of selecting relevant predictors to be used in the final model.

modTree <- train(classe ~ .,method="rpart",data=tr_in)
fancyRpartPlot(modTree$finalModel)
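
As a cross-check on the plot, the split variables can also be listed programmatically from the fitted rpart object; this is a minimal sketch that reuses the modTree object created above.

## Variables actually used for splits in the CART model
used_vars <- unique(as.character(modTree$finalModel$frame$var))
setdiff(used_vars, "<leaf>")  ## "<leaf>" marks terminal nodes, not a predictor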

The variables shown in the plot above, together with their associated predictors (for example, the same measurement taken along the other axes), will be used to build the final model, along with the outcome variable. The new reduced dataset is:

tr_in_red<-tr_in[,c(1,2,3,19,20,21,22,23,24,40,41,42,37,38,39,53)]
##Names of the variables
names(tr_in_red)
##  [1] "roll_belt"         "pitch_belt"        "yaw_belt"         
##  [4] "gyros_arm_y"       "gyros_arm_z"       "accel_arm_x"      
##  [7] "accel_arm_y"       "accel_arm_z"       "magnet_arm_x"     
## [10] "roll_forearm"      "pitch_forearm"     "yaw_forearm"      
## [13] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [16] "classe"
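
An equivalent, and arguably more robust, way to build the reduced dataset is to select these columns by name rather than by position; the sketch below reproduces the selection above.

pred_names <- c("roll_belt", "pitch_belt", "yaw_belt",
                "gyros_arm_y", "gyros_arm_z", "accel_arm_x",
                "accel_arm_y", "accel_arm_z", "magnet_arm_x",
                "roll_forearm", "pitch_forearm", "yaw_forearm",
                "magnet_dumbbell_x", "magnet_dumbbell_y", "magnet_dumbbell_z",
                "classe")
tr_in_red <- tr_in[, pred_names]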

As a side note, the accuracy of this CART model when predicting the validation partition test_in is only 0.4886, which is poor; but the purpose of this step is to reduce the number of variables, and for that the tree is very helpful, as the final results show.
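
That figure can be reproduced from the objects defined above; predTree is a name introduced here for illustration.

predTree <- predict(modTree, test_in)
confusionMatrix(predTree, test_in$classe)$overall["Accuracy"]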

Final model

As explained, a Random Forest model is built for the prediction. Cross-validation uses the data partition above, which produced the datasets tr_in and test_in; the training partition has already been reduced to tr_in_red by keeping only the predictors selected through the CART model.

## Random Forest model on the reduced training partition
mod2 <- train(classe ~ ., method = "rf", data = tr_in_red)
## Predict the hold-out partition for cross-validation
pred2 <- predict(mod2, test_in)
confusionMatrix(pred2, test_in$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1859    9    0    0    0
##          B    1 1238   15    0    3
##          C    0   17 1123   19    1
##          D    0    1    2 1053    2
##          E    0    0    0    0 1196
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9893          
##                  95% CI : (0.9865, 0.9916)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9865          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9995   0.9787   0.9851   0.9823   0.9950
## Specificity            0.9981   0.9964   0.9931   0.9991   1.0000
## Pos Pred Value         0.9952   0.9849   0.9681   0.9953   1.0000
## Neg Pred Value         0.9998   0.9949   0.9968   0.9965   0.9989
## Prevalence             0.2844   0.1935   0.1743   0.1639   0.1838
## Detection Rate         0.2843   0.1893   0.1717   0.1610   0.1829
## Detection Prevalence   0.2857   0.1922   0.1774   0.1618   0.1829
## Balanced Accuracy      0.9988   0.9875   0.9891   0.9907   0.9975

Based on these results, the out-of-sample performance of the model, which for a categorical outcome can be estimated with the hold-out accuracy, is 0.9893 (95% CI 0.9865–0.9916); the expected out-of-sample error is therefore about 1 - 0.9893 = 0.0107, or roughly 1%.
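
The same estimate can be computed directly from the hold-out predictions:

## Estimated out-of-sample error = 1 - hold-out accuracy (about 1 - 0.9893 = 0.0107)
1 - confusionMatrix(pred2, test_in$classe)$overall["Accuracy"]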

The final step is to run the prediction on the original testing dataset, loaded as “as_testing” at the top of the document.

pred3<-predict(mod2,as_testing)
pred3
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

The students of this course can easily confirm that the predictions above achieve a 100% success rate on the Prediction Quiz provided by the instructor.

Dataset obtained from: Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.

Read more: http://groupware.les.inf.puc-rio.br/har