Prediction Model using Weight Lifting Exercises Dataset

Overview

We build a model that predicts, as accurately as possible, how well a weight lifting exercise is performed. To do so, we use the public Weight Lifting Exercises Dataset and four machine learning techniques: Random Forest (RF), Generalized Boosted Regression Models (GBM), Linear Discriminant Analysis (LDA) and Recursive Partitioning and Regression Trees (RPART), each trained with cross-validation.

Exploratory Analysis

Datasets

URL.train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
URL.test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Fil.train <- "pml-training.csv"
Fil.test <- "pml-testing.csv"

if(!file.exists(Fil.train))
  download.file(URL.train, destfile = Fil.train)

if(!file.exists(Fil.test))
  download.file(URL.test, destfile = Fil.test)
  
Dat.train <- read.csv(Fil.train, na.strings=c("NA","#DIV/0!","")) 
Dat.test <- read.csv(Fil.test, na.strings=c("NA","#DIV/0!",""))

We download the two datasets (training and testing) and load them into memory. Having previously identified spreadsheet error strings (#DIV/0!) and empty fields in the raw files, we convert them to NA on import.
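As a minimal sketch of what the na.strings argument does, the following reads a small inline CSV instead of the real files. The three-row table and the names csv_text and demo are hypothetical and serve only as an illustration.

```r
# Hypothetical inline CSV, used only to illustrate na.strings.
csv_text <- "a,b,c\n1,#DIV/0!,x\nNA,2.5,\n3,4,y\n"

# The same na.strings as in the report: "NA", "#DIV/0!" and empty fields.
demo <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Count the missing values per column after the conversion.
colSums(is.na(demo))

# Column b is read as numeric because "#DIV/0!" became NA instead of text.
str(demo)
```

Without the "#DIV/0!" entry in na.strings, column b would be read as text, and numeric steps such as a correlation matrix would fail on it.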

Pre-Process Training Dataset

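library(caret)  # provides createDataPartition, nearZeroVar, findCorrelation, train, confusionMatrix
set.seed(13)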
Spl <- createDataPartition(Dat.train$classe, p = 0.7, list = FALSE)
Dat.train.train <- Dat.train[Spl, ]
Dat.train.valid <- Dat.train[-Spl, ]

Dat.train.train <- Dat.train.train[, -c(1:5)]
Dat.train.valid <- Dat.train.valid[, -c(1:5)]

nz <- nearZeroVar(Dat.train.train)
Dat.train.train <- Dat.train.train[, -nz]
Dat.train.valid <- Dat.train.valid[, -nz]

vna <- sapply(Dat.train.train, function(x) mean(is.na(x))) > 0.97
Dat.train.train <- Dat.train.train[, vna==FALSE]
Dat.train.valid <- Dat.train.valid[, vna==FALSE]

dim(Dat.train.train)

## [1] 13737    54

dim(Dat.train.valid)

## [1] 5885   54

descrCor <-  cor(Dat.train.train[, -length(Dat.train.train)])
highlyCorDescr <- findCorrelation(descrCor, cutoff = .8)
Dat.train.train <- Dat.train.train[,-highlyCorDescr]
Dat.train.valid <- Dat.train.valid[,-highlyCorDescr]

dim(Dat.train.train)

## [1] 13737    41

dim(Dat.train.valid)

## [1] 5885   41

Initially, the two datasets have 160 variables each. We split the training data into two parts: 70% to build the models and 30% to validate them, so that we can choose the most accurate one. We then remove the first 5 variables, which describe the measurement process itself or are identifiers of no use for prediction, followed by the variables with near-zero variance, and lastly the variables that are almost entirely NA (over 97% of their values).
Finally, we evaluate the correlation between the remaining 53 predictors (54 columns including the outcome ‘classe’) and drop predictors until no pair has an absolute correlation above 0.8. This leaves 40 predictors (41 columns including ‘classe’), which we deem appropriate for building the prediction models.
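The two filters described above can be sketched on a toy data frame in base R. The data frame toy, its columns and the 0.5 NA threshold are hypothetical. The last step only lists highly correlated pairs. It does not reproduce the selection rule of caret's findCorrelation, which also decides which member of each pair to drop.

```r
# Toy data frame (hypothetical), not the Weight Lifting Exercises data.
toy <- data.frame(
  x1 = c(1, 2, 3, 4, 5),
  x2 = c(2, 4, 6, 8, 10.5),    # almost a multiple of x1
  x3 = c(5, 1, 4, 2, 3),
  x4 = c(NA, NA, NA, NA, 1)    # 80% missing
)

# Step 1: drop columns whose share of NA exceeds a threshold
# (0.5 for this toy example, 0.97 in the report).
na_share <- sapply(toy, function(x) mean(is.na(x)))
toy <- toy[, na_share <= 0.5]

# Step 2: list the predictor pairs whose absolute correlation exceeds 0.8.
cm <- abs(cor(toy))
cm[lower.tri(cm, diag = TRUE)] <- 0   # keep each pair once, ignore the diagonal
high_pairs <- which(cm > 0.8, arr.ind = TRUE)
data.frame(var1 = rownames(cm)[high_pairs[, "row"]],
           var2 = colnames(cm)[high_pairs[, "col"]])
```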

Pre-Process Testing Dataset

Dat.test <- Dat.test[, -c(1:5)]
Dat.test <- Dat.test[, -nz]
Dat.test <- Dat.test[, vna==FALSE]
Dat.test <- Dat.test[,-highlyCorDescr]

dim(Dat.test)

## [1] 20 41

We must apply exactly the same transformations, in the same order, to the testing data provided, which we will use to predict the 20 test cases at the end of the report.
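The report removes columns from the testing data by position, which works because both files share the same column order. As a hedged alternative, and not the method used above, the columns can be matched by name so that the result does not depend on column order. The data frames train_demo and test_demo below are hypothetical.

```r
# Hypothetical stand-ins for the processed training data and the raw testing data.
train_demo <- data.frame(a = 1:3, b = 4:6, classe = c("A", "B", "A"))
test_demo  <- data.frame(id = 1:2, a = 7:8, junk = 0, b = 9:10, problem_id = 1:2)

# Names of the predictors that survived pre-processing of the training data.
predictors <- setdiff(names(train_demo), "classe")

# Keep exactly those predictors, in the same order, in the testing data.
test_aligned <- test_demo[, predictors, drop = FALSE]
names(test_aligned)
```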

Build different models

vControl <- trainControl(method="cv", number=4, verboseIter = FALSE)
vMetric <- "Accuracy"

We establish the general parameters that we will use to build all of the models. In every case we use 4-fold cross-validation and select the tuning parameters by Accuracy.
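To make the resampling scheme concrete, here is a minimal base R sketch of how 4-fold cross-validation partitions the rows. The sample size of 20 and the names n_obs and fold_id are hypothetical. caret builds its folds internally and may do so differently, for example by balancing the outcome classes across folds.

```r
set.seed(13)
n_obs <- 20   # hypothetical number of observations
k <- 4        # number of folds, as in trainControl(number = 4)

# Assign each row to one of k folds at random, with equal fold sizes.
fold_id <- sample(rep(seq_len(k), length.out = n_obs))

# Each fold is held out once while the model is fitted on the other k - 1 folds.
for (i in seq_len(k)) {
  holdout  <- which(fold_id == i)
  fit_rows <- which(fold_id != i)
  cat("fold", i, ": fit on", length(fit_rows), "rows, evaluate on",
      length(holdout), "rows\n")
}
```

The cross-validated Accuracy is then the average of the k hold-out accuracies, which is what caret uses to compare candidate tuning parameters.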

1.- Model LDA:

Modfit.lda <- train(classe ~ ., method = "lda", data = Dat.train.train, verbose = FALSE, trControl = vControl, metric = vMetric)

Pre.lda <- predict(Modfit.lda, Dat.train.valid)

confusionMatrix(Pre.lda, Dat.train.valid$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1300  188  118   46   51
##          B  139  663   84   69  176
##          C  118  118  657  122  138
##          D   88   83  136  653  137
##          E   29   87   31   74  580
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6547          
##                  95% CI : (0.6424, 0.6669)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5634          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7766   0.5821   0.6404   0.6774  0.53604
## Specificity            0.9043   0.9014   0.8979   0.9098  0.95399
## Pos Pred Value         0.7634   0.5862   0.5698   0.5953  0.72409
## Neg Pred Value         0.9106   0.8999   0.9220   0.9350  0.90126
## Prevalence             0.2845   0.1935   0.1743   0.1638  0.18386
## Detection Rate         0.2209   0.1127   0.1116   0.1110  0.09856
## Detection Prevalence   0.2894   0.1922   0.1959   0.1864  0.13611
## Balanced Accuracy      0.8404   0.7417   0.7691   0.7936  0.74502

We build the model using 70% of the training data and validate it with the remaining 30%. In this instance, the Accuracy is only 65.5%, so we conclude that LDA is not an appropriate technique for our data.
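The summary statistics can be recomputed by hand from the confusion matrix, which helps when reading the output. The sketch below copies the LDA counts printed above into a matrix (the name cm_lda is hypothetical) and derives Accuracy, per-class Sensitivity and Kappa in base R.

```r
# LDA confusion matrix copied from the output above
# (rows = predictions, columns = reference classes).
cm_lda <- matrix(c(1300, 188, 118,  46,  51,
                    139, 663,  84,  69, 176,
                    118, 118, 657, 122, 138,
                     88,  83, 136, 653, 137,
                     29,  87,  31,  74, 580),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(Prediction = LETTERS[1:5],
                                 Reference  = LETTERS[1:5]))

n <- sum(cm_lda)

# Accuracy: share of observations on the diagonal.
accuracy <- sum(diag(cm_lda)) / n

# Sensitivity per class: correct predictions divided by the true class size.
sensitivity <- diag(cm_lda) / colSums(cm_lda)

# Kappa: Accuracy corrected for the agreement expected by chance.
expected <- sum(rowSums(cm_lda) * colSums(cm_lda)) / n^2
kappa <- (accuracy - expected) / (1 - expected)

round(c(Accuracy = accuracy, Kappa = kappa), 4)
round(sensitivity, 4)
```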

2.- Model RPART:

Modfit.rpart <- train(classe ~ ., method = "rpart", data = Dat.train.train, trControl = vControl, metric = vMetric)

Pre.rpart <- predict(Modfit.rpart, Dat.train.valid)

confusionMatrix(Pre.rpart, Dat.train.valid$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1379  204   16   83   99
##          B   29  386   24  191  213
##          C  265  547  986  606  395
##          D    0    0    0    0    0
##          E    1    2    0   84  375
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5312         
##                  95% CI : (0.5183, 0.544)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4057         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8238  0.33889   0.9610   0.0000  0.34658
## Specificity            0.9045  0.90371   0.6269   1.0000  0.98189
## Pos Pred Value         0.7743  0.45789   0.3523      NaN  0.81169
## Neg Pred Value         0.9281  0.85065   0.9870   0.8362  0.86963
## Prevalence             0.2845  0.19354   0.1743   0.1638  0.18386
## Detection Rate         0.2343  0.06559   0.1675   0.0000  0.06372
## Detection Prevalence   0.3026  0.14325   0.4756   0.0000  0.07850
## Balanced Accuracy      0.8642  0.62130   0.7939   0.5000  0.66423

Similarly, we build the model with 70% of the training data and validate it with the remaining 30%. In this case, the Accuracy is only 53.1% and the tree never predicts class D, so RPART, at least with caret's default tuning, is clearly not an appropriate technique for our data.
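The NaN in the Pos Pred Value row of the RPART output has a simple cause, which the following sketch makes visible. It copies the counts printed above into a matrix (the name cm_rpart is hypothetical) and computes the positive predictive value per class.

```r
# RPART confusion matrix copied from the output above
# (rows = predictions, columns = reference classes).
cm_rpart <- matrix(c(1379, 204,  16,  83,  99,
                       29, 386,  24, 191, 213,
                      265, 547, 986, 606, 395,
                        0,   0,   0,   0,   0,
                        1,   2,   0,  84, 375),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(Prediction = LETTERS[1:5],
                                   Reference  = LETTERS[1:5]))

# How often each class is predicted at all.
predicted_counts <- rowSums(cm_rpart)
predicted_counts

# Positive predictive value: correct predictions divided by all predictions
# of that class. Class D is never predicted, so this is 0 / 0 = NaN.
ppv <- diag(cm_rpart) / predicted_counts
round(ppv, 4)
```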

3.- Model GBM:

Modfit.gbm <- train(classe ~ ., method = "gbm", data = Dat.train.train, trControl = vControl, metric = vMetric, verbose = FALSE)

Pre.gbm <- predict(Modfit.gbm, Dat.train.valid)

confusionMatrix(Pre.gbm, Dat.train.valid$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669    7    0    0    0
##          B    5 1117    5    0    0
##          C    0   12 1018    8    1
##          D    0    3    3  955    5
##          E    0    0    0    1 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9915          
##                  95% CI : (0.9888, 0.9937)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9893          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9807   0.9922   0.9907   0.9945
## Specificity            0.9983   0.9979   0.9957   0.9978   0.9998
## Pos Pred Value         0.9958   0.9911   0.9798   0.9886   0.9991
## Neg Pred Value         0.9988   0.9954   0.9983   0.9982   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1898   0.1730   0.1623   0.1828
## Detection Prevalence   0.2848   0.1915   0.1766   0.1641   0.1830
## Balanced Accuracy      0.9977   0.9893   0.9939   0.9942   0.9971

We follow the same procedure for this model: we build it with 70% of the training data and validate it with the remaining 30%. Here the Accuracy is very good, at 99.15%. Unless the remaining model does better, this is the one we would choose.

4.- Model RF:

Modfit.rf <- train(classe ~ ., method = "rf", data = Dat.train.train, trControl = vControl, metric = vMetric)

Pre.rf <- predict(Modfit.rf, Dat.train.valid)

confusionMatrix(Pre.rf, Dat.train.valid$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    2    0    0    0
##          B    0 1135    1    0    0
##          C    0    2 1025    2    0
##          D    0    0    0  962    0
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9988          
##                  95% CI : (0.9976, 0.9995)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9985          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9965   0.9990   0.9979   1.0000
## Specificity            0.9995   0.9998   0.9992   1.0000   1.0000
## Pos Pred Value         0.9988   0.9991   0.9961   1.0000   1.0000
## Neg Pred Value         1.0000   0.9992   0.9998   0.9996   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1929   0.1742   0.1635   0.1839
## Detection Prevalence   0.2848   0.1930   0.1749   0.1635   0.1839
## Balanced Accuracy      0.9998   0.9981   0.9991   0.9990   1.0000

As the results show, Random Forest is the most accurate of the four models, with a validation Accuracy of 99.88% (7 misclassified observations out of 5,885). This is the model that we will use to predict the outcome variable ‘classe’ for each of the 20 test cases. The value A represents the correct execution of the exercise, while B, C, D and E represent the 4 most common mistakes made when performing the exercise specified in the experiment.
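The comparison across the four models can be summarised in a few lines. The vector below simply collects the validation accuracies reported in the outputs above. The name acc is hypothetical.

```r
# Validation accuracies copied from the four confusionMatrix outputs above.
acc <- c(lda = 0.6547, rpart = 0.5312, gbm = 0.9915, rf = 0.9988)

# Rank the models from most to least accurate.
sort(acc, decreasing = TRUE)

# Name of the best model on the validation set.
best <- names(which.max(acc))
best
```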

Error

Out of sample error

Accu <- sum(Pre.rf == Dat.train.valid$classe) / length(Pre.rf)
Accu

## [1] 0.9988105

Error <- 1 - Accu
Error

## [1] 0.001189465

pError <- Error * 100
pError

## [1] 0.1189465

We have estimated the out-of-sample error of the Random Forest model on the 30% validation set and, as expected, it is very low: about 0.12%. This confirms Random Forest as the model to use for the final predictions.
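Because the validation set is finite, the error estimate carries some uncertainty. As a sketch, the counts from the Random Forest confusion matrix above (5,878 correct out of 5,885) can be passed to base R's binom.test to obtain an exact binomial confidence interval for the Accuracy, which should be close to the 95% CI printed by confusionMatrix. The variable names are hypothetical.

```r
# Counts taken from the Random Forest confusion matrix above.
n_valid   <- 5885
n_correct <- 1674 + 1135 + 1025 + 962 + 1082   # diagonal of the matrix

# Point estimates of Accuracy and out-of-sample error.
accu_rf  <- n_correct / n_valid
error_rf <- 1 - accu_rf
c(Accuracy = accu_rf, Error = error_rf)

# Exact (Clopper-Pearson) 95% confidence interval for the Accuracy.
ci_acc <- binom.test(n_correct, n_valid)$conf.int
ci_acc

# The corresponding interval for the error rate.
rev(1 - ci_acc)
```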

Prediction

Testing Dataset

Pre.rf.testing <- predict(Modfit.rf, Dat.test)
Pre.rf.testing

All 20 predictions produced by the selected model are correct. We verified this by submitting them to the Automated Grading Quiz.