NeededPackages <- c("datasets", "dplyr", "ggplot2", "caret", "reshape2", "tibble")
lapply(NeededPackages, require, character.only = TRUE)
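
Note that require returns FALSE rather than stopping when a package is missing. A minimal guard, placed before the lapply call above, could install anything that is absent (illustrative, not part of the original analysis):

# Illustrative guard: install any required packages that are missing
absent <- NeededPackages[!sapply(NeededPackages, requireNamespace, quietly = TRUE)]
if (length(absent) > 0) install.packages(absent)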

Introduction

The analysis below was completed as part of the evaluation requirements for the Practical Machine Learning course in the Coursera / Johns Hopkins Data Science Specialization. In this report a data set on sport activity is provided, with the goal of training a model that classifies how an activity was performed, based on a number of provided covariates. The report provides a reproducible analysis that starts from the raw data and ends with a highly accurate Random Forest model with an out of sample accuracy above 99%. In doing so it highlights the care that should be taken when building ML models.

Data Preparation

The data preparation is structured to minimize the computational requirements of the analysis. First and foremost it aims to clean up the major shortcomings of the data, so that as much unnecessary data as possible is removed before anything is passed on to the downstream analysis.

Loading the data

The very first step is to download the necessary data for both the training and the testing sets.

urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "Train.csv")
download.file(urlTest, destfile = "Testing.csv")
Testing <- as_tibble(read.csv(file = "Testing.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE))
Training <- as_tibble(read.csv(file = "Train.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE))
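
Since the source files rarely change, the download step could be guarded so that repeated runs do not re-fetch them (a minimal sketch):

# Illustrative: skip the download if the files are already present locally
if (!file.exists("Train.csv")) download.file(urlTrain, destfile = "Train.csv")
if (!file.exists("Testing.csv")) download.file(urlTest, destfile = "Testing.csv")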

Unique Identifiers

Before proceeding with model fitting, it is important to remove the X variable, which is just a row counter. The reasoning for this is demonstrated in the code below.

# X is identical to the row number, i.e. it carries no information of its own
rowNumber <- seq_along(Training$X)
identical(rowNumber, Training$X)
## [1] TRUE

The X variable behaves exactly the same way in the test data, and this is why it has to be removed. Because the data sets are ordered on the target variable, a model trained with the X variable left in would be a perfect model in sample but a very poor one out of sample.
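
The ordering itself can be verified directly (an illustrative check, not part of the original analysis):

# Count how often classe changes between consecutive rows; a value far
# below nrow(Training) shows the rows are grouped by the target, which is
# what would let X "predict" classe in sample.
sum(Training$classe[-1] != Training$classe[-nrow(Training)])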

# Drop the row-count column X from both data sets
Testing <- Testing[, -c(1)]
Training <- Training[, -c(1)]

Non-Varying Features

It is clear from the description of the data set that there are many non-varying variables that cannot explain any variation in the target variable. It is therefore important to remove these variables from the data set so that they do not add computational cost. Since the main task is to predict the target variable in the quiz test set, any variable that shows no variation there is of no use in predicting the target outcome. The command further below captures the non-varying variables in the test data set and removes them from both data sets.
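
For reference, nearZeroVar can also return per-column diagnostics instead of column indexes; a sketch of inspecting them before the removal (illustrative):

# Illustrative: saveMetrics = TRUE returns a data frame of diagnostics;
# the nzv column flags columns with near-zero variance in the test set.
nzvMetrics <- nearZeroVar(x = Testing, freqCut = 99/1, saveMetrics = TRUE)
head(nzvMetrics[nzvMetrics$nzv, ])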

ZeroVar <- nearZeroVar(x = Testing, freqCut = 99/1, saveMetrics = FALSE)
NearZero <- Training[, ZeroVar]  # keep the discarded columns for reference
Training <- Training[, -ZeroVar]
Testing <- Testing[, -ZeroVar]
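
A quick sanity check (illustrative) confirms that both sets keep the same width, differing only in the final column (classe in the training set, problem_id in the test set):

# Illustrative: the two sets should now have the same number of columns
dim(Training)
dim(Testing)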

Training, Testing & Validation Sets

With the data sets prepared, a validation set is needed to measure the out of sample accuracy of the fitted models. Hence, in what follows, the analysis splits the “Training” data into actual training and validation sets with a 3:1 split. Different algorithms are then trained on the actual training data and validated against the validation set until a sufficiently good model is derived. Once such a model is achieved, it is applied to the test data to obtain the estimates for the quiz.

inTrain <- createDataPartition(y = Training$classe, p = 0.75, list = FALSE)
inTrain <- as.vector(inTrain)  # tibbles can no longer be indexed by a matrix
Validation <- Training[-inTrain, ]
Training <- Training[inTrain, ]
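
As a quick check (illustrative), the stratified sampling performed by createDataPartition should leave the class proportions nearly identical in both splits:

# Illustrative: class proportions should match across the two splits
round(prop.table(table(Training$classe)), 3)
round(prop.table(table(Validation$classe)), 3)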

Modeling

Classification Trees

# Fit a single classification tree via caret and plot it
modFitCT <- train(classe ~ ., data = Training, method = "rpart")
rattle::fancyRpartPlot(modFitCT$finalModel, sub = NULL)

# In sample error 
confusionMatrix(predict(modFitCT, Training), Training$classe)     
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 3851 1440  399  845  234
##          B   51  802  144  342  704
##          C  271  606 2024 1225  540
##          D    0    0    0    0    0
##          E   12    0    0    0 1228
## 
## Overall Statistics
##                                          
##                Accuracy : 0.5371         
##                  95% CI : (0.529, 0.5452)
##     No Information Rate : 0.2843         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.4001         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9202  0.28160   0.7885   0.0000  0.45381
## Specificity            0.7230  0.89545   0.7826   1.0000  0.99900
## Pos Pred Value         0.5689  0.39256   0.4338      NaN  0.99032
## Neg Pred Value         0.9580  0.83858   0.9460   0.8361  0.89034
## Prevalence             0.2843  0.19350   0.1744   0.1639  0.18386
## Detection Rate         0.2617  0.05449   0.1375   0.0000  0.08344
## Detection Prevalence   0.4599  0.13881   0.3170   0.0000  0.08425
## Balanced Accuracy      0.8216  0.58853   0.7855   0.5000  0.72640
# Out of sample error
pred <- predict(modFitCT, newdata = Validation[,-58])

confusionMatrix(pred, Validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1276  490  120  270   54
##          B   17  260   39  106  236
##          C  100  199  696  428  208
##          D    0    0    0    0    0
##          E    2    0    0    0  403
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5373          
##                  95% CI : (0.5232, 0.5513)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.4012          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9147  0.27397   0.8140   0.0000  0.44728
## Specificity            0.7338  0.89937   0.7691   1.0000  0.99950
## Pos Pred Value         0.5774  0.39514   0.4267      NaN  0.99506
## Neg Pred Value         0.9558  0.83773   0.9514   0.8361  0.88931
## Prevalence             0.2845  0.19352   0.1743   0.1639  0.18373
## Detection Rate         0.2602  0.05302   0.1419   0.0000  0.08218
## Detection Prevalence   0.4507  0.13418   0.3326   0.0000  0.08259
## Balanced Accuracy      0.8243  0.58667   0.7916   0.5000  0.72339
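
For reference, train() used its default bootstrap resampling here; explicit k-fold cross-validation and a grid over the complexity parameter cp could be requested instead (a sketch with illustrative, untuned values):

# Illustrative: 5-fold cross-validation with an explicit cp grid for rpart
ctrl <- trainControl(method = "cv", number = 5)
modFitCT2 <- train(classe ~ ., data = Training, method = "rpart",
                   trControl = ctrl,
                   tuneGrid = data.frame(cp = seq(0.001, 0.05, by = 0.005)))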

A classification tree has been successfully trained. However, its accuracy leaves a lot to be desired. In the next step the analysis trains and tests a random forest algorithm.

Random Forest

# Fit a random forest directly with the randomForest package
modFitRF <- randomForest::randomForest(classe ~ ., data = Training)
# In sample error 
confusionMatrix(predict(modFitRF, Training), Training$classe)     
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4185    0    0    0    0
##          B    0 2848    0    0    0
##          C    0    0 2567    0    0
##          D    0    0    0 2412    0
##          E    0    0    0    0 2706
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9997, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1839
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1839
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
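
Before turning to the validation set, note that randomForest already reports an out-of-bag (OOB) error estimate during training, which approximates the out of sample error without a held-out set; printing the fitted object displays it:

# The OOB estimate reported here is an internal out-of-sample error estimate
print(modFitRF)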
# Out of sample error
pred <- predict(modFitRF, newdata = Validation[,-58])

confusionMatrix(pred, Validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    0  949    2    0    0
##          C    0    0  853    1    0
##          D    0    0    0  802    0
##          E    0    0    0    1  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9992          
##                  95% CI : (0.9979, 0.9998)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.999           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   0.9977   0.9975   1.0000
## Specificity            1.0000   0.9995   0.9998   1.0000   0.9998
## Pos Pred Value         1.0000   0.9979   0.9988   1.0000   0.9989
## Neg Pred Value         1.0000   1.0000   0.9995   0.9995   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1935   0.1739   0.1635   0.1837
## Detection Prevalence   0.2845   0.1939   0.1741   0.1635   0.1839
## Balanced Accuracy      1.0000   0.9997   0.9987   0.9988   0.9999
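
Although not required for the quiz, the forest's variable importance can be inspected to see which covariates drive the classification (illustrative sketch):

# Illustrative: rank covariates by mean decrease in Gini impurity
imp <- randomForest::importance(modFitRF)
head(imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE], 10)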

Results

Under normal conditions more model tuning and ensembling would be used to further increase the accuracy of the final model used for prediction. However, the Random Forest model trained above is already highly accurate, with an in sample accuracy of 1 and an out of sample accuracy of 0.9991843. The analysis therefore does not proceed further, but rather uses this Random Forest model to predict the classes of the testing data. To not violate the Coursera Honor Code, the results are not printed here.

# Align factor levels between the training and test sets: prepending a
# training row makes rbind coerce the test factors to the training levels;
# the extra row is then dropped again before predicting.
Testing <- rbind(Training[1, -58], Testing[, -58])
Testing <- Testing[-1, ]
predict(modFitRF, newdata = Testing)
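
The rbind trick above coerces the factor levels of the test set to match the training set. An explicit alternative (a sketch, assuming the shared factor columns are the only mismatch) re-encodes each such column with the training levels:

# Illustrative alternative: re-encode shared factor columns with the
# training levels so randomForest::predict sees identical definitions.
for (col in intersect(names(Testing), names(Training))) {
  if (is.factor(Training[[col]])) {
    Testing[[col]] <- factor(Testing[[col]], levels = levels(Training[[col]]))
  }
}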