# Load all packages used in the analysis
NeededPackages <- c("datasets", "dplyr", "ggplot2", "caret", "reshape2", "tibble")
invisible(lapply(NeededPackages, require, character.only = TRUE))
The analysis below was completed as part of the evaluation requirement for the Practical Machine Learning course of the Coursera / Johns Hopkins Data Science Specialization. In this report, a data set on sport activity is provided, and the goal is to train a model that classifies the manner in which an activity was performed based on a number of covariates. The report provides a reproducible analysis that starts from the raw data and ends with a highly accurate Random Forest model with an out-of-sample accuracy above 99%. In doing so, it highlights the care that should be taken when building ML models.
The data preparation is structured to keep the computational requirements of the analysis low. First and foremost, it aims to clean up the major shortcomings of the data, so that as much unnecessary data as possible is removed before anything is passed to the downstream analysis.
The very first step is to download the data for both the training and the testing set.
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain, destfile = "Train.csv")
download.file(urlTest, destfile = "Testing.csv")
Testing <- as_tibble(read.csv(file = "Testing.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE))
Training <- as_tibble(read.csv(file = "Train.csv", header = TRUE, sep = ",", stringsAsFactors = TRUE))
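A quick sanity check confirms that both files loaded as expected. The two sets should share the same predictor columns; the test file carries a quiz identifier column in place of the classe target, so the only name-level differences should be those two columns.
# Sanity check: dimensions, plus the columns unique to each set
dim(Training)
dim(Testing)
setdiff(names(Training), names(Testing))
setdiff(names(Testing), names(Training))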
Before proceeding with model fitting, it is important to remove the X variable, which is just a row counter. The reasoning for this is demonstrated in the code below.
rowNumber <- seq_len(nrow(Training))  # a plain 1, 2, ..., n row index
identical(rowNumber, Training$X)
## [1] TRUE
The X variable behaves the same way in the test data, which is why it has to be removed from both sets. Because the data sets are ordered on the target variable, a model trained with X left in would look perfect in-sample but perform very poorly out of sample.
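The ordering claim itself is easy to verify: if the rows were shuffled, a run-length encoding of the classe column would return thousands of short runs, whereas contiguous blocks collapse it to just a handful. This is a quick check, not part of the original pipeline.
# Few long runs = classe values sit in contiguous, ordered blocks
length(rle(as.character(Training$classe))$lengths)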
Testing <- Testing[, -c(1)]
Training <- Training[, -c(1)]
It is clear from the description of the data set that there are many non-varying variables that cannot explain any variation in the target variable, so it is important to remove these variables to avoid unnecessary computational cost. Since the main task is to predict the target variable in the quiz test set, any variable that shows no variation there is of no use for prediction. The commands below capture the non-varying variables in the quiz test set and remove them from both data sets.
# Indices of near-zero-variance columns, judged on the quiz test set
ZeroVar <- nearZeroVar(x = Testing, freqCut = 99/1, saveMetrics = FALSE)
NearZero <- Training[, ZeroVar]  # keep a copy of the columns being dropped
Training <- Training[, -ZeroVar]
Testing <- Testing[, -ZeroVar]
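The NearZero copy kept above makes it easy to see how aggressive the filter was; setting saveMetrics = TRUE in the nearZeroVar call would additionally return the frequency-ratio and percent-unique diagnostics behind each decision.
# How many columns were dropped, and a peek at the first few of them
length(ZeroVar)
str(NearZero, list.len = 5)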
With the data sets prepared, a validation set is needed so that the accuracy of fitted models can be measured out of sample. The analysis therefore splits the “Training” data into an actual training set and a validation set with a 3:1 split. Different algorithms are then trained on the actual training data and validated on the validation set until a sufficiently good model is found. Once that is achieved, the model is applied to the test data to generate the predictions for the quiz.
# createDataPartition returns a one-column matrix; as.vector() avoids
# the tibble deprecation warning about subsetting with a matrix
inTrain <- as.vector(createDataPartition(y = Training$classe, p = 0.75, list = FALSE))
Validation <- Training[-inTrain, ]
Training <- Training[inTrain, ]
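A quick check confirms that the split landed close to the intended 3:1 ratio and that the class distribution is preserved in both parts.
# Roughly three training rows per validation row, with matching class shares
nrow(Training) / nrow(Validation)
round(prop.table(table(Training$classe)), 3)
round(prop.table(table(Validation$classe)), 3)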
modFitCT <- train(classe ~ ., data = Training, method = "rpart" )
rattle::fancyRpartPlot(modFitCT$finalModel, sub = NULL)
# In-sample error
confusionMatrix(predict(modFitCT, Training), Training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 3851 1440 399 845 234
## B 51 802 144 342 704
## C 271 606 2024 1225 540
## D 0 0 0 0 0
## E 12 0 0 0 1228
##
## Overall Statistics
##
## Accuracy : 0.5371
## 95% CI : (0.529, 0.5452)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4001
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9202 0.28160 0.7885 0.0000 0.45381
## Specificity 0.7230 0.89545 0.7826 1.0000 0.99900
## Pos Pred Value 0.5689 0.39256 0.4338 NaN 0.99032
## Neg Pred Value 0.9580 0.83858 0.9460 0.8361 0.89034
## Prevalence 0.2843 0.19350 0.1744 0.1639 0.18386
## Detection Rate 0.2617 0.05449 0.1375 0.0000 0.08344
## Detection Prevalence 0.4599 0.13881 0.3170 0.0000 0.08425
## Balanced Accuracy 0.8216 0.58853 0.7855 0.5000 0.72640
# Out-of-sample error (column 58 is the classe target)
pred <- predict(modFitCT, newdata = Validation[, -58])
confusionMatrix(pred, Validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1276 490 120 270 54
## B 17 260 39 106 236
## C 100 199 696 428 208
## D 0 0 0 0 0
## E 2 0 0 0 403
##
## Overall Statistics
##
## Accuracy : 0.5373
## 95% CI : (0.5232, 0.5513)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.4012
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9147 0.27397 0.8140 0.0000 0.44728
## Specificity 0.7338 0.89937 0.7691 1.0000 0.99950
## Pos Pred Value 0.5774 0.39514 0.4267 NaN 0.99506
## Neg Pred Value 0.9558 0.83773 0.9514 0.8361 0.88931
## Prevalence 0.2845 0.19352 0.1743 0.1639 0.18373
## Detection Rate 0.2602 0.05302 0.1419 0.0000 0.08218
## Detection Prevalence 0.4507 0.13418 0.3326 0.0000 0.08259
## Balanced Accuracy 0.8243 0.58667 0.7916 0.5000 0.72339
A classification tree has been successfully trained. However, its accuracy leaves a lot to be desired. In the next step, the analysis trains and tests a random forest.
modFitRF <- randomForest::randomForest(classe ~ ., data = Training)
# In-sample error
confusionMatrix(predict(modFitRF, Training), Training$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4185 0 0 0 0
## B 0 2848 0 0 0
## C 0 0 2567 0 0
## D 0 0 0 2412 0
## E 0 0 0 0 2706
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9997, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1839
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Out-of-sample error
pred <- predict(modFitRF, newdata = Validation[, -58])
confusionMatrix(pred, Validation$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 0 949 2 0 0
## C 0 0 853 1 0
## D 0 0 0 802 0
## E 0 0 0 1 901
##
## Overall Statistics
##
## Accuracy : 0.9992
## 95% CI : (0.9979, 0.9998)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.999
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 0.9977 0.9975 1.0000
## Specificity 1.0000 0.9995 0.9998 1.0000 0.9998
## Pos Pred Value 1.0000 0.9979 0.9988 1.0000 0.9989
## Neg Pred Value 1.0000 1.0000 0.9995 0.9995 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1935 0.1739 0.1635 0.1837
## Detection Prevalence 0.2845 0.1939 0.1741 0.1635 0.1839
## Balanced Accuracy 1.0000 0.9997 0.9987 0.9988 0.9999
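Before concluding, it can be instructive to look at which covariates the forest leans on most. The call below is a standard randomForest diagnostic, shown as an optional inspection rather than part of the original pipeline; if time-stamp or window variables rank near the top, that is a hint the model may partly be exploiting the ordered structure of the data discussed earlier.
# Optional diagnostic: the most influential predictors in the forest
randomForest::varImpPlot(modFitRF, n.var = 15, main = "Top 15 predictors")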
Under normal conditions, more model tuning and ensembling would be used to further increase the accuracy of the final model used for prediction. However, the Random Forest model trained above is already highly accurate, with an in-sample accuracy of 1 and an out-of-sample accuracy of 0.9991843. The analysis will therefore not proceed further, but rather use this Random Forest model to predict the classes of the testing data. To not violate the Coursera Honor Code, the results are not printed here.
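For reference, a cross-validated tuning run with caret could look like the sketch below. The fold count and the mtry grid are illustrative choices, not values used in this report, and the chunk is best left unevaluated since the untuned forest already suffices.
# Illustrative sketch only: 5-fold cross-validation over a small mtry grid
ctrl <- trainControl(method = "cv", number = 5)
grid <- expand.grid(mtry = c(2, 8, 16, 32))
modFitRFcv <- train(classe ~ ., data = Training, method = "rf",
                    trControl = ctrl, tuneGrid = grid)
modFitRFcv$bestTune
Returning to the task at hand, the fitted forest from above is applied to the quiz test set.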
# randomForest requires matching factor levels between training and test data;
# rbind-ing one training row coerces the test-set levels, then it is dropped.
Testing <- rbind(Training[1, -58], Testing[, -58])
Testing <- Testing[-1, ]
# Stored rather than printed, in line with the Honor Code note above
quizPredictions <- predict(modFitRF, newdata = Testing)
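The rbind/drop manoeuvre above is a common workaround for randomForest's requirement that factor levels match between training and prediction data. An equivalent, more explicit alternative, sketched below rather than run in this report, is to re-level each factor column of the test set against its training counterpart before predicting.
# Alternative to the rbind trick: align factor levels column by column
for (col in names(Testing)) {
  if (is.factor(Training[[col]])) {
    Testing[[col]] <- factor(Testing[[col]], levels = levels(Training[[col]]))
  }
}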