Practical Machine Learning Project

Author

Jose Nicolas Molina

Introduction

The goal of this project is to predict the manner in which six participants performed barbell lifts, which they did correctly and incorrectly in five different ways. Data from accelerometers attached to the belt, forearm, upper arm, and dumbbell are used to determine how well each participant performed the lifts. The machine learning model developed in this report is applied to the 20 test cases in the test data, and the resulting predictions are submitted to the course prediction quiz for grading.

Data Loading and Packages

Define the download URLs

dataTrain <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"

dataTest <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

Load the datasets

training <- read.csv(url(dataTrain))
testing  <- read.csv(url(dataTest))

Load the required packages

library(knitr)
library(caret)
Loading required package: ggplot2
Loading required package: lattice
library(rpart)
library(rpart.plot)
library(rattle)
Loading required package: tibble
Loading required package: bitops
Rattle: A free graphical interface for data science with R.
Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
Type 'rattle()' to shake, rattle, and roll your data.
library(randomForest)
randomForest 4.7-1.1
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:rattle':

    importance
The following object is masked from 'package:ggplot2':

    margin
library(corrplot)
corrplot 0.92 loaded

Data Splitting

The original training data is split into a training set (70%, trainSet) for the modeling process and a validation set (30%, testSet). The downloaded testing data (20 cases) is reserved for the prediction quiz.

inTrain  <- createDataPartition(training$classe, p=0.7, list=FALSE)
trainSet <- training[inTrain, ]
testSet  <- training[-inTrain, ]
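
As a quick sanity check (not part of the original analysis), the class proportions of the two partitions can be compared; createDataPartition samples within each level of classe, so the proportions should be nearly identical:

# Outcome distribution in each partition; stratified sampling should
# make these proportions match closely.
round(prop.table(table(trainSet$classe)), 3)
round(prop.table(table(testSet$classe)), 3)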

Exploratory Analysis and Cleaning

dim(trainSet)
[1] 13737   160
dim(testSet)
[1] 5885  160

Note: The trainSet and testSet data sets each contain 160 variables. Variables that are mostly NA are removed, along with unnecessary variables such as those with near-zero variance and the identification variables.

Remove variables that are mostly NA

# Flag variables that are more than 95% NA and drop them
mostlyNA <- sapply(trainSet, function(x) mean(is.na(x))) > 0.95
trainSet <- trainSet[, mostlyNA==FALSE]
testSet  <- testSet[, mostlyNA==FALSE]
dim(trainSet)
[1] 13737    93
dim(testSet)
[1] 5885   93

Remove variables with near-zero variance

neZeVar <- nearZeroVar(trainSet)
trainSet <- trainSet[, -neZeVar]
testSet  <- testSet[, -neZeVar]
dim(trainSet)
[1] 13737    59
dim(testSet)
[1] 5885   59

Remove the identification variables (columns 1 to 5)

trainSet <- trainSet[, -(1:5)]
testSet  <- testSet[, -(1:5)]
dim(trainSet)
[1] 13737    54
dim(testSet)
[1] 5885   54

Note: After the cleaning process, 54 variables remain for the analysis.

Correlation Analysis

cMatrix <- cor(trainSet[, -54])
corrplot(cMatrix, order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))

Note: Highly correlated variables appear in dark colors in the plot.
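
As an optional follow-up (not part of the original analysis), caret's findCorrelation can name the predictors whose pairwise correlation exceeds a chosen cutoff, in case one wanted to drop them or use PCA preprocessing; the 0.8 cutoff below is an assumption:

# Variables with at least one pairwise correlation above 0.8;
# names = TRUE returns variable names instead of column indices.
highCorr <- findCorrelation(cMatrix, cutoff = 0.8, names = TRUE)
highCorr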

Prediction Model Building

Three classification methods are applied: Random Forest, Decision Tree, and Generalized Boosted Model (GBM). The model with the highest accuracy on the validation set (testSet) is used for the quiz predictions.

Random Forest

Model fit

# 3-fold cross-validation for the random forest fit
fitRfc <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRf <- train(classe ~ ., data=trainSet, method="rf",
                  trControl=fitRfc)
modFitRf$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: classification
                     Number of trees: 500
No. of variables tried at each split: 27

        OOB estimate of  error rate: 0.19%
Confusion matrix:
     A    B    C    D    E  class.error
A 3904    1    0    0    1 0.0005120328
B    5 2650    2    1    0 0.0030097818
C    0    5 2390    1    0 0.0025041736
D    0    0    8 2244    0 0.0035523979
E    0    0    0    2 2523 0.0007920792

Prediction on the validation set

predictRf <- predict(modFitRf, newdata=testSet)
confMatRf <- confusionMatrix(predictRf, as.factor(testSet$classe))
confMatRf
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1674    1    0    0    0
         B    0 1138    3    0    0
         C    0    0 1023    3    0
         D    0    0    0  960    0
         E    0    0    0    1 1082

Overall Statistics
                                          
               Accuracy : 0.9986          
                 95% CI : (0.9973, 0.9994)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9983          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   0.9991   0.9971   0.9959   1.0000
Specificity            0.9998   0.9994   0.9994   1.0000   0.9998
Pos Pred Value         0.9994   0.9974   0.9971   1.0000   0.9991
Neg Pred Value         1.0000   0.9998   0.9994   0.9992   1.0000
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2845   0.1934   0.1738   0.1631   0.1839
Detection Prevalence   0.2846   0.1939   0.1743   0.1631   0.1840
Balanced Accuracy      0.9999   0.9992   0.9982   0.9979   0.9999

Plot matrix results

plot(confMatRf$table, col = confMatRf$byClass, 
     main = paste("Random Forest - Accuracy =",
                  round(confMatRf$overall['Accuracy'], 4)))
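
As an optional check (not shown in the original report), caret's varImp ranks the predictors that drive the random forest's decisions:

# Scaled variable importance for the fitted random forest;
# plotting the result gives a ranked dot chart of the top predictors.
impRf <- varImp(modFitRf)
plot(impRf, top = 20)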

Decision Trees

Model fit

set.seed(12345)
modFitDTree <- rpart(classe ~ ., data=trainSet, method="class")
fancyRpartPlot(modFitDTree)

Prediction on the validation set

predictDTree <- predict(modFitDTree, newdata=testSet, type="class")
confMatDTree <- confusionMatrix(predictDTree, as.factor(testSet$classe))
confMatDTree
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1468  165   49   58    5
         B   99  710   63   26   19
         C   57  113  819   60   27
         D   46  146   95  759  166
         E    4    5    0   61  865

Overall Statistics
                                          
               Accuracy : 0.7852          
                 95% CI : (0.7745, 0.7957)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.7284          
                                          
 Mcnemar's Test P-Value : < 2.2e-16       

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.8769   0.6234   0.7982   0.7873   0.7994
Specificity            0.9342   0.9564   0.9471   0.9079   0.9854
Pos Pred Value         0.8413   0.7743   0.7612   0.6262   0.9251
Neg Pred Value         0.9502   0.9136   0.9570   0.9561   0.9562
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2494   0.1206   0.1392   0.1290   0.1470
Detection Prevalence   0.2965   0.1558   0.1828   0.2059   0.1589
Balanced Accuracy      0.9056   0.7899   0.8727   0.8476   0.8924

Plot matrix results

plot(confMatDTree$table, col = confMatDTree$byClass, 
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDTree$overall['Accuracy'], 4)))
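
One optional refinement (an assumption, not performed in the original analysis) is to inspect rpart's complexity-parameter table and prune the tree at the cp value with the lowest cross-validated error:

# Cross-validated error for each candidate complexity parameter
printcp(modFitDTree)
# Prune at the cp value that minimizes the cross-validated error (xerror)
bestCp <- modFitDTree$cptable[which.min(modFitDTree$cptable[, "xerror"]), "CP"]
prunedDTree <- prune(modFitDTree, cp = bestCp)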

Generalized Boosted Model

Model fit

set.seed(12345)
# 5-fold cross-validation (one repeat) for the boosted model
fitGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM <- train(classe ~ ., data=trainSet, method = "gbm",
                   trControl = fitGBM, verbose = FALSE)
modFitGBM$finalModel
A gradient boosted model with multinomial loss function.
150 iterations were performed.
There were 53 predictors of which 53 had non-zero influence.
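
The tuning parameters that caret selected can also be inspected (an optional check, not shown in the original report):

# Tuning combination (n.trees, interaction.depth, shrinkage, n.minobsinnode)
# chosen by cross-validated accuracy, plus accuracy across the whole grid
modFitGBM$bestTune
plot(modFitGBM)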

Prediction on the validation set

predictGBM <- predict(modFitGBM, newdata=testSet)
cfMatGBM <- confusionMatrix(predictGBM, as.factor(testSet$classe))
cfMatGBM
Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1671   11    0    0    0
         B    1 1118   21    1    1
         C    0    7  996   10    0
         D    2    3    6  951    5
         E    0    0    3    2 1076

Overall Statistics
                                          
               Accuracy : 0.9876          
                 95% CI : (0.9844, 0.9903)
    No Information Rate : 0.2845          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9843          
                                          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9982   0.9816   0.9708   0.9865   0.9945
Specificity            0.9974   0.9949   0.9965   0.9967   0.9990
Pos Pred Value         0.9935   0.9790   0.9832   0.9835   0.9954
Neg Pred Value         0.9993   0.9956   0.9938   0.9974   0.9988
Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
Detection Rate         0.2839   0.1900   0.1692   0.1616   0.1828
Detection Prevalence   0.2858   0.1941   0.1721   0.1643   0.1837
Balanced Accuracy      0.9978   0.9883   0.9836   0.9916   0.9967

Plot matrix results

plot(cfMatGBM$table, col = cfMatGBM$byClass, 
     main = paste("GBM - Accuracy =", round(cfMatGBM$overall['Accuracy'], 4)))

Prediction

The accuracy of the three models on the validation data set (testSet) is:

Random Forest : 0.9986

Decision Tree : 0.7852

GBM : 0.9876

Random Forest has the highest accuracy, with an expected out-of-sample error of about 0.14% (1 - 0.9986), so it is used to predict the 20 test cases of the prediction quiz.
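
This comparison can also be recomputed directly from the stored confusion-matrix objects (a convenience sketch, not in the original report):

# Collect the validation-set accuracy of each model in one data frame
accSummary <- data.frame(
  Model    = c("Random Forest", "Decision Tree", "GBM"),
  Accuracy = c(confMatRf$overall["Accuracy"],
               confMatDTree$overall["Accuracy"],
               cfMatGBM$overall["Accuracy"]))
accSummary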

predictTest <- predict(modFitRf, newdata=testing)
predictTest
 [1] B A B A A E D B A A B C B A E E A B B B
Levels: A B C D E
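
If each answer must be submitted as a separate file, a small helper can write one text file per prediction; the function name and file-name pattern below are hypothetical, not part of the original report:

# Hypothetical helper: write each prediction to its own file
# (problem_1.txt, problem_2.txt, ...) for submission.
writePredictionFiles <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictionFiles(predictTest)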