I Overview

In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways.

The data consist of a training dataset and a testing dataset.

The goal of this project is to predict the manner in which the participants did the exercise, which is recorded in the “classe” variable in the training set. The dataset was cleaned, and the remaining variables were used to fit 3 prediction models. The model with the best accuracy on a held-out validation set was applied to the 20 test cases available in the testing data.

Note: The dataset used in this project is courtesy of “Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements”.

II Load Relevant Libraries

rm(list=ls())   # clear the workspace before loading the data sets
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)

III Getting, Cleaning and Exploring Data

# set the URL for the download of Training and Testing Dataset
urlTrain <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# download the Training and Testing datasets
training <- read.csv(url(urlTrain))
testing  <- read.csv(url(urlTest))
dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160
# create a validation dataset from the training dataset
set.seed(12345)   # fix the seed so the 70/30 split is reproducible
in_train  <- createDataPartition(training$classe, p=0.7, list=FALSE)
train_data <- training[in_train, ]
valid_data  <- training[-in_train, ]
dim(train_data)
## [1] 13737   160
dim(valid_data)
## [1] 5885  160
# remove the first seven columns (row index, user name, timestamps, window
# markers), which identify the record rather than measure the movement
train_data <- train_data[, -c(1:7)]
valid_data <- valid_data[, -c(1:7)]
dim(train_data)
## [1] 13737   153
dim(valid_data)
## [1] 5885  153
# remove variables with near-zero variance (identified on the training split, applied to both)
NZV <- nearZeroVar(train_data)
train_data <- train_data[, -NZV]
valid_data  <- valid_data[, -NZV]
dim(train_data)
## [1] 13737   100
dim(valid_data)
## [1] 5885  100
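# remove variables containing missing values, using the training split to
# choose the columns so both datasets keep exactly the same variables
complete_cols <- colSums(is.na(train_data)) == 0
train_data <- train_data[, complete_cols]
valid_data <- valid_data[, complete_cols]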
dim(train_data)
## [1] 13737    53
dim(valid_data)
## [1] 5885   53
# Plot correlation between variables to explore relationships
cor_matrix <- cor(train_data[, -53])
corrplot(cor_matrix, order = "FPC", method = "color", type = "lower", 
         tl.cex = 0.8, tl.col = rgb(0, 0, 0))

# identify highly correlated variables (absolute pairwise correlation above 0.7)
highly_correlated <- findCorrelation(cor_matrix, cutoff = 0.7)
colnames(cor_matrix)[highly_correlated]
##  [1] "accel_belt_z"      "roll_belt"         "accel_arm_y"      
##  [4] "accel_belt_y"      "total_accel_belt"  "yaw_belt"         
##  [7] "accel_dumbbell_z"  "accel_belt_x"      "pitch_belt"       
## [10] "magnet_dumbbell_x" "accel_dumbbell_y"  "magnet_dumbbell_y"
## [13] "accel_dumbbell_x"  "accel_arm_x"       "accel_arm_z"      
## [16] "magnet_arm_y"      "magnet_belt_y"     "accel_forearm_y"  
## [19] "gyros_forearm_y"   "gyros_arm_x"
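
The list above names the flagged columns but not the variables each one is correlated with. As a minimal sketch in base R, on a small made-up data frame (`toy`, `x1`, `x2`, `x3` are hypothetical and not part of the project data), the correlated pairs can be listed directly from a correlation matrix. The same indexing should carry over to `cor_matrix`, though that has not been run here.

```r
# Hypothetical toy data: x2 is almost a multiple of x1, x3 is unrelated.
toy <- data.frame(
  x1 = 1:10,
  x3 = rep(c(1, -1), times = 5)
)
toy$x2 <- 2 * toy$x1 + 0.5 * toy$x3

toy_cor <- cor(toy)

# Keep each pair once (upper triangle only) and apply the 0.7 cutoff.
high <- which(abs(toy_cor) > 0.7 & upper.tri(toy_cor), arr.ind = TRUE)

pairs_df <- data.frame(
  var1 = rownames(toy_cor)[high[, "row"]],
  var2 = colnames(toy_cor)[high[, "col"]],
  r    = toy_cor[high]
)
pairs_df
```

Only the `x1`/`x2` pair passes the cutoff in this toy case; `x3` is only weakly correlated with both.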

IV Prediction Model Building

Three methods are applied in the model-building process using the training dataset: Decision Tree, Random Forest, and Generalized Boosted Model, as presented below. Each model is evaluated on the validation dataset, and the one with the highest accuracy is selected and applied to the testing dataset for the final predictions.

a. Decision Tree Method

# model fit
set.seed(12345)
modFitDecTree <- rpart(classe ~ ., data=train_data, method="class")
fancyRpartPlot(modFitDecTree)

# prediction on Validation dataset
predictDecTree <- predict(modFitDecTree, newdata=valid_data, type="class")
confMatDecTree <- confusionMatrix(predictDecTree, as.factor(valid_data$classe))
confMatDecTree
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1557  248   16  107   45
##          B   30  602   89   38   70
##          C   48  170  832   83   79
##          D   23   69   68  658   75
##          E   16   50   21   78  813
## 
## Overall Statistics
##                                          
##                Accuracy : 0.7582         
##                  95% CI : (0.747, 0.7691)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.6924         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9301   0.5285   0.8109   0.6826   0.7514
## Specificity            0.9012   0.9522   0.9218   0.9522   0.9656
## Pos Pred Value         0.7892   0.7262   0.6865   0.7368   0.8313
## Neg Pred Value         0.9701   0.8938   0.9585   0.9387   0.9452
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2646   0.1023   0.1414   0.1118   0.1381
## Detection Prevalence   0.3353   0.1409   0.2059   0.1517   0.1662
## Balanced Accuracy      0.9157   0.7404   0.8664   0.8174   0.8585
# plot matrix results
plot(confMatDecTree$table, col = confMatDecTree$byClass, 
     main = paste("Decision Tree - Accuracy =",
                  round(confMatDecTree$overall['Accuracy'], 4)))
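
To make the reported statistics concrete, the sketch below rebuilds the Decision Tree confusion matrix from the printed output and recomputes accuracy and per-class sensitivity in base R. The object names (`dt_table`, `accuracy`, `sensitivity`) are illustrative; the counts are copied from the table above.

```r
# Rows are predictions, columns are reference classes, as in the printed output.
dt_table <- matrix(
  c(1557, 248,  16, 107,  45,
      30, 602,  89,  38,  70,
      48, 170, 832,  83,  79,
      23,  69,  68, 658,  75,
      16,  50,  21,  78, 813),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (the diagonal) over all cases.
accuracy <- sum(diag(dt_table)) / sum(dt_table)

# Sensitivity per class: correct predictions over the true class total.
sensitivity <- diag(dt_table) / colSums(dt_table)

round(accuracy, 4)     # 0.7582
round(sensitivity, 4)
```

Class B has the lowest sensitivity because 248 of its 1139 validation cases were predicted as class A.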

b. Random Forest Method

# model fit
set.seed(12345)
controlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
modFitRandForest <- train(classe ~ ., data=train_data, method="rf",
                          trControl=controlRF)
modFitRandForest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 2
## 
##         OOB estimate of  error rate: 0.68%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3904    2    0    0    0 0.0005120328
## B   16 2638    4    0    0 0.0075244545
## C    0   18 2373    5    0 0.0095993322
## D    0    0   40 2208    4 0.0195381883
## E    0    0    0    5 2520 0.0019801980
# prediction on validation dataset
predictRandForest <- predict(modFitRandForest, newdata=valid_data)
confMatRandForest <- confusionMatrix(predictRandForest, as.factor(valid_data$classe))
confMatRandForest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    0 1131    7    0    0
##          C    0    1 1019   13    0
##          D    0    0    0  951    2
##          E    1    0    0    0 1080
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9947          
##                  95% CI : (0.9925, 0.9964)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9933          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9930   0.9932   0.9865   0.9982
## Specificity            0.9983   0.9985   0.9971   0.9996   0.9998
## Pos Pred Value         0.9958   0.9938   0.9864   0.9979   0.9991
## Neg Pred Value         0.9998   0.9983   0.9986   0.9974   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1922   0.1732   0.1616   0.1835
## Detection Prevalence   0.2855   0.1934   0.1755   0.1619   0.1837
## Balanced Accuracy      0.9989   0.9958   0.9951   0.9931   0.9990
# plot matrix results
plot(confMatRandForest$table, col = confMatRandForest$byClass, 
     main = paste("Random Forest - Accuracy =",
                  round(confMatRandForest$overall['Accuracy'], 4)))
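
The Random Forest output gives two separate error estimates: the out-of-bag (OOB) error from the training data and the error on the validation set. As a rough check, the sketch below recomputes both from the numbers printed above. The object names are illustrative, and the matrix assumes the `randomForest` convention that rows are the actual classes.

```r
# OOB confusion matrix copied from modFitRandForest$finalModel (rows = actual class).
oob_table <- matrix(
  c(3904,    2,    0,    0,    0,
      16, 2638,    4,    0,    0,
       0,   18, 2373,    5,    0,
       0,    0,   40, 2208,    4,
       0,    0,    0,    5, 2520),
  nrow = 5, byrow = TRUE,
  dimnames = list(Actual = LETTERS[1:5], Predicted = LETTERS[1:5])
)

# OOB error: share of training cases misclassified by trees that did not see them.
oob_error <- 1 - sum(diag(oob_table)) / sum(oob_table)

# Per-class error, matching the class.error column.
class_error <- 1 - diag(oob_table) / rowSums(oob_table)

# Validation error from the reported accuracy of 0.9947.
validation_error <- 1 - 0.9947

round(100 * oob_error, 2)         # 0.68 (percent)
round(100 * validation_error, 2)  # 0.53 (percent)
```

The two estimates are close, which suggests that the validation accuracy is not an artifact of one particular split.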

c. Generalized Boosted Model

# model fit
set.seed(12345)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modFitGBM  <- train(classe ~ ., data=train_data, method = "gbm",
                    trControl = controlGBM, verbose = FALSE)
modFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 52 predictors of which 52 had non-zero influence.
# prediction on validation dataset
predictGBM <- predict(modFitGBM, newdata=valid_data)
confMatGBM <- confusionMatrix(predictGBM, as.factor(valid_data$classe))
confMatGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1648   37    0    0    1
##          B   20 1082   27    7    9
##          C    4   18  988   26    6
##          D    2    2    7  924   15
##          E    0    0    4    7 1051
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9674          
##                  95% CI : (0.9625, 0.9718)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9587          
##                                           
##  Mcnemar's Test P-Value : 1.767e-05       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9845   0.9500   0.9630   0.9585   0.9713
## Specificity            0.9910   0.9867   0.9889   0.9947   0.9977
## Pos Pred Value         0.9775   0.9450   0.9482   0.9726   0.9896
## Neg Pred Value         0.9938   0.9880   0.9922   0.9919   0.9936
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2800   0.1839   0.1679   0.1570   0.1786
## Detection Prevalence   0.2865   0.1946   0.1771   0.1614   0.1805
## Balanced Accuracy      0.9877   0.9683   0.9759   0.9766   0.9845
# plot matrix results
plot(confMatGBM$table, col = confMatGBM$byClass, 
     main = paste("GBM - Accuracy =", round(confMatGBM$overall['Accuracy'], 4)))
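
With all three models evaluated on the same validation set, the accuracies reported above can be compared side by side. The sketch below uses the printed values rather than the fitted objects, and the names `validation_accuracy`, `comparison`, and `best_model` are illustrative.

```r
# Validation accuracies copied from the three confusion matrix summaries.
validation_accuracy <- c(
  "Decision Tree"             = 0.7582,
  "Random Forest"             = 0.9947,
  "Generalized Boosted Model" = 0.9674
)

comparison <- data.frame(
  model    = names(validation_accuracy),
  accuracy = unname(validation_accuracy),
  error    = 1 - unname(validation_accuracy)  # estimated out-of-sample error
)

# Order from most to least accurate and pick the top model.
comparison <- comparison[order(-comparison$accuracy), ]
best_model <- as.character(comparison$model[1])

print(comparison, row.names = FALSE)
best_model
```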

V Applying the Selected Model to the Testing Dataset

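The results above show that the Random Forest model has the highest accuracy on the validation set (0.9947, versus 0.9674 for the Generalized Boosted Model and 0.7582 for the Decision Tree), which corresponds to an estimated out-of-sample error of about 0.5%. Hence, the Random Forest model is applied to predict the 20 quiz results using the testing dataset, as shown below.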

predictTEST <- predict(modFitRandForest, newdata=testing)
predictTEST
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
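
As a back-of-the-envelope check on what to expect from these 20 predictions, the sketch below uses the validation accuracy of 0.9947. It assumes the test cases are independent and that the validation accuracy carries over to them; both are assumptions made for illustration, not results from the analysis.

```r
rf_accuracy <- 0.9947   # validation accuracy of the Random Forest model
n_cases     <- 20       # number of cases in the testing dataset

# Probability that all 20 predictions are correct, under independence.
p_all_correct <- rf_accuracy^n_cases

# Expected number of misclassified cases among the 20.
expected_errors <- n_cases * (1 - rf_accuracy)

round(p_all_correct, 3)    # 0.899
round(expected_errors, 3)  # 0.106
```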