Brief Overview

“Six young healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).”

Read more about the experiment and where the data for this project comes from here

Using machine learning, predictions can be made as to which “Class” of exercise is being performed by a participant (variable = classe). After comparing three different models (Classification Tree, Random Forest, and Gradient Boosting Machine), a Random Forest model was found to be the best fit for this case, with 99.51% prediction accuracy on the held-out testing set.

The Data

The training data for this project are available here

The test data are available here

Loading the Data & Necessary Packages

library(caret)
library(rpart)
library(randomForest)
library(ggplot2)
library(gridExtra)
library(grid)
library(e1071)
library(knitr)
set.seed(42)
urlTrain<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain,destfile="pml-training.csv")
download.file(urlTest,destfile="pml-testing.csv")

data<-read.csv("pml-training.csv")
valid<-read.csv("pml-testing.csv")

Note that the “testing” data file has been renamed as the validation set and will not be used until the final predictions.
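
As a quick sanity check (a small sketch added for illustration, not part of the original analysis), the validation file should contain the same predictor columns as the training file, with a problem_id column expected in place of classe:

##columns present in one file but not the other
setdiff(names(data),names(valid))
setdiff(names(valid),names(data))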

Slicing

inTrain<-createDataPartition(y=data$classe,p=0.75,list=FALSE)
training<-data[inTrain,]
testing<-data[-inTrain,]
dim(training)
## [1] 14718   160
dim(testing)
## [1] 4904  160
dim(valid)
## [1]  20 160

To get a testing set separate from the final validation set, the initial training data has been split into a train set (75%) and a test set (25%).
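
Since createDataPartition samples within each level of classe, the class proportions should be nearly identical across the two slices. A quick check (a sketch added for illustration):

##compare class proportions across the train and test slices
round(prop.table(table(training$classe)),3)
round(prop.table(table(testing$classe)),3)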

Pre-Processing

First, the near zero variance variables are removed to shrink the data, as they will have little impact on the predictions.

##remove near zero variance variables
nzv<-nearZeroVar(training)
training<-training[,-nzv]
testing<-testing[,-nzv]
valid<-valid[,-nzv]
##check dimensions to ensure they all have the same no. of cols
dim(training)
## [1] 14718   105
dim(testing)
## [1] 4904  105
dim(valid)
## [1]  20 105
##check names of cols to see what is unnecessary 
names(training)
##   [1] "X"                        "user_name"               
##   [3] "raw_timestamp_part_1"     "raw_timestamp_part_2"    
##   [5] "cvtd_timestamp"           "num_window"              
##   [7] "roll_belt"                "pitch_belt"              
##   [9] "yaw_belt"                 "total_accel_belt"        
##  [11] "max_roll_belt"            "max_picth_belt"          
##  [13] "min_roll_belt"            "min_pitch_belt"          
##  [15] "amplitude_roll_belt"      "amplitude_pitch_belt"    
##  [17] "var_total_accel_belt"     "avg_roll_belt"           
##  [19] "stddev_roll_belt"         "var_roll_belt"           
##  [21] "avg_pitch_belt"           "stddev_pitch_belt"       
##  [23] "var_pitch_belt"           "avg_yaw_belt"            
##  [25] "stddev_yaw_belt"          "var_yaw_belt"            
##  [27] "gyros_belt_x"             "gyros_belt_y"            
##  [29] "gyros_belt_z"             "accel_belt_x"            
##  [31] "accel_belt_y"             "accel_belt_z"            
##  [33] "magnet_belt_x"            "magnet_belt_y"           
##  [35] "magnet_belt_z"            "roll_arm"                
##  [37] "pitch_arm"                "yaw_arm"                 
##  [39] "total_accel_arm"          "var_accel_arm"           
##  [41] "gyros_arm_x"              "gyros_arm_y"             
##  [43] "gyros_arm_z"              "accel_arm_x"             
##  [45] "accel_arm_y"              "accel_arm_z"             
##  [47] "magnet_arm_x"             "magnet_arm_y"            
##  [49] "magnet_arm_z"             "max_picth_arm"           
##  [51] "max_yaw_arm"              "min_roll_arm"            
##  [53] "min_yaw_arm"              "amplitude_yaw_arm"       
##  [55] "roll_dumbbell"            "pitch_dumbbell"          
##  [57] "yaw_dumbbell"             "max_roll_dumbbell"       
##  [59] "max_picth_dumbbell"       "min_roll_dumbbell"       
##  [61] "min_pitch_dumbbell"       "amplitude_roll_dumbbell" 
##  [63] "amplitude_pitch_dumbbell" "total_accel_dumbbell"    
##  [65] "var_accel_dumbbell"       "avg_roll_dumbbell"       
##  [67] "stddev_roll_dumbbell"     "var_roll_dumbbell"       
##  [69] "avg_pitch_dumbbell"       "stddev_pitch_dumbbell"   
##  [71] "var_pitch_dumbbell"       "avg_yaw_dumbbell"        
##  [73] "stddev_yaw_dumbbell"      "var_yaw_dumbbell"        
##  [75] "gyros_dumbbell_x"         "gyros_dumbbell_y"        
##  [77] "gyros_dumbbell_z"         "accel_dumbbell_x"        
##  [79] "accel_dumbbell_y"         "accel_dumbbell_z"        
##  [81] "magnet_dumbbell_x"        "magnet_dumbbell_y"       
##  [83] "magnet_dumbbell_z"        "roll_forearm"            
##  [85] "pitch_forearm"            "yaw_forearm"             
##  [87] "max_roll_forearm"         "max_picth_forearm"       
##  [89] "min_roll_forearm"         "min_pitch_forearm"       
##  [91] "amplitude_roll_forearm"   "amplitude_pitch_forearm" 
##  [93] "total_accel_forearm"      "var_accel_forearm"       
##  [95] "avg_roll_forearm"         "gyros_forearm_x"         
##  [97] "gyros_forearm_y"          "gyros_forearm_z"         
##  [99] "accel_forearm_x"          "accel_forearm_y"         
## [101] "accel_forearm_z"          "magnet_forearm_x"        
## [103] "magnet_forearm_y"         "magnet_forearm_z"        
## [105] "classe"

From the remaining column names, it can be seen that columns 1-6 can be removed, as they only hold row and participant identifiers, timestamps, and window numbers rather than sensor measurements.

##remove the first 6 cols
training<-training[,-c(1:6)]
testing<-testing[,-c(1:6)]
valid<-valid[,-c(1:6)]
##check dimensions to ensure they match no. of cols
dim(training)
## [1] 14718    99
dim(testing)
## [1] 4904   99
dim(valid)
## [1] 20 99

Lastly, the columns consisting mostly of NA values will be removed, as they will also have little impact on the predictions.

##determine which cols have over 95% NAs and remove them 
MostNAs<-sapply(training, function(x) mean(is.na(x))) > 0.95
training<-training[,MostNAs==FALSE]
testing<-testing[,MostNAs==FALSE]
valid<-valid[,MostNAs==FALSE]
##check dimensions to ensure same no. of cols
dim(training)
## [1] 14718    53
dim(testing)
## [1] 4904   53
dim(valid)
## [1] 20 53
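
As a final check before modelling (a small sketch added for illustration), the pruned sets should now contain no missing values at all:

##confirm no NAs remain after pruning
sum(is.na(training))
sum(is.na(testing))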

Fitting Model Comparison

Now that the data has been pruned to an efficient size, three models will be fitted and compared to see which gives the highest accuracy with the remaining predictors.

Classification Tree Model

First, a Classification Tree model is fitted with 5-fold cross validation.

set.seed(42)
trControl<-trainControl(method="cv", number=5)
modFitCT<-train(classe~.,method="rpart",data=training,trControl=trControl)
predCT<-predict(modFitCT,testing)
confusionMatrix(predCT,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1257  380  386  379  123
##          B   21  319   28  132  118
##          C  117  250  441  293  246
##          D    0    0    0    0    0
##          E    0    0    0    0  414
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4957          
##                  95% CI : (0.4816, 0.5098)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3415          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9011  0.33614  0.51579   0.0000  0.45949
## Specificity            0.6386  0.92440  0.77624   1.0000  1.00000
## Pos Pred Value         0.4978  0.51618  0.32739      NaN  1.00000
## Neg Pred Value         0.9420  0.85301  0.88361   0.8361  0.89154
## Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
## Detection Rate         0.2563  0.06505  0.08993   0.0000  0.08442
## Detection Prevalence   0.5149  0.12602  0.27467   0.0000  0.08442
## Balanced Accuracy      0.7699  0.63027  0.64602   0.5000  0.72974
confusionMatrix(predCT,testing$classe)$overall[1]
##  Accuracy 
## 0.4957178

This model gives only 49.57% accuracy; the fitted classification tree is shown in Figure 1 below.

Figure 1: Classification Tree Plot

library(rattle)
fancyRpartPlot(modFitCT$finalModel)

Gradient Boosting Machine Model

Next, a Gradient Boosting Machine model is fitted with the same 5-fold cross validation.

modFitGBM<-train(classe~.,method="gbm",data=training,trControl=trControl,verbose=FALSE)
predGBM<-predict(modFitGBM,testing)
confusionMatrix(predGBM,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1372   35    0    2    2
##          B   10  890   29    1   10
##          C    7   18  821   30    4
##          D    3    5    5  767   15
##          E    3    1    0    4  870
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9625          
##                  95% CI : (0.9568, 0.9676)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9525          
##                                           
##  Mcnemar's Test P-Value : 1.436e-09       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9835   0.9378   0.9602   0.9540   0.9656
## Specificity            0.9889   0.9874   0.9854   0.9932   0.9980
## Pos Pred Value         0.9724   0.9468   0.9330   0.9648   0.9909
## Neg Pred Value         0.9934   0.9851   0.9916   0.9910   0.9923
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2798   0.1815   0.1674   0.1564   0.1774
## Detection Prevalence   0.2877   0.1917   0.1794   0.1621   0.1790
## Balanced Accuracy      0.9862   0.9626   0.9728   0.9736   0.9818
confusionMatrix(predGBM,testing$classe)$overall[1]
##  Accuracy 
## 0.9624796

This model gives 96.25% accuracy.

Random Forest Model

Last, a Random Forest model is fitted. Unlike the two train() fits above, randomForest() is called directly; it estimates its error internally from out-of-bag (OOB) samples, so the 5-fold trainControl object is not needed here.

##randomForest resamples internally (OOB), so no trControl is passed
modFitRF<-randomForest(classe~.,data=training)
predRF<-predict(modFitRF,testing)
confusionMatrix(predRF,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    4    0    0    0
##          B    0  944    3    0    0
##          C    0    1  851   13    0
##          D    0    0    1  789    0
##          E    0    0    0    2  901
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9951          
##                  95% CI : (0.9927, 0.9969)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9938          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9947   0.9953   0.9813   1.0000
## Specificity            0.9989   0.9992   0.9965   0.9998   0.9995
## Pos Pred Value         0.9971   0.9968   0.9838   0.9987   0.9978
## Neg Pred Value         1.0000   0.9987   0.9990   0.9964   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1925   0.1735   0.1609   0.1837
## Detection Prevalence   0.2853   0.1931   0.1764   0.1611   0.1841
## Balanced Accuracy      0.9994   0.9970   0.9959   0.9905   0.9998
confusionMatrix(predRF,testing$classe)$overall[1]
## Accuracy 
## 0.995106

This model gives 99.51% accuracy, the highest of the three.

Generalization Error

The out-of-sample error can be calculated on the testing set by subtracting the accuracy from 1.

#Classification Tree Error
1-0.4957
## [1] 0.5043
#GBM Error
1-0.9625
## [1] 0.0375
#Random Forest Error
1-0.9951
## [1] 0.0049
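
Rather than hard-coding the accuracies, the same errors can be computed directly from the stored predictions; a minimal sketch using the fits above:

##pull accuracy from each confusion matrix and convert to error
accCT<-confusionMatrix(predCT,testing$classe)$overall[["Accuracy"]]
accGBM<-confusionMatrix(predGBM,testing$classe)$overall[["Accuracy"]]
accRF<-confusionMatrix(predRF,testing$classe)$overall[["Accuracy"]]
1-c(CT=accCT,GBM=accGBM,RF=accRF)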

All accuracy and out-of-sample error results are collected in one table below for comparison.

Figure 2: Table Comparison of Model Types

Model Type                  Accuracy  Generalization Error
Classification Tree         49.57%    0.5043
Gradient Boosting Machine   96.25%    0.0375
Random Forest               99.51%    0.0049
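
The same table can be built programmatically with knitr, which is already loaded. A minimal sketch, assuming the accuracy objects accCT, accGBM, and accRF from the chunk above:

##assemble the comparison table from the computed accuracies
comparison<-data.frame(
        "Model Type"=c("Classification Tree","Gradient Boosting Machine","Random Forest"),
        "Accuracy"=sprintf("%.2f%%",100*c(accCT,accGBM,accRF)),
        "Generalization Error"=round(1-c(accCT,accGBM,accRF),4),
        check.names=FALSE)
kable(comparison)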

Visual Comparison

All predictions are plotted next to the actual outcomes for a visual comparison.

Figure 3: Plot Comparison of Model Predictions vs. Actual

pTest<-qplot(testing$classe,main="Actual",xlab="Class Assignment",ylab="Frequency")
pCT<-qplot(predCT,main="Classification Tree",xlab="Class Predictions",ylab="Frequency")
pRF<-qplot(predRF,main="Random Forest",xlab="Class Predictions",ylab="Frequency")
pGBM<-qplot(predGBM,main="GBM",xlab="Class Predictions",ylab="Frequency")

grid.arrange(pTest,pCT,pRF,pGBM,ncol=2)

From Figures 2 & 3, it is clear that the classification tree model falls far from the actual outcomes and will not be used. The Random Forest and GBM models are close in accuracy, but the Random Forest model is chosen for its higher accuracy and will be applied to the final validation set.

Final Predictions

Using the Random Forest model, the predictions on the validation set (the given original testing data) are as follows:

predFinal<-predict(modFitRF,newdata=valid)
predFinal
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E
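
For submission or record-keeping, the predictions can be written out alongside their problem numbers; a minimal sketch (the output file name is arbitrary):

##save the final predictions; "final-predictions.csv" is a hypothetical name
write.csv(data.frame(problem=seq_along(predFinal),prediction=predFinal),
        "final-predictions.csv",row.names=FALSE)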