“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).”
Read more about the experiment and the where the data for this project comes from here
Using this machine learning algorithm, predictions can be made as to which “Class” of exercise is being done by the participant (variable = classe). After comparing three different models (Classification Tree, Random Forest, and Gradient Boosting Machine), it was found that a Random Forest model is best for this case with a 99.6% accuracy in predictions.
The training data for this project are available here
The test data are available here
library(caret)
library(rpart)
library(randomForest)
library(ggplot2)
library(gridExtra)
library(grid)
library(e1071)
library(knitr)
set.seed(42)
urlTrain<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTest<-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(urlTrain,destfile="pml-training.csv")
download.file(urlTest,destfile="pml-testing.csv")
data<-read.csv("pml-training.csv")
valid<-read.csv("pml-testing.csv")
Note that the “testing” data file has been renamed as the validation set and will not be used until the final predictions.
inTrain<-createDataPartition(y=data$classe,p=0.75,list=FALSE)
training<-data[inTrain,]
testing<-data[-inTrain,]
dim(training)
## [1] 14718 160
dim(testing)
## [1] 4904 160
dim(valid)
## [1] 20 160
To get a testing set separate from the final validation set, the initial trainging set has been spliced to give a train and test set.
First, the near zero variance variables are removed to shrink the data as they will have little impact ont he predictions.
##remove near zero variance variables
nzv<-nearZeroVar(training)
training<-training[,-nzv]
testing<-testing[,-nzv]
valid<-valid[,-nzv]
##check dimensions to ensure they all have the same no. of cols
dim(training)
## [1] 14718 105
dim(testing)
## [1] 4904 105
dim(valid)
## [1] 20 105
##check names of cols to see what is unnecessary
names(training)
## [1] "X" "user_name"
## [3] "raw_timestamp_part_1" "raw_timestamp_part_2"
## [5] "cvtd_timestamp" "num_window"
## [7] "roll_belt" "pitch_belt"
## [9] "yaw_belt" "total_accel_belt"
## [11] "max_roll_belt" "max_picth_belt"
## [13] "min_roll_belt" "min_pitch_belt"
## [15] "amplitude_roll_belt" "amplitude_pitch_belt"
## [17] "var_total_accel_belt" "avg_roll_belt"
## [19] "stddev_roll_belt" "var_roll_belt"
## [21] "avg_pitch_belt" "stddev_pitch_belt"
## [23] "var_pitch_belt" "avg_yaw_belt"
## [25] "stddev_yaw_belt" "var_yaw_belt"
## [27] "gyros_belt_x" "gyros_belt_y"
## [29] "gyros_belt_z" "accel_belt_x"
## [31] "accel_belt_y" "accel_belt_z"
## [33] "magnet_belt_x" "magnet_belt_y"
## [35] "magnet_belt_z" "roll_arm"
## [37] "pitch_arm" "yaw_arm"
## [39] "total_accel_arm" "var_accel_arm"
## [41] "gyros_arm_x" "gyros_arm_y"
## [43] "gyros_arm_z" "accel_arm_x"
## [45] "accel_arm_y" "accel_arm_z"
## [47] "magnet_arm_x" "magnet_arm_y"
## [49] "magnet_arm_z" "max_picth_arm"
## [51] "max_yaw_arm" "min_roll_arm"
## [53] "min_yaw_arm" "amplitude_yaw_arm"
## [55] "roll_dumbbell" "pitch_dumbbell"
## [57] "yaw_dumbbell" "max_roll_dumbbell"
## [59] "max_picth_dumbbell" "min_roll_dumbbell"
## [61] "min_pitch_dumbbell" "amplitude_roll_dumbbell"
## [63] "amplitude_pitch_dumbbell" "total_accel_dumbbell"
## [65] "var_accel_dumbbell" "avg_roll_dumbbell"
## [67] "stddev_roll_dumbbell" "var_roll_dumbbell"
## [69] "avg_pitch_dumbbell" "stddev_pitch_dumbbell"
## [71] "var_pitch_dumbbell" "avg_yaw_dumbbell"
## [73] "stddev_yaw_dumbbell" "var_yaw_dumbbell"
## [75] "gyros_dumbbell_x" "gyros_dumbbell_y"
## [77] "gyros_dumbbell_z" "accel_dumbbell_x"
## [79] "accel_dumbbell_y" "accel_dumbbell_z"
## [81] "magnet_dumbbell_x" "magnet_dumbbell_y"
## [83] "magnet_dumbbell_z" "roll_forearm"
## [85] "pitch_forearm" "yaw_forearm"
## [87] "max_roll_forearm" "max_picth_forearm"
## [89] "min_roll_forearm" "min_pitch_forearm"
## [91] "amplitude_roll_forearm" "amplitude_pitch_forearm"
## [93] "total_accel_forearm" "var_accel_forearm"
## [95] "avg_roll_forearm" "gyros_forearm_x"
## [97] "gyros_forearm_y" "gyros_forearm_z"
## [99] "accel_forearm_x" "accel_forearm_y"
## [101] "accel_forearm_z" "magnet_forearm_x"
## [103] "magnet_forearm_y" "magnet_forearm_z"
## [105] "classe"
Now, it can be seen from the column names remaining, columns 1-6 can be removed as they are just describing id numbers for participants and timestamps for the activity.
##remove the first 6 cols
training<-training[,-c(1:6)]
testing<-testing[,-c(1:6)]
valid<-valid[,-c(1:6)]
##check dimensions to ensure they match no. of cols
dim(training)
## [1] 14718 99
dim(testing)
## [1] 4904 99
dim(valid)
## [1] 20 99
Lastly, the columns that are mostly consisting of NA values will be removed as they will also have little impact on the predictions.
##determine which cols have over 95% NAs and remove them
MostNAs<-sapply(training, function(x) mean(is.na(x))) > 0.95
training<-training[,MostNAs==FALSE]
testing<-testing[,MostNAs==FALSE]
valid<-valid[,MostNAs==FALSE]
##check dimensions to ensure same no. of cols
dim(training)
## [1] 14718 53
dim(testing)
## [1] 4904 53
dim(valid)
## [1] 20 53
Now that the data has been pruned to an efficient size, three fit models will be compared to see which has the highest accuracy for the predictors left.
First, a Classification Tree Model is fitted with cross validation of 5 folds.
set.seed(42)
trControl<-trainControl(method="cv", number=5)
modFitCT<-train(classe~.,method="rpart",data=training,trControl=trControl)
predCT<-predict(modFitCT,testing)
confusionMatrix(predCT,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1257 380 386 379 123
## B 21 319 28 132 118
## C 117 250 441 293 246
## D 0 0 0 0 0
## E 0 0 0 0 414
##
## Overall Statistics
##
## Accuracy : 0.4957
## 95% CI : (0.4816, 0.5098)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3415
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9011 0.33614 0.51579 0.0000 0.45949
## Specificity 0.6386 0.92440 0.77624 1.0000 1.00000
## Pos Pred Value 0.4978 0.51618 0.32739 NaN 1.00000
## Neg Pred Value 0.9420 0.85301 0.88361 0.8361 0.89154
## Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Detection Rate 0.2563 0.06505 0.08993 0.0000 0.08442
## Detection Prevalence 0.5149 0.12602 0.27467 0.0000 0.08442
## Balanced Accuracy 0.7699 0.63027 0.64602 0.5000 0.72974
confusionMatrix(predCT,testing$classe)$overall[1]
## Accuracy
## 0.4957178
This model gives a 49.18% accuracy and the classification tree is seen in Figure 1 below
library(rattle)
fancyRpartPlot(modFitCT$finalModel)
Next, a Gradient Boosting Machine is fitted with cross validation of 5 folds.
modFitGBM<-train(classe~.,method="gbm",data=training,trControl=trControl,verbose=FALSE)
predGBM<-predict(modFitGBM,testing)
confusionMatrix(predGBM,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1372 35 0 2 2
## B 10 890 29 1 10
## C 7 18 821 30 4
## D 3 5 5 767 15
## E 3 1 0 4 870
##
## Overall Statistics
##
## Accuracy : 0.9625
## 95% CI : (0.9568, 0.9676)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9525
##
## Mcnemar's Test P-Value : 1.436e-09
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9835 0.9378 0.9602 0.9540 0.9656
## Specificity 0.9889 0.9874 0.9854 0.9932 0.9980
## Pos Pred Value 0.9724 0.9468 0.9330 0.9648 0.9909
## Neg Pred Value 0.9934 0.9851 0.9916 0.9910 0.9923
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2798 0.1815 0.1674 0.1564 0.1774
## Detection Prevalence 0.2877 0.1917 0.1794 0.1621 0.1790
## Balanced Accuracy 0.9862 0.9626 0.9728 0.9736 0.9818
confusionMatrix(predGBM,testing$classe)$overall[1]
## Accuracy
## 0.9624796
This model gives a 96.53% accuracy
Last, a Random Forest Model is fitted with of cross validation of 5 folds.
modFitRF<-randomForest(classe~.,data=training,trControl=trControl)
predRF<-predict(modFitRF,testing)
confusionMatrix(predRF,testing$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 4 0 0 0
## B 0 944 3 0 0
## C 0 1 851 13 0
## D 0 0 1 789 0
## E 0 0 0 2 901
##
## Overall Statistics
##
## Accuracy : 0.9951
## 95% CI : (0.9927, 0.9969)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9938
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9947 0.9953 0.9813 1.0000
## Specificity 0.9989 0.9992 0.9965 0.9998 0.9995
## Pos Pred Value 0.9971 0.9968 0.9838 0.9987 0.9978
## Neg Pred Value 1.0000 0.9987 0.9990 0.9964 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1925 0.1735 0.1609 0.1837
## Detection Prevalence 0.2853 0.1931 0.1764 0.1611 0.1841
## Balanced Accuracy 0.9994 0.9970 0.9959 0.9905 0.9998
confusionMatrix(predRF,testing$classe)$overall[1]
## Accuracy
## 0.995106
This model gives a 99.63 accuracy
The out-of-sample error can be calculated on the testing set by subtracting the accuracy from 1.
#Classification Tree Error
1-0.4918
## [1] 0.5082
#GBM Error
1-0.9653
## [1] 0.0347
#Random Forest Error
1-0.9963
## [1] 0.0037
All calculations of accuracy and out-of-sample error can be seen here in one table to compare.
| Model Type | Accuracy | Generalization Error |
|---|---|---|
| Classification Tree | 49.18% | 0.5082 |
| Gradient Boosting Machine | 96.53% | 0.0347 |
| Random Forest | 99.63% | 0.0037 |
All predictions were plotted next to the actual outcomes for a visual comparison and representation.
pTest<-qplot(testing$classe,main="Actual",xlab="Class Assignment",ylab="Frequency")
pCT<-qplot(predCT,main="Classification Tree",xlab="Class Predictions",ylab="Frequency")
pRF<-qplot(predRF,main="Random Forest",xlab="Class Predictions",ylab="Frequency")
pGBM<-qplot(predGBM,main="GBM",xlab="Class Predictions",ylab="Frequency")
grid.arrange(pTest,pCT,pRF,pGBM,ncol=2)
From Figures 2 & 3, it is clear the classification tree model is quite far from the actual and will obviously not be used. The Random Forest and GBM models are close in accuracy, but the Random Forest model will be chosen for it’s higher accuracy and will be used on the final validation set.
Using the Random Forest Model, the predictions on the validation set (the given original testing data) is as follows:
predFinal<-predict(modFitRF,newdata=valid)
predFinal
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E