This Human Activity Recognition (HAR) analysis focuses on recognizing the quality of execution of weight-lifting exercises. The approach investigates “how (well)” an activity is performed by the participant.
Six young participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: according to specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E). Class A corresponds to the specified (ideal) execution of the exercise, while the other 4 classes correspond to common mistakes.
Using the classe variable as the outcome, and other variables in the training dataset as predictors, the goal of this machine learning analysis is to predict the manner in which the participants did the exercise.
More information about the research is available here (see the section on the Weight Lifting Exercise Dataset).
The analysis uses csv-formatted datasets available through the links pml-training and pml-testing. Both sets come from this source.
The pml-training dataset consists of 19,622 records, which will be divided into two sets: a training set used to fit the prediction model, and a testing set used to validate the model and estimate its accuracy. The pml-testing dataset consists of 20 new observations to which the selected algorithm will be applied to predict the quality of performing the weight-lifting exercise.
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./pml-training.csv"
testFile <- "./pml-testing.csv"
# Download each file only if it is not already present locally
if (!file.exists(trainFile)) {
    download.file(trainURL, destfile = trainFile)
}
if (!file.exists(testFile)) {
    download.file(testURL, destfile = testFile)
}
To train the best possible model, the datasets first need to be pre-processed for missing values. Hence, the datasets are read while treating “NA”, “#DIV/0!”, and blank fields as NA/missing values that R recognizes during the analysis.
pmlTrain <- read.csv(file = trainFile, na.strings = c("NA", "#DIV/0!", ""))
pmlTest <- read.csv(file = testFile, na.strings = c("NA", "#DIV/0!", ""))
pmlTrain$classe <- factor(pmlTrain$classe)  # read.csv no longer creates factors by default in R >= 4.0
A bar chart of the classe variable is built on the pml-training dataset to get a first look at how the values of this variable are distributed across the dataset.
library(ggplot2)
g <- ggplot(pmlTrain, aes(x = classe)) +
    geom_bar(alpha = 0.20, colour = "black") +  # geom_bar() handles a categorical x; geom_histogram() requires a continuous variable
    labs(x = "Classe", y = "Frequency") +
    scale_y_continuous(breaks = seq(0, 5000, 500)) +
    theme(plot.title = element_text(size = 14, face = "bold", colour = "black", vjust = +1)) +
    ggtitle(expression(atop("Frequency of the Classe outcome",
                            atop(italic("Training dataset")))))
g
The chart above shows that the classe variable has five possible values: A, B, C, D, and E. Class A, which represents the ideal weight-lifting fashion, has the highest number of observations (around 5,500), whereas the other classes have broadly similar counts (around 3,500 each).
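For exact class counts, rather than reading them off the chart, a quick tabulation of the outcome works:
table(pmlTrain$classe)  # counts of each classe value in the full pml-training dataset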
Using head(pmlTrain) or str(pmlTrain) in R, we can spot variables that do not contribute to the outcome classification and could even make the model less accurate. Hence, variables containing NA values will be removed using a feature-slice index: a character vector of the names of the valid, NA-free variables, which will be applied to both the training and the testing datasets. In addition, the first seven columns (record index, user name, timestamps, and window markers) are irrelevant to the outcome and can also be removed.
featureSlice <- colnames(pmlTrain[colSums(is.na(pmlTrain)) == 0])  # keep only columns with no missing values
featureSlice <- featureSlice[-c(1:7)]  # drop the first seven (identifier/timestamp) columns
To train our prediction model and then test its accuracy, we split the pml-training dataset into training and testing subsets (70/30 ratio) while applying the feature-slice filter.
library(caret)
set.seed(3030)
inTrain <- createDataPartition(y = pmlTrain$classe, p = 0.7, list = FALSE)
training <- pmlTrain[inTrain, featureSlice]
testing <- pmlTrain[-inTrain, featureSlice]
dim(training); dim(testing)
## [1] 13737 53
## [1] 5885 53
Both resulting datasets have 53 variables, with 13,737 observations for training and 5,885 for testing.
It is good practice to make sure the training data does not include predictors with no variability, i.e. predictors that take one or very few unique values relative to the number of observations. Such predictors are flagged by the nzv column of the nearZeroVar() results.
nearZeroVar(training, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## roll_belt 1.119427 7.98573196 FALSE FALSE
## pitch_belt 1.141791 12.21518527 FALSE FALSE
## yaw_belt 1.111437 13.15425493 FALSE FALSE
## total_accel_belt 1.074377 0.20382907 FALSE FALSE
## gyros_belt_x 1.079533 0.90267162 FALSE FALSE
## gyros_belt_y 1.168469 0.47317464 FALSE FALSE
## gyros_belt_z 1.051241 1.18657640 FALSE FALSE
## accel_belt_x 1.089552 1.15745796 FALSE FALSE
## accel_belt_y 1.103226 1.01186576 FALSE FALSE
## accel_belt_z 1.090461 2.09652763 FALSE FALSE
## magnet_belt_x 1.027778 2.20572177 FALSE FALSE
## magnet_belt_y 1.061269 2.10380724 FALSE FALSE
## magnet_belt_z 1.060423 3.13751183 FALSE FALSE
## roll_arm 50.893617 17.62393536 FALSE FALSE
## pitch_arm 77.193548 20.27371333 FALSE FALSE
## yaw_arm 33.690141 19.03617966 FALSE FALSE
## total_accel_arm 1.007911 0.47317464 FALSE FALSE
## gyros_arm_x 1.028090 4.59343379 FALSE FALSE
## gyros_arm_y 1.391076 2.69345563 FALSE FALSE
## gyros_arm_z 1.042216 1.72526753 FALSE FALSE
## accel_arm_x 1.051724 5.55434229 FALSE FALSE
## accel_arm_y 1.032895 3.81451554 FALSE FALSE
## accel_arm_z 1.043011 5.53978307 FALSE FALSE
## magnet_arm_x 1.000000 9.59452573 FALSE FALSE
## magnet_arm_y 1.031250 6.23862561 FALSE FALSE
## magnet_arm_z 1.062500 9.14318993 FALSE FALSE
## roll_dumbbell 1.076923 86.70015287 FALSE FALSE
## pitch_dumbbell 2.395604 84.38523695 FALSE FALSE
## yaw_dumbbell 1.010989 85.99403072 FALSE FALSE
## total_accel_dumbbell 1.087185 0.30574361 FALSE FALSE
## gyros_dumbbell_x 1.002262 1.70342870 FALSE FALSE
## gyros_dumbbell_y 1.231678 1.96549465 FALSE FALSE
## gyros_dumbbell_z 1.019139 1.41224430 FALSE FALSE
## accel_dumbbell_x 1.021097 2.99191963 FALSE FALSE
## accel_dumbbell_y 1.034884 3.31222246 FALSE FALSE
## accel_dumbbell_z 1.120690 2.93368275 FALSE FALSE
## magnet_dumbbell_x 1.081301 7.74550484 FALSE FALSE
## magnet_dumbbell_y 1.213115 6.03479653 FALSE FALSE
## magnet_dumbbell_z 1.094488 4.84822014 FALSE FALSE
## roll_forearm 11.269710 13.60559074 FALSE FALSE
## pitch_forearm 63.162791 19.16721264 FALSE FALSE
## yaw_forearm 15.970588 12.97954430 FALSE FALSE
## total_accel_forearm 1.161176 0.48773386 FALSE FALSE
## gyros_forearm_x 1.113208 2.04557036 FALSE FALSE
## gyros_forearm_y 1.018519 5.20492102 FALSE FALSE
## gyros_forearm_z 1.201893 2.13292568 FALSE FALSE
## accel_forearm_x 1.138462 5.65625682 FALSE FALSE
## accel_forearm_y 1.026667 7.09761957 FALSE FALSE
## accel_forearm_z 1.009259 4.04746306 FALSE FALSE
## magnet_forearm_x 1.072727 10.61367111 FALSE FALSE
## magnet_forearm_y 1.033333 13.22705103 FALSE FALSE
## magnet_forearm_z 1.023256 11.79296790 FALSE FALSE
## classe 1.469526 0.03639805 FALSE FALSE
As shown above, every selected predictor has an nzv value of FALSE, which indicates that all of them have reasonable variability in the dataset.
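Had any predictor been flagged, nearZeroVar() called without saveMetrics = TRUE returns the offending column positions directly, so they could be dropped programmatically. A minimal sketch (a no-op here, since nothing is flagged):
nzvIdx <- nearZeroVar(training)  # integer vector of near-zero-variance column positions
if (length(nzvIdx) > 0) {
    training <- training[, -nzvIdx]  # drop any flagged predictors
}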
Since the outcome (classe) is categorical, our algorithm should be a classification method rather than a regression method; candidates include generalized linear models (glm), classification trees, and random forests. However, glm is designed for binary outcomes, so it would require further, more complicated preprocessing of the outcome variable, whereas classe takes five different values.
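For completeness, the multi-class analogue of glm is multinomial logistic regression, which caret exposes as method = "multinom" (from the nnet package). The sketch below shows what that alternative would look like; it is not pursued further in this analysis:
# Sketch only: a multi-class glm analogue, not used in the remainder of the analysis
modGlm <- train(classe ~ ., data = training, method = "multinom", trace = FALSE,
                trControl = trainControl(method = "cv", number = 5))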
Hence, we start by deploying the classification tree algorithm and testing its accuracy in predicting the outcome.
To obtain a more reliable accuracy estimate, we train the model using caret's built-in cross-validation option, here with 5 resampling folds. The output below shows the model produced by training with the classification tree (rpart) algorithm.
set.seed(1320)
modFit1 <- train(classe ~ ., data = training,
                 method = "rpart",
                 trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE))
modFit1$finalModel
## n= 13737
##
## node), split, n, loss, yval, (yprob)
## * denotes terminal node
##
## 1) root 13737 9831 A (0.28 0.19 0.17 0.16 0.18)
## 2) roll_belt< 130.5 12582 8686 A (0.31 0.21 0.19 0.18 0.11)
## 4) pitch_forearm< -33.15 1109 10 A (0.99 0.009 0 0 0) *
## 5) pitch_forearm>=-33.15 11473 8676 A (0.24 0.23 0.21 0.2 0.12)
## 10) magnet_dumbbell_y< 436.5 9665 6925 A (0.28 0.18 0.24 0.19 0.11)
## 20) roll_forearm< 123.5 6014 3553 A (0.41 0.18 0.18 0.17 0.06) *
## 21) roll_forearm>=123.5 3651 2429 C (0.076 0.18 0.33 0.23 0.18) *
## 11) magnet_dumbbell_y>=436.5 1808 899 B (0.032 0.5 0.041 0.23 0.19) *
## 3) roll_belt>=130.5 1155 10 E (0.0087 0 0 0 0.99) *
After training the model, we evaluate it on the held-out testing dataset and compare its predictions against the actual outcomes using a confusion matrix.
treePred <- predict(modFit1, testing)
treeCM <- confusionMatrix(treePred, testing$classe)
treeCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1514 474 486 443 161
## B 38 393 35 162 145
## C 118 272 505 359 290
## D 0 0 0 0 0
## E 4 0 0 0 486
##
## Overall Statistics
##
## Accuracy : 0.4924
## 95% CI : (0.4796, 0.5053)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3363
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9044 0.34504 0.49220 0.0000 0.44917
## Specificity 0.6286 0.91993 0.78617 1.0000 0.99917
## Pos Pred Value 0.4919 0.50841 0.32707 NaN 0.99184
## Neg Pred Value 0.9430 0.85407 0.87998 0.8362 0.88953
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2573 0.06678 0.08581 0.0000 0.08258
## Detection Prevalence 0.5230 0.13135 0.26236 0.0000 0.08326
## Balanced Accuracy 0.7665 0.63249 0.63919 0.5000 0.72417
modFit1Acc <- round(as.numeric(treeCM$overall[1]), 4)
modFit1Err <- round(1 - modFit1Acc, 4)
modFit1Acc; modFit1Err
## [1] 0.4924
## [1] 0.5076
Unfortunately, the confusion matrix reveals a very low accuracy (0.4924). Since the out-of-sample (generalization) error equals 1 - accuracy, the out-of-sample error is estimated to be high, at 0.5076. This implies that the classification tree is a weak prediction algorithm for this dataset, and we need to look for another type of algorithm.
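A complementary view of the expected out-of-sample performance comes from the cross-validation folds themselves; caret stores the per-fold results in the fitted object:
modFit1$resample  # per-fold Accuracy and Kappa from the 5-fold cross-validation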
The random forest algorithm is known for its high predictive accuracy: it grows many classification trees and lets them vote on the class of each observation. Hence, the next step is to train our model using random forest, again with 5-fold cross-validation.
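Random forest training can be slow, and allowParallel = TRUE in trainControl() only has an effect when a parallel backend is registered. A minimal sketch, assuming the doParallel package is installed (the worker count is an arbitrary choice):
library(doParallel)
cl <- parallel::makePSOCKcluster(max(1, parallel::detectCores() - 1))
registerDoParallel(cl)  # train() below can now run its resampling folds in parallel
# ... run train() ...
stopCluster(cl)
registerDoSEQ()  # restore sequential processing when done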
set.seed(1400)
modFit2 <- train(classe ~ ., data = training,
                 method = "rf",
                 trControl = trainControl(method = "cv", number = 5, allowParallel = TRUE))
modFit2
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10989, 10991, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa Accuracy SD Kappa SD
## 2 0.9908282 0.9883973 0.003549429 0.004491519
## 27 0.9900273 0.9873838 0.001809959 0.002290599
## 52 0.9842752 0.9801036 0.005004105 0.006335028
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
It is worth seeing how the predictors rank in terms of importance after training the model. The listing and plot below show the top 20 predictors.
varsImp <- varImp(modFit2, scale = FALSE)
varsImp
## rf variable importance
##
## only 20 most important variables shown (out of 52)
##
## Overall
## roll_belt 489.9
## yaw_belt 417.8
## magnet_dumbbell_z 377.4
## magnet_dumbbell_y 365.9
## pitch_belt 339.4
## pitch_forearm 328.0
## magnet_dumbbell_x 308.2
## roll_forearm 299.8
## accel_belt_z 277.4
## accel_dumbbell_y 276.6
## magnet_belt_z 275.7
## roll_dumbbell 264.4
## magnet_belt_y 258.8
## accel_dumbbell_z 242.4
## roll_arm 236.5
## accel_forearm_x 232.4
## gyros_belt_z 217.1
## total_accel_dumbbell 214.0
## accel_arm_x 209.8
## gyros_dumbbell_y 206.9
plot(varsImp, top = 20, main = "Variable Importance")
As with the previous algorithm, we validate the random forest model on the held-out testing dataset and compare its predictions against the actual outcomes using the confusion matrix.
set.seed(1420)
rfPred <- predict(modFit2, testing)
rfCM <- confusionMatrix(rfPred, testing$classe)
rfCM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 9 0 0 0
## B 0 1129 2 0 0
## C 0 1 1021 26 0
## D 0 0 3 938 2
## E 1 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9925
## 95% CI : (0.99, 0.9946)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9905
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9912 0.9951 0.9730 0.9982
## Specificity 0.9979 0.9996 0.9944 0.9990 0.9998
## Pos Pred Value 0.9946 0.9982 0.9742 0.9947 0.9991
## Neg Pred Value 0.9998 0.9979 0.9990 0.9947 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1918 0.1735 0.1594 0.1835
## Detection Prevalence 0.2858 0.1922 0.1781 0.1602 0.1837
## Balanced Accuracy 0.9986 0.9954 0.9948 0.9860 0.9990
modFit2Acc <- round(as.numeric(rfCM$overall[1]), 4)
modFit2Err <- round(1 - modFit2Acc, 4)
modFit2Acc; modFit2Err
## [1] 0.9925
## [1] 0.0075
Here, the confusion matrix reveals a very high accuracy (0.9925). Since the out-of-sample (generalization) error equals 1 - accuracy, the out-of-sample error is estimated to be very low, at 0.0075. This implies that the random forest is a strong prediction algorithm for this dataset and can be used to predict on new data.
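To summarize the comparison, the accuracy and estimated out-of-sample error computed above for each model can be placed side by side:
data.frame(model = c("Classification tree (rpart)", "Random forest (rf)"),
           accuracy = c(modFit1Acc, modFit2Acc),
           oos.error = c(modFit1Err, modFit2Err))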
The selected model will now be used to predict the outcome for the pml-testing dataset of 20 new observations. Since the columns of the new dataset must be identical to those used to train the model, we first rename its last column (problem_id) to “classe” and then apply the prediction.
lastColNumber <- length(colnames(pmlTest))
colnames(pmlTest)[lastColNumber] <- "classe"  # rename the last column so featureSlice applies
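As a quick sanity check (a sketch using objects already defined above), we can confirm that slicing the renamed dataset yields exactly the columns the model was trained on:
all.equal(colnames(pmlTest[, featureSlice]), colnames(training))  # expected: TRUE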
Notice that we need to apply to the new dataset the same preprocessing we used on the training data, so we pass the feature-slice index to the dataset during prediction, as shown in the code below.
predict(modFit2, newdata = pmlTest[, featureSlice])
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
The selected model was able to predict the outcome value for each of the 20 observations in the pml-testing dataset.