One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this report, our goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants and build a model to quantify how well weight lifting exercises are done. Please see this webpage for information on this research.
After data exploration and data cleaning, cross-validation technique is used to build models using four different techniques. The best model shows an accuracy of 99.5% and that model is used to predict the outcome for the supplied test data.
The training data file and the test data file are loaded. We look at the structure of the data, head of the data, tail of the data and the column names. It is clear there are many columns that contain NA, #DIV/0! and blank as values. These are read as NA values to help the analysis.
There is a total of 19622 observations of 160 variables in the training data set. Test data set contains 20 observations of 160 variables.
# load libraries, set seed for reproducibility
library(caret)
library(gbm)
library(corrplot)
set.seed(1)
# load data
trainDF <- read.csv("C:/Users/mgrav/Desktop/pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testDF <- read.csv("C:/Users/mgrav/Desktop/pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))
# explore data
str(trainDF)
head(trainDF)
tail(trainDF)
colnames(trainDF)
There are 100 columns that contain more than 50% NA values. In fact, the number is the same whether are looking for 20% NA values or 80% NA values. These columns are removed.
# remove "NA" columns
naStats <- colSums(is.na(trainDF))
naCols <- (naStats/nrow(trainDF) > 0.5)
sum(naCols == TRUE)
sum((naStats/nrow(trainDF) > 0.2) == TRUE)
sum((naStats/nrow(trainDF) > 0.8) == TRUE)
trainDF <- trainDF[,!naCols]
testDF <- testDF[,!naCols]
There is no documentation from the original research about column names. Examination of the titles of the remaining column names show the first seven columns are not relevant to our modeling. Please see below.
# remove obviously unneeded columns
colnames(trainDF)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "gyros_belt_x"
## [13] "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [16] "accel_belt_y" "accel_belt_z" "magnet_belt_x"
## [19] "magnet_belt_y" "magnet_belt_z" "roll_arm"
## [22] "pitch_arm" "yaw_arm" "total_accel_arm"
## [25] "gyros_arm_x" "gyros_arm_y" "gyros_arm_z"
## [28] "accel_arm_x" "accel_arm_y" "accel_arm_z"
## [31] "magnet_arm_x" "magnet_arm_y" "magnet_arm_z"
## [34] "roll_dumbbell" "pitch_dumbbell" "yaw_dumbbell"
## [37] "total_accel_dumbbell" "gyros_dumbbell_x" "gyros_dumbbell_y"
## [40] "gyros_dumbbell_z" "accel_dumbbell_x" "accel_dumbbell_y"
## [43] "accel_dumbbell_z" "magnet_dumbbell_x" "magnet_dumbbell_y"
## [46] "magnet_dumbbell_z" "roll_forearm" "pitch_forearm"
## [49] "yaw_forearm" "total_accel_forearm" "gyros_forearm_x"
## [52] "gyros_forearm_y" "gyros_forearm_z" "accel_forearm_x"
## [55] "accel_forearm_y" "accel_forearm_z" "magnet_forearm_x"
## [58] "magnet_forearm_y" "magnet_forearm_z" "classe"
trainDF <- trainDF[,-c(1:7)]
testDF <- testDF[,-c(1:7)]
Let us take a look at the summary data for the remaining columns. Please note the distribution of classe variable which is our predictor variable (last one in the list).
summary(trainDF)
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00
## Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31
## 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y gyros_belt_z
## Min. :-1.040000 Min. :-0.64000 Min. :-1.4600
## 1st Qu.:-0.030000 1st Qu.: 0.00000 1st Qu.:-0.2000
## Median : 0.030000 Median : 0.02000 Median :-0.1000
## Mean :-0.005592 Mean : 0.03959 Mean :-0.1305
## 3rd Qu.: 0.110000 3rd Qu.: 0.11000 3rd Qu.:-0.0200
## Max. : 2.220000 Max. : 0.64000 Max. : 1.6200
## accel_belt_x accel_belt_y accel_belt_z magnet_belt_x
## Min. :-120.000 Min. :-69.00 Min. :-275.00 Min. :-52.0
## 1st Qu.: -21.000 1st Qu.: 3.00 1st Qu.:-162.00 1st Qu.: 9.0
## Median : -15.000 Median : 35.00 Median :-152.00 Median : 35.0
## Mean : -5.595 Mean : 30.15 Mean : -72.59 Mean : 55.6
## 3rd Qu.: -5.000 3rd Qu.: 61.00 3rd Qu.: 27.00 3rd Qu.: 59.0
## Max. : 85.000 Max. :164.00 Max. : 105.00 Max. :485.0
## magnet_belt_y magnet_belt_z roll_arm pitch_arm
## Min. :354.0 Min. :-623.0 Min. :-180.00 Min. :-88.800
## 1st Qu.:581.0 1st Qu.:-375.0 1st Qu.: -31.77 1st Qu.:-25.900
## Median :601.0 Median :-320.0 Median : 0.00 Median : 0.000
## Mean :593.7 Mean :-345.5 Mean : 17.83 Mean : -4.612
## 3rd Qu.:610.0 3rd Qu.:-306.0 3rd Qu.: 77.30 3rd Qu.: 11.200
## Max. :673.0 Max. : 293.0 Max. : 180.00 Max. : 88.500
## yaw_arm total_accel_arm gyros_arm_x gyros_arm_y
## Min. :-180.0000 Min. : 1.00 Min. :-6.37000 Min. :-3.4400
## 1st Qu.: -43.1000 1st Qu.:17.00 1st Qu.:-1.33000 1st Qu.:-0.8000
## Median : 0.0000 Median :27.00 Median : 0.08000 Median :-0.2400
## Mean : -0.6188 Mean :25.51 Mean : 0.04277 Mean :-0.2571
## 3rd Qu.: 45.8750 3rd Qu.:33.00 3rd Qu.: 1.57000 3rd Qu.: 0.1400
## Max. : 180.0000 Max. :66.00 Max. : 4.87000 Max. : 2.8400
## gyros_arm_z accel_arm_x accel_arm_y accel_arm_z
## Min. :-2.3300 Min. :-404.00 Min. :-318.0 Min. :-636.00
## 1st Qu.:-0.0700 1st Qu.:-242.00 1st Qu.: -54.0 1st Qu.:-143.00
## Median : 0.2300 Median : -44.00 Median : 14.0 Median : -47.00
## Mean : 0.2695 Mean : -60.24 Mean : 32.6 Mean : -71.25
## 3rd Qu.: 0.7200 3rd Qu.: 84.00 3rd Qu.: 139.0 3rd Qu.: 23.00
## Max. : 3.0200 Max. : 437.00 Max. : 308.0 Max. : 292.00
## magnet_arm_x magnet_arm_y magnet_arm_z roll_dumbbell
## Min. :-584.0 Min. :-392.0 Min. :-597.0 Min. :-153.71
## 1st Qu.:-300.0 1st Qu.: -9.0 1st Qu.: 131.2 1st Qu.: -18.49
## Median : 289.0 Median : 202.0 Median : 444.0 Median : 48.17
## Mean : 191.7 Mean : 156.6 Mean : 306.5 Mean : 23.84
## 3rd Qu.: 637.0 3rd Qu.: 323.0 3rd Qu.: 545.0 3rd Qu.: 67.61
## Max. : 782.0 Max. : 583.0 Max. : 694.0 Max. : 153.55
## pitch_dumbbell yaw_dumbbell total_accel_dumbbell
## Min. :-149.59 Min. :-150.871 Min. : 0.00
## 1st Qu.: -40.89 1st Qu.: -77.644 1st Qu.: 4.00
## Median : -20.96 Median : -3.324 Median :10.00
## Mean : -10.78 Mean : 1.674 Mean :13.72
## 3rd Qu.: 17.50 3rd Qu.: 79.643 3rd Qu.:19.00
## Max. : 149.40 Max. : 154.952 Max. :58.00
## gyros_dumbbell_x gyros_dumbbell_y gyros_dumbbell_z
## Min. :-204.0000 Min. :-2.10000 Min. : -2.380
## 1st Qu.: -0.0300 1st Qu.:-0.14000 1st Qu.: -0.310
## Median : 0.1300 Median : 0.03000 Median : -0.130
## Mean : 0.1611 Mean : 0.04606 Mean : -0.129
## 3rd Qu.: 0.3500 3rd Qu.: 0.21000 3rd Qu.: 0.030
## Max. : 2.2200 Max. :52.00000 Max. :317.000
## accel_dumbbell_x accel_dumbbell_y accel_dumbbell_z magnet_dumbbell_x
## Min. :-419.00 Min. :-189.00 Min. :-334.00 Min. :-643.0
## 1st Qu.: -50.00 1st Qu.: -8.00 1st Qu.:-142.00 1st Qu.:-535.0
## Median : -8.00 Median : 41.50 Median : -1.00 Median :-479.0
## Mean : -28.62 Mean : 52.63 Mean : -38.32 Mean :-328.5
## 3rd Qu.: 11.00 3rd Qu.: 111.00 3rd Qu.: 38.00 3rd Qu.:-304.0
## Max. : 235.00 Max. : 315.00 Max. : 318.00 Max. : 592.0
## magnet_dumbbell_y magnet_dumbbell_z roll_forearm pitch_forearm
## Min. :-3600 Min. :-262.00 Min. :-180.0000 Min. :-72.50
## 1st Qu.: 231 1st Qu.: -45.00 1st Qu.: -0.7375 1st Qu.: 0.00
## Median : 311 Median : 13.00 Median : 21.7000 Median : 9.24
## Mean : 221 Mean : 46.05 Mean : 33.8265 Mean : 10.71
## 3rd Qu.: 390 3rd Qu.: 95.00 3rd Qu.: 140.0000 3rd Qu.: 28.40
## Max. : 633 Max. : 452.00 Max. : 180.0000 Max. : 89.80
## yaw_forearm total_accel_forearm gyros_forearm_x
## Min. :-180.00 Min. : 0.00 Min. :-22.000
## 1st Qu.: -68.60 1st Qu.: 29.00 1st Qu.: -0.220
## Median : 0.00 Median : 36.00 Median : 0.050
## Mean : 19.21 Mean : 34.72 Mean : 0.158
## 3rd Qu.: 110.00 3rd Qu.: 41.00 3rd Qu.: 0.560
## Max. : 180.00 Max. :108.00 Max. : 3.970
## gyros_forearm_y gyros_forearm_z accel_forearm_x accel_forearm_y
## Min. : -7.02000 Min. : -8.0900 Min. :-498.00 Min. :-632.0
## 1st Qu.: -1.46000 1st Qu.: -0.1800 1st Qu.:-178.00 1st Qu.: 57.0
## Median : 0.03000 Median : 0.0800 Median : -57.00 Median : 201.0
## Mean : 0.07517 Mean : 0.1512 Mean : -61.65 Mean : 163.7
## 3rd Qu.: 1.62000 3rd Qu.: 0.4900 3rd Qu.: 76.00 3rd Qu.: 312.0
## Max. :311.00000 Max. :231.0000 Max. : 477.00 Max. : 923.0
## accel_forearm_z magnet_forearm_x magnet_forearm_y magnet_forearm_z
## Min. :-446.00 Min. :-1280.0 Min. :-896.0 Min. :-973.0
## 1st Qu.:-182.00 1st Qu.: -616.0 1st Qu.: 2.0 1st Qu.: 191.0
## Median : -39.00 Median : -378.0 Median : 591.0 Median : 511.0
## Mean : -55.29 Mean : -312.6 Mean : 380.1 Mean : 393.6
## 3rd Qu.: 26.00 3rd Qu.: -73.0 3rd Qu.: 737.0 3rd Qu.: 653.0
## Max. : 291.00 Max. : 672.0 Max. :1480.0 Max. :1090.0
## classe
## A:5580
## B:3797
## C:3422
## D:3216
## E:3607
##
Let us do a correlation plot!
corDF <- trainDF
corDF$classe <- as.numeric(corDF$classe)
trainCor <- cor(corDF)
corrplot(trainCor, method = "color", tl.col = "black", tl.cex = 0.6)
Test data set has only 20 rows whereas the training data set has 19622 rows. Recommended training to test ratio is about 70 to 30. Hence, we are going to divide the training data into two data sets: cross validation training set and cross validation test set. We are using a ratio of 75-25 for this.
The following plot shows the distribution of classe variable in the cross validation training data set.
# cross validation requires subsetting trainDF
inTrain = createDataPartition(trainDF$classe, p = 3/4)[[1]]
cvTrainDF = trainDF[ inTrain,]
cvTestDF = trainDF[-inTrain,]
# plot showing distribution of predictor in Cross Validation training set
barplot(table(cvTrainDF$classe), main = "Distribution of classe for Cross Validation Training Set",
ylab = "Frequency", xlab = 'classe', col = "salmon")
The following algorithms are used for building models and evaluating them:
Random Forest
Gradient Boosting Machine (GBM)
Linear Discriminant Analysis (LDA)
Recursive Paritioning for Classification (RPART)
mdlRF <- train(classe ~ ., data = cvTrainDF, method = "rf")
mdlGBM <- train(classe ~ ., data = cvTrainDF, method = "gbm")
mdlLDA <- train(classe ~ ., data = cvTrainDF, method = "lda")
mdlRPART <- train(classe ~ ., data = cvTrainDF, method = "rpart")
# for faster model building!
ctrl <- trainControl(allowParallel = T, method = "cv", number = 4)
mdlRF2 <- train(classe ~ ., data = cvTrainDF, method = "rf", ntree = 100, trControl = ctrl)
We use the cross validation test data set to perform predictions for all the above models and calculate the confusion matrix.
# predict
predictionRF <- predict(mdlRF, cvTestDF)
predictionGBM <- predict(mdlGBM, cvTestDF)
predictionLDA <- predict(mdlLDA, cvTestDF)
predictionRPART <- predict(mdlRPART, cvTestDF)
predictionRF2 <- predict(mdlRF2, cvTestDF)
# accuracy
cmRF <- confusionMatrix(cvTestDF$classe, predictionRF)
cmGBM <- confusionMatrix(cvTestDF$classe, predictionGBM)
cmLDA <- confusionMatrix(cvTestDF$classe, predictionLDA)
cmRPART <- confusionMatrix(cvTestDF$classe, predictionRPART)
cmRF2 <- confusionMatrix(cvTestDF$classe, predictionRF2)
Please see below for the accuracy of various models.
accuracyStr <-paste0("ACCURACY OF MODELS", "\n",
"Random Forest: ", format(round(cmRF$overall['Accuracy'], 3), nsmall = 3), "\n",
"Random Forest2: ", format(round(cmRF2$overall['Accuracy'], 3), nsmall = 3), "\n",
"GBM: ", format(round(cmGBM$overall['Accuracy'], 3), nsmall = 3), "\n",
"LDA: ", format(round(cmLDA$overall['Accuracy'], 3), nsmall = 3), "\n",
"RPART: ", format(round(cmRPART$overall['Accuracy'], 3), nsmall = 3))
cat(accuracyStr)
## ACCURACY OF MODELS
## Random Forest: 0.995
## Random Forest2: 0.995
## GBM: 0.963
## LDA: 0.709
## RPART: 0.501
Random Forest is the model of our choice as the accuracy is 0.995. Since Error Rate is (1 - Accuracy), the error rate is 0.005. Please see the confusion matrix for the Random Forest Model!
cmRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1391 1 2 0 1
## B 5 943 1 0 0
## C 0 4 851 0 0
## D 0 0 9 794 1
## E 0 0 0 2 899
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9922, 0.9965)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9933
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9947 0.9861 0.9975 0.9978
## Specificity 0.9989 0.9985 0.9990 0.9976 0.9995
## Pos Pred Value 0.9971 0.9937 0.9953 0.9876 0.9978
## Neg Pred Value 0.9986 0.9987 0.9970 0.9995 0.9995
## Prevalence 0.2847 0.1933 0.1760 0.1623 0.1837
## Detection Rate 0.2836 0.1923 0.1735 0.1619 0.1833
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9976 0.9966 0.9926 0.9975 0.9986
cmRF2
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 1 2 0 0
## B 5 943 1 0 0
## C 0 4 849 2 0
## D 0 0 7 796 1
## E 0 0 0 3 898
##
## Overall Statistics
##
## Accuracy : 0.9947
## 95% CI : (0.9922, 0.9965)
## No Information Rate : 0.2849
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9933
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9964 0.9947 0.9884 0.9938 0.9989
## Specificity 0.9991 0.9985 0.9985 0.9981 0.9993
## Pos Pred Value 0.9978 0.9937 0.9930 0.9900 0.9967
## Neg Pred Value 0.9986 0.9987 0.9975 0.9988 0.9998
## Prevalence 0.2849 0.1933 0.1752 0.1633 0.1833
## Detection Rate 0.2838 0.1923 0.1731 0.1623 0.1831
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9978 0.9966 0.9934 0.9959 0.9991
Using the Random Forest Model, the following shows the prediction for the original testing data set we started with!
predictionFinalRF <- predict(mdlRF, testDF)
predictionFinalRF
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13) . Stuttgart, Germany: ACM SIGCHI, 2013
THE END