Please refer to the Appendix for all code and graphs.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.
The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants and predict what class of activity was taking place based on the predictor variables in the data set.
The final model uses Random Forest modelling giving an out-of-sample accuracy of 99.54% with a 95% confidence interval of 99.28% to 99.73%. Recursive Partitioning failed to produce a reliable model while pre-processing Random Forest with Principal Component Analysis degraded accuracy by approximately 3%.
The training data consists of 19622 observations with 160 variables.
There are 100 variables containing mostly empty or invalid data which are not considered in the model. Outside of these, there are a further 7 variables containing data irrelevant to prediction which are also removed from the the data set prior to any analysis and modelling, leaving 53 variables (52 possible predictors).
The data set is randomly split 80/20 into training and testing with the testing set used solely to measure out-of-sample error rate.
A test for Near-Zero Variance was made, no variables tested positive.
More than half the variables showed some degree of collinearity, some heavily so.
Please refer to Model Analysis in the Appendix for code and full details.
An initial test using Recursive Partitioning was made. The model performed poorly with an accuracy rate of only 49.5% on out-of-sample data.
Due to the high collinearity discovered in EDA, Principal Component Analysis was run with an 80% threshold. PCA reduced the predictor count from 52 to 12.
The pre-processed PCA data was passed to a Random Forest training model with k-fold cross-validation, using 10 folds. This gave an out-of-sample accuracy of 96.8%.
Finally, the full 52 predictors were passed directly to Random Forest without PCA. Accuracy improved to 99.54%. This was the chosen model.
In-data accuracy was 100% while out-of-data accuracy showed 99.54% with a 95% confidence interval of 99.28% to 99.73%.
Applying the final random forest model to the pml-testing data set produced the following predictions:
|1| |2| |3| |4| |5| |6| |7| |8| |9| |10| |11| |12| |13| |14| |15| |16| |17| |18| |19| |20|
B A B A A E D B A A B C B A E E A B B B
Basic Recursive Partitioning performed poorly with this data set producing an accuracy level of only 49.5%.
Principal Component Analysis and Random Forest performed well, dealing with high degrees of collinearity to produce an out-of-sample error rate of 96.8 ± 0.6% at the 95% confidence level.
Running Random Forest against the full 52 predictor set achieved an out-of-data accuracy showed 99.54% with a 95% confidence interval of 99.28% to 99.73%.
While PCA can be used to reduce variables with collinearity, the result is a loss of accuracy, Random Forest does a better job of dealing with this.
With a sample size of 20, we would expect no misclassifications for
the validation set (from pml-testing.csv).
classepml.train <- read_csv('./data/pml-training.csv', na = c("", "NA", "#DIV/0!")) %>%
select_if(~sum(is.na(.)) == 0) %>%
select(-(`...1`:num_window)) %>%
mutate(classe = as.factor(classe))
set.seed(123456)
inTrain <- createDataPartition(pml.train$classe, p = 0.8, list = FALSE)
train.data <- pml.train[inTrain, ]
train.test <- pml.train[-inTrain, ]
rm(inTrain)
rm(pml.train)Check for near-zero variance predictors
nearZeroVar(train.data, saveMetrics = TRUE)## freqRatio percentUnique zeroVar nzv
## roll_belt 1.093315 7.49729282 FALSE FALSE
## pitch_belt 1.050633 11.08987834 FALSE FALSE
## yaw_belt 1.007426 11.79055991 FALSE FALSE
## total_accel_belt 1.053962 0.18472514 FALSE FALSE
## gyros_belt_x 1.088785 0.84718772 FALSE FALSE
## gyros_belt_y 1.136075 0.43314861 FALSE FALSE
## gyros_belt_z 1.070822 1.04465253 FALSE FALSE
## accel_belt_x 1.061056 1.03191286 FALSE FALSE
## accel_belt_y 1.107819 0.89814638 FALSE FALSE
## accel_belt_z 1.116959 1.87273075 FALSE FALSE
## magnet_belt_x 1.048110 1.98738773 FALSE FALSE
## magnet_belt_y 1.095785 1.85999108 FALSE FALSE
## magnet_belt_z 1.015584 2.80909612 FALSE FALSE
## roll_arm 50.254545 15.93095102 FALSE FALSE
## pitch_arm 98.750000 18.59354099 FALSE FALSE
## yaw_arm 32.904762 17.40875215 FALSE FALSE
## total_accel_arm 1.015152 0.42040894 FALSE FALSE
## gyros_arm_x 1.039702 4.05758329 FALSE FALSE
## gyros_arm_y 1.444444 2.35046818 FALSE FALSE
## gyros_arm_z 1.125604 1.50965030 FALSE FALSE
## accel_arm_x 1.051095 4.89840117 FALSE FALSE
## accel_arm_y 1.075145 3.35053188 FALSE FALSE
## accel_arm_z 1.151515 4.96209950 FALSE FALSE
## magnet_arm_x 1.013889 8.48461685 FALSE FALSE
## magnet_arm_y 1.057971 5.52264475 FALSE FALSE
## magnet_arm_z 1.000000 8.01961908 FALSE FALSE
## roll_dumbbell 1.047619 85.78253392 FALSE FALSE
## pitch_dumbbell 2.263636 83.31740875 FALSE FALSE
## yaw_dumbbell 1.134021 85.10096184 FALSE FALSE
## total_accel_dumbbell 1.058288 0.27390280 FALSE FALSE
## gyros_dumbbell_x 1.002037 1.52238996 FALSE FALSE
## gyros_dumbbell_y 1.210417 1.73896427 FALSE FALSE
## gyros_dumbbell_z 1.082105 1.28033633 FALSE FALSE
## accel_dumbbell_x 1.011070 2.64985031 FALSE FALSE
## accel_dumbbell_y 1.059701 2.94286260 FALSE FALSE
## accel_dumbbell_z 1.071429 2.56704249 FALSE FALSE
## magnet_dumbbell_x 1.063830 7.00044589 FALSE FALSE
## magnet_dumbbell_y 1.240876 5.30607045 FALSE FALSE
## magnet_dumbbell_z 1.086667 4.26778776 FALSE FALSE
## roll_forearm 11.227437 12.65685712 FALSE FALSE
## pitch_forearm 60.980392 17.38964265 FALSE FALSE
## yaw_forearm 14.947115 11.70138225 FALSE FALSE
## total_accel_forearm 1.130916 0.43951844 FALSE FALSE
## gyros_forearm_x 1.011601 1.84725142 FALSE FALSE
## gyros_forearm_y 1.035831 4.63086821 FALSE FALSE
## gyros_forearm_z 1.046569 1.89821008 FALSE FALSE
## accel_forearm_x 1.057971 5.00031849 FALSE FALSE
## accel_forearm_y 1.120000 6.27428499 FALSE FALSE
## accel_forearm_z 1.116667 3.57984585 FALSE FALSE
## magnet_forearm_x 1.000000 9.44646156 FALSE FALSE
## magnet_forearm_y 1.205882 11.75234091 FALSE FALSE
## magnet_forearm_z 1.083333 10.46563475 FALSE FALSE
## classe 1.469388 0.03184916 FALSE FALSE
Check collinearity
# Build correlation matrix
# order by first principal component
corrplot(abs(cor(subset(train.data, select = -classe))),
method= "square", type = "lower", title = "Correlation Matrix Analysis",
diag=FALSE, order="FPC",
tl.cex=0.55, tl.col="black", tl.srt = 45,
cl.pos = 'n', mar=c(0,0,1,0))Look at initial recursive partitioning model
fit.rpart <- train(classe ~ ., data = train.data, method = "rpart")## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1014 17 83 0 2
## B 301 265 193 0 0
## C 333 21 330 0 0
## D 285 108 250 0 0
## E 111 93 184 0 333
##
## Overall Statistics
##
## Accuracy : 0.495
## 95% CI : (0.4793, 0.5108)
## No Information Rate : 0.521
## P-Value [Acc > NIR] : 0.9995
##
## Kappa : 0.3399
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.4961 0.52579 0.31731 NA 0.99403
## Specificity 0.9457 0.85551 0.87721 0.8361 0.89186
## Pos Pred Value 0.9086 0.34914 0.48246 NA 0.46186
## Neg Pred Value 0.6331 0.92446 0.78080 NA 0.99938
## Prevalence 0.5210 0.12847 0.26510 0.0000 0.08539
## Detection Rate 0.2585 0.06755 0.08412 0.0000 0.08488
## Detection Prevalence 0.2845 0.19347 0.17436 0.1639 0.18379
## Balanced Accuracy 0.7209 0.69065 0.59726 NA 0.94295
Only 49.5% accuracy on test set
Preprocess with PCA to reduce collinear predictors
preProc <- preProcess(train.data[, -which(names(train.data) == "classe")],
method="pca", thresh=0.8)
PCA.train <- predict(preProc, train.data[, -which(names(train.data) == "classe")]) %>%
mutate(classe = train.data$classe)
str(PCA.train)## 'data.frame': 15699 obs. of 13 variables:
## $ PC1 : num 4.48 4.49 4.5 4.5 4.49 ...
## $ PC2 : num 1.32 1.35 1.4 1.37 1.32 ...
## $ PC3 : num -2.7 -2.7 -2.66 -2.7 -2.67 ...
## $ PC4 : num 0.923 0.923 0.912 0.921 0.92 ...
## $ PC5 : num -1.29 -1.3 -1.33 -1.31 -1.3 ...
## $ PC6 : num 2.03 2.04 2.12 2.06 2.05 ...
## $ PC7 : num -0.151 -0.179 -0.221 -0.188 -0.2 ...
## $ PC8 : num -2.75 -2.73 -2.71 -2.74 -2.74 ...
## $ PC9 : num -0.0374 -0.00543 -0.02964 -0.00252 -0.01568 ...
## $ PC10 : num -0.232 -0.255 -0.287 -0.236 -0.263 ...
## $ PC11 : num 0.762 0.754 0.705 0.741 0.754 ...
## $ PC12 : num -1.019 -0.945 -0.935 -0.982 -1.014 ...
## $ classe: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...
PCA reduces to 12 predictors
Apply random forest model with k-fold cross-validation with 10 folds on PCA processed data
# use random forest
fit.rf <- train(classe ~ ., data = PCA.train, method = "rf",
trControl = trainControl(method = "cv", number = 10))
print(fit.rf)## Random Forest
##
## 15699 samples
## 12 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 14128, 14129, 14129, 14130, 14128, 14129, ...
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9623553 0.9523788
## 7 0.9561761 0.9445798
## 12 0.9509534 0.9379776
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
cf.train <- confusionMatrix(PCA.train$classe, predict(fit.rf, PCA.train))
cf.train## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 4464 0 0 0 0
## B 0 3038 0 0 0
## C 0 0 2738 0 0
## D 0 0 0 2573 0
## E 0 0 0 0 2886
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9998, 1)
## No Information Rate : 0.2843
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2843 0.1935 0.1744 0.1639 0.1838
## Detection Prevalence 0.2843 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
Accuracy on train data is 100%
#create PCA train.test data set
PCA.test <- predict(preProc, train.test[, -which(names(train.test) == "classe")]) %>%
mutate(classe = train.test$classe)
cf.test <- confusionMatrix(PCA.test$classe, predict(fit.rf, PCA.test))
cf.test## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1095 8 9 3 1
## B 12 723 14 7 3
## C 2 9 666 5 2
## D 5 2 27 608 1
## E 0 4 4 7 706
##
## Overall Statistics
##
## Accuracy : 0.9681
## 95% CI : (0.9622, 0.9734)
## No Information Rate : 0.284
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9597
##
## Mcnemar's Test P-Value : 0.0005748
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9829 0.9692 0.9250 0.9651 0.9902
## Specificity 0.9925 0.9887 0.9944 0.9894 0.9953
## Pos Pred Value 0.9812 0.9526 0.9737 0.9456 0.9792
## Neg Pred Value 0.9932 0.9927 0.9833 0.9933 0.9978
## Prevalence 0.2840 0.1902 0.1835 0.1606 0.1817
## Detection Rate 0.2791 0.1843 0.1698 0.1550 0.1800
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9877 0.9789 0.9597 0.9772 0.9928
Accuracy on test data is 96.8%
Run random forest on full training data without PCA to test if PCA adversely affects accuracy.
## Random Forest
##
## 15699 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 12559, 12558, 12560, 12559, 12560
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9933116 0.9915390
## 27 0.9924201 0.9904116
## 52 0.9857313 0.9819499
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 0 0 0 0
## B 5 754 0 0 0
## C 0 1 682 1 0
## D 0 0 6 636 1
## E 0 0 0 4 717
##
## Overall Statistics
##
## Accuracy : 0.9954
## 95% CI : (0.9928, 0.9973)
## No Information Rate : 0.2858
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9942
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9955 0.9987 0.9913 0.9922 0.9986
## Specificity 1.0000 0.9984 0.9994 0.9979 0.9988
## Pos Pred Value 1.0000 0.9934 0.9971 0.9891 0.9945
## Neg Pred Value 0.9982 0.9997 0.9981 0.9985 0.9997
## Prevalence 0.2858 0.1925 0.1754 0.1634 0.1830
## Detection Rate 0.2845 0.1922 0.1738 0.1621 0.1828
## Detection Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Balanced Accuracy 0.9978 0.9985 0.9953 0.9950 0.9987
Accuracy on test data is 99.54% with a 95% confidence interval of 99.28% to 99.73%.
This is a good improvement on the model using PCA and will be used going forward.
Load and predict validation data
pml.validation <- read_csv('./data/pml-testing.csv', na = c("", "NA", "#DIV/0!"))
t(as.data.frame(predict(fit.rf2, pml.validation))) %>% `rownames<-`( NULL )## [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] "B" "A" "B" "A" "A" "E" "D" "B" "A" "A" "B" "C" "B" "A"
## [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] "E" "E" "A" "B" "B" "B"