Please refer to the Appendix for all code and graphs.



Synopsis


Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These type of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbell of 6 participants and predict what class of activity was taking place based on the predictor variables in the data set.


Executive Summary


The final model uses Random Forest modelling giving an out-of-sample accuracy of 99.54% with a 95% confidence interval of 99.28% to 99.73%. Recursive Partitioning failed to produce a reliable model while pre-processing Random Forest with Principal Component Analysis degraded accuracy by approximately 3%.


Data


The training data consists of 19622 observations with 160 variables.

There are 100 variables containing mostly empty or invalid data which are not considered in the model. Outside of these, there are a further 7 variables containing data irrelevant to prediction which are also removed from the the data set prior to any analysis and modelling, leaving 53 variables (52 possible predictors).

The data set is randomly split 80/20 into training and testing with the testing set used solely to measure out-of-sample error rate.


Exploratory Data Analysis Summary


A test for Near-Zero Variance was made, no variables tested positive.

More than half the variables showed some degree of collinearity, some heavily so.


Model Analysis Summary


Please refer to Model Analysis in the Appendix for code and full details.


Choosing a Model


An initial test using Recursive Partitioning was made. The model performed poorly with an accuracy rate of only 49.5% on out-of-sample data.

Due to the high collinearity discovered in EDA, Principal Component Analysis was run with an 80% threshold. PCA reduced the predictor count from 52 to 12.

The pre-processed PCA data was passed to a Random Forest training model with k-fold cross-validation, using 10 folds. This gave an out-of-sample accuracy of 96.8%.

Finally, the full 52 predictors were passed directly to Random Forest without PCA. Accuracy improved to 99.54%. This was the chosen model.


Testing the Model


In-data accuracy was 100% while out-of-data accuracy showed 99.54% with a 95% confidence interval of 99.28% to 99.73%.


Validating the model


Applying the final random forest model to the pml-testing data set produced the following predictions:

|1|  |2|  |3|  |4|  |5|  |6|  |7|  |8|  |9|  |10| |11| |12| |13| |14| |15| |16| |17| |18| |19| |20|
 B    A    B    A    A    E    D    B    A     A    B    C    B    A    E    E    A    B    B    B  

Conclusion


Basic Recursive Partitioning performed poorly with this data set producing an accuracy level of only 49.5%.

Principal Component Analysis and Random Forest performed well, dealing with high degrees of collinearity to produce an out-of-sample error rate of 96.8 ± 0.6% at the 95% confidence level.

Running Random Forest against the full 52 predictor set achieved an out-of-data accuracy showed 99.54% with a 95% confidence interval of 99.28% to 99.73%.

While PCA can be used to reduce variables with collinearity, the result is a loss of accuracy, Random Forest does a better job of dealing with this.

With a sample size of 20, we would expect no misclassifications for the validation set (from pml-testing.csv).


Appendix


Data


  • Read csv
  • Set empty & invalid values to NA
  • Drop columns with NAs
  • Drop first 7 columns which are irrelevant to prediction
  • Add factor to classe
  • Split training data into train and test subsets (80/20)
pml.train <- read_csv('./data/pml-training.csv', na = c("", "NA", "#DIV/0!")) %>% 
    select_if(~sum(is.na(.)) == 0) %>%
    select(-(`...1`:num_window)) %>%
    mutate(classe = as.factor(classe))

set.seed(123456) 
inTrain <- createDataPartition(pml.train$classe, p = 0.8, list = FALSE)
train.data <- pml.train[inTrain, ]
train.test <- pml.train[-inTrain, ]
rm(inTrain)
rm(pml.train)

Exploratory Data Analysis


Check for near-zero variance predictors

nearZeroVar(train.data, saveMetrics = TRUE)
##                      freqRatio percentUnique zeroVar   nzv
## roll_belt             1.093315    7.49729282   FALSE FALSE
## pitch_belt            1.050633   11.08987834   FALSE FALSE
## yaw_belt              1.007426   11.79055991   FALSE FALSE
## total_accel_belt      1.053962    0.18472514   FALSE FALSE
## gyros_belt_x          1.088785    0.84718772   FALSE FALSE
## gyros_belt_y          1.136075    0.43314861   FALSE FALSE
## gyros_belt_z          1.070822    1.04465253   FALSE FALSE
## accel_belt_x          1.061056    1.03191286   FALSE FALSE
## accel_belt_y          1.107819    0.89814638   FALSE FALSE
## accel_belt_z          1.116959    1.87273075   FALSE FALSE
## magnet_belt_x         1.048110    1.98738773   FALSE FALSE
## magnet_belt_y         1.095785    1.85999108   FALSE FALSE
## magnet_belt_z         1.015584    2.80909612   FALSE FALSE
## roll_arm             50.254545   15.93095102   FALSE FALSE
## pitch_arm            98.750000   18.59354099   FALSE FALSE
## yaw_arm              32.904762   17.40875215   FALSE FALSE
## total_accel_arm       1.015152    0.42040894   FALSE FALSE
## gyros_arm_x           1.039702    4.05758329   FALSE FALSE
## gyros_arm_y           1.444444    2.35046818   FALSE FALSE
## gyros_arm_z           1.125604    1.50965030   FALSE FALSE
## accel_arm_x           1.051095    4.89840117   FALSE FALSE
## accel_arm_y           1.075145    3.35053188   FALSE FALSE
## accel_arm_z           1.151515    4.96209950   FALSE FALSE
## magnet_arm_x          1.013889    8.48461685   FALSE FALSE
## magnet_arm_y          1.057971    5.52264475   FALSE FALSE
## magnet_arm_z          1.000000    8.01961908   FALSE FALSE
## roll_dumbbell         1.047619   85.78253392   FALSE FALSE
## pitch_dumbbell        2.263636   83.31740875   FALSE FALSE
## yaw_dumbbell          1.134021   85.10096184   FALSE FALSE
## total_accel_dumbbell  1.058288    0.27390280   FALSE FALSE
## gyros_dumbbell_x      1.002037    1.52238996   FALSE FALSE
## gyros_dumbbell_y      1.210417    1.73896427   FALSE FALSE
## gyros_dumbbell_z      1.082105    1.28033633   FALSE FALSE
## accel_dumbbell_x      1.011070    2.64985031   FALSE FALSE
## accel_dumbbell_y      1.059701    2.94286260   FALSE FALSE
## accel_dumbbell_z      1.071429    2.56704249   FALSE FALSE
## magnet_dumbbell_x     1.063830    7.00044589   FALSE FALSE
## magnet_dumbbell_y     1.240876    5.30607045   FALSE FALSE
## magnet_dumbbell_z     1.086667    4.26778776   FALSE FALSE
## roll_forearm         11.227437   12.65685712   FALSE FALSE
## pitch_forearm        60.980392   17.38964265   FALSE FALSE
## yaw_forearm          14.947115   11.70138225   FALSE FALSE
## total_accel_forearm   1.130916    0.43951844   FALSE FALSE
## gyros_forearm_x       1.011601    1.84725142   FALSE FALSE
## gyros_forearm_y       1.035831    4.63086821   FALSE FALSE
## gyros_forearm_z       1.046569    1.89821008   FALSE FALSE
## accel_forearm_x       1.057971    5.00031849   FALSE FALSE
## accel_forearm_y       1.120000    6.27428499   FALSE FALSE
## accel_forearm_z       1.116667    3.57984585   FALSE FALSE
## magnet_forearm_x      1.000000    9.44646156   FALSE FALSE
## magnet_forearm_y      1.205882   11.75234091   FALSE FALSE
## magnet_forearm_z      1.083333   10.46563475   FALSE FALSE
## classe                1.469388    0.03184916   FALSE FALSE

Check collinearity

# Build correlation matrix
# order by first principal component
corrplot(abs(cor(subset(train.data, select = -classe))), 
         method= "square", type = "lower", title = "Correlation Matrix Analysis",
         diag=FALSE, order="FPC", 
         tl.cex=0.55, tl.col="black", tl.srt = 45, 
         cl.pos = 'n', mar=c(0,0,1,0))


Model Analysis


Recursive Partition Test


Look at initial recursive partitioning model

fit.rpart <- train(classe ~ ., data = train.data, method = "rpart")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1014   17   83    0    2
##          B  301  265  193    0    0
##          C  333   21  330    0    0
##          D  285  108  250    0    0
##          E  111   93  184    0  333
## 
## Overall Statistics
##                                           
##                Accuracy : 0.495           
##                  95% CI : (0.4793, 0.5108)
##     No Information Rate : 0.521           
##     P-Value [Acc > NIR] : 0.9995          
##                                           
##                   Kappa : 0.3399          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.4961  0.52579  0.31731       NA  0.99403
## Specificity            0.9457  0.85551  0.87721   0.8361  0.89186
## Pos Pred Value         0.9086  0.34914  0.48246       NA  0.46186
## Neg Pred Value         0.6331  0.92446  0.78080       NA  0.99938
## Prevalence             0.5210  0.12847  0.26510   0.0000  0.08539
## Detection Rate         0.2585  0.06755  0.08412   0.0000  0.08488
## Detection Prevalence   0.2845  0.19347  0.17436   0.1639  0.18379
## Balanced Accuracy      0.7209  0.69065  0.59726       NA  0.94295

Only 49.5% accuracy on test set


Principle Component Analysis


Preprocess with PCA to reduce collinear predictors

preProc <- preProcess(train.data[, -which(names(train.data) == "classe")], 
                      method="pca", thresh=0.8)
PCA.train <- predict(preProc, train.data[, -which(names(train.data) == "classe")]) %>%
    mutate(classe = train.data$classe)
str(PCA.train)
## 'data.frame':    15699 obs. of  13 variables:
##  $ PC1   : num  4.48 4.49 4.5 4.5 4.49 ...
##  $ PC2   : num  1.32 1.35 1.4 1.37 1.32 ...
##  $ PC3   : num  -2.7 -2.7 -2.66 -2.7 -2.67 ...
##  $ PC4   : num  0.923 0.923 0.912 0.921 0.92 ...
##  $ PC5   : num  -1.29 -1.3 -1.33 -1.31 -1.3 ...
##  $ PC6   : num  2.03 2.04 2.12 2.06 2.05 ...
##  $ PC7   : num  -0.151 -0.179 -0.221 -0.188 -0.2 ...
##  $ PC8   : num  -2.75 -2.73 -2.71 -2.74 -2.74 ...
##  $ PC9   : num  -0.0374 -0.00543 -0.02964 -0.00252 -0.01568 ...
##  $ PC10  : num  -0.232 -0.255 -0.287 -0.236 -0.263 ...
##  $ PC11  : num  0.762 0.754 0.705 0.741 0.754 ...
##  $ PC12  : num  -1.019 -0.945 -0.935 -0.982 -1.014 ...
##  $ classe: Factor w/ 5 levels "A","B","C","D",..: 1 1 1 1 1 1 1 1 1 1 ...

PCA reduces to 12 predictors


Random Forest Modelling with PCA


Apply random forest model with k-fold cross-validation with 10 folds on PCA processed data

# use random forest
fit.rf <- train(classe ~ ., data = PCA.train, method = "rf", 
                trControl = trainControl(method = "cv", number = 10))
print(fit.rf)
## Random Forest 
## 
## 15699 samples
##    12 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 14128, 14129, 14129, 14130, 14128, 14129, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9623553  0.9523788
##    7    0.9561761  0.9445798
##   12    0.9509534  0.9379776
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Testing the Model


cf.train <- confusionMatrix(PCA.train$classe, predict(fit.rf, PCA.train))
cf.train
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4464    0    0    0    0
##          B    0 3038    0    0    0
##          C    0    0 2738    0    0
##          D    0    0    0 2573    0
##          E    0    0    0    0 2886
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9998, 1)
##     No Information Rate : 0.2843     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2843   0.1935   0.1744   0.1639   0.1838
## Detection Prevalence   0.2843   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

Accuracy on train data is 100%

#create PCA train.test data set
PCA.test <- predict(preProc, train.test[, -which(names(train.test) == "classe")]) %>%
    mutate(classe = train.test$classe)
cf.test <- confusionMatrix(PCA.test$classe, predict(fit.rf, PCA.test))
cf.test
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1095    8    9    3    1
##          B   12  723   14    7    3
##          C    2    9  666    5    2
##          D    5    2   27  608    1
##          E    0    4    4    7  706
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9681          
##                  95% CI : (0.9622, 0.9734)
##     No Information Rate : 0.284           
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9597          
##                                           
##  Mcnemar's Test P-Value : 0.0005748       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9829   0.9692   0.9250   0.9651   0.9902
## Specificity            0.9925   0.9887   0.9944   0.9894   0.9953
## Pos Pred Value         0.9812   0.9526   0.9737   0.9456   0.9792
## Neg Pred Value         0.9932   0.9927   0.9833   0.9933   0.9978
## Prevalence             0.2840   0.1902   0.1835   0.1606   0.1817
## Detection Rate         0.2791   0.1843   0.1698   0.1550   0.1800
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9877   0.9789   0.9597   0.9772   0.9928

Accuracy on test data is 96.8%


Random Forest Modelling without PCA


Run random forest on full training data without PCA to test if PCA adversely affects accuracy.

## Random Forest 
## 
## 15699 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 12559, 12558, 12560, 12559, 12560 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9933116  0.9915390
##   27    0.9924201  0.9904116
##   52    0.9857313  0.9819499
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    0    0    0    0
##          B    5  754    0    0    0
##          C    0    1  682    1    0
##          D    0    0    6  636    1
##          E    0    0    0    4  717
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9954          
##                  95% CI : (0.9928, 0.9973)
##     No Information Rate : 0.2858          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9942          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9955   0.9987   0.9913   0.9922   0.9986
## Specificity            1.0000   0.9984   0.9994   0.9979   0.9988
## Pos Pred Value         1.0000   0.9934   0.9971   0.9891   0.9945
## Neg Pred Value         0.9982   0.9997   0.9981   0.9985   0.9997
## Prevalence             0.2858   0.1925   0.1754   0.1634   0.1830
## Detection Rate         0.2845   0.1922   0.1738   0.1621   0.1828
## Detection Prevalence   0.2845   0.1935   0.1744   0.1639   0.1838
## Balanced Accuracy      0.9978   0.9985   0.9953   0.9950   0.9987

Accuracy on test data is 99.54% with a 95% confidence interval of 99.28% to 99.73%.

This is a good improvement on the model using PCA and will be used going forward.


Validation


Load and predict validation data

pml.validation <- read_csv('./data/pml-testing.csv', na = c("", "NA", "#DIV/0!")) 
t(as.data.frame(predict(fit.rf2, pml.validation))) %>% `rownames<-`( NULL )
##      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12] [,13] [,14]
## [1,] "B"  "A"  "B"  "A"  "A"  "E"  "D"  "B"  "A"  "A"   "B"   "C"   "B"   "A"  
##      [,15] [,16] [,17] [,18] [,19] [,20]
## [1,] "E"   "E"   "A"   "B"   "B"   "B"