Instructions

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.

Download & load data

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
              destfile = './pml-training.csv', method = 'curl', quiet = T)

download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',
              destfile = './pml-testing.csv', method = 'curl', quiet = T)

train <- read.csv("pml-training.csv", header=T, na.strings=c("","NA"))
test <- read.csv("pml-testing.csv", header=T, na.strings=c("","NA"))
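The loading step can be made repeatable by skipping the download when the file is already on disk. The sketch below is only an illustration: `fetch_if_missing` is a hypothetical helper that is not part of the original report, and a small temporary CSV stands in for the real data so that the example runs offline. It also shows what the `na.strings = c("", "NA")` argument used above does.

```r
# Hypothetical helper (not from the original report): download a file only
# when the destination does not exist yet.
fetch_if_missing <- function(url, destfile) {
  if (!file.exists(destfile)) {
    download.file(url, destfile = destfile, quiet = TRUE)
  }
  invisible(destfile)
}

# Toy CSV standing in for the real data, so the example runs offline.
tmp <- tempfile(fileext = ".csv")
writeLines(c("x,y,classe", "1,,A", "2,NA,B", "3,0.5,A"), tmp)

# The file already exists, so no download is attempted; the URL is a placeholder.
path <- fetch_if_missing("https://example.invalid/unused.csv", tmp)

# Both empty cells and the literal string "NA" are read as missing values.
toy <- read.csv(path, header = TRUE, na.strings = c("", "NA"))
na_per_column <- colSums(is.na(toy))
print(na_per_column)
```

In this toy file only column `y` contains missing values, so a filter of the form `colSums(is.na(toy)) == 0` would keep `x` and `classe`.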

Pre-process data

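# Drop every column that contains at least one missing value
train <- train[, colSums(is.na(train)) == 0]
test <- test[, colSums(is.na(test)) == 0]

# Drop the first seven columns, which are identifiers and bookkeeping
# fields rather than sensor readings
train <- train[-c(1:7)]
test <- test[-c(1:7)]
# Drop column 53 of the test set (the test-case ID), which is not a predictor
test <- test[-c(53)]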

# Recode the outcome from the letters A-E to the labels 1-5
# (a plain relabelling, not dummy coding)
train[train$classe == "A",]$classe = 1
train[train$classe == "B",]$classe = 2
train[train$classe == "C",]$classe = 3
train[train$classe == "D",]$classe = 4
train[train$classe == "E",]$classe = 5

# Convert the outcome to a factor, store it separately, and drop it
# from the predictor columns
train$classe <- as.factor(train$classe)
train_classe <- train[c(53)]
train <- train[-c(53)]

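# Center and scale the predictors; preProcess() comes from the caret package.
# The means and standard deviations are estimated on the training data and
# then applied to both data sets.
library(caret)
preProcValues <- preProcess(train, method = c("center", "scale"))
train <- predict(preProcValues, train)
test <- predict(preProcValues, test)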

# Remove highly correlated predictors (absolute pairwise correlation > 0.8)
df2 = cor(train)
hc = findCorrelation(df2, cutoff=0.8) # column indices flagged for removal
hc = sort(hc)
train_reduced = train[,-c(hc)]
test_reduced = test[,-c(hc)]

#Add back the outcome var to the training data set
new_train <- cbind(train_reduced, train_classe)
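The two numeric pre-processing steps above can be illustrated in base R on synthetic data. This is a minimal sketch under stated assumptions, not a reimplementation of caret: the data frames `train_x` and `test_x` are made up, and the correlation step only lists the offending pair instead of reproducing the selection rule of `findCorrelation()`.

```r
set.seed(1)

# Synthetic predictors: column c is built to be nearly collinear with a.
train_x <- data.frame(a = rnorm(100, 10, 2), b = rnorm(100, -5, 0.5))
train_x$c <- 2 * train_x$a + rnorm(100, sd = 0.1)
test_x <- data.frame(a = rnorm(20, 10, 2), b = rnorm(20, -5, 0.5))
test_x$c <- 2 * test_x$a + rnorm(20, sd = 0.1)

# Estimate centering and scaling constants on the training data only.
mu <- colMeans(train_x)
sigma <- sapply(train_x, sd)

# Apply the same constants to both sets, so the test data never
# influences the transformation.
train_s <- as.data.frame(scale(train_x, center = mu, scale = sigma))
test_s <- as.data.frame(scale(test_x, center = mu, scale = sigma))

# List predictor pairs whose absolute correlation exceeds 0.8.
cm <- abs(cor(train_s))
high <- which(cm > 0.8 & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(first = rownames(cm)[high[, "row"]],
                         second = colnames(cm)[high[, "col"]])
print(high_pairs)
```

Only the pair `a` and `c` is flagged here, so dropping either of the two would remove the redundancy.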

Data analyses

Split data into 70% training and 30% validation data sets

set.seed(58789)
inTrain <- createDataPartition(y = new_train$classe, p = 0.7, list = F)
training <- new_train[inTrain, ]
validation <- new_train[-inTrain, ]
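The split above is stratified by outcome class. The base R sketch below illustrates that idea by sampling 70% of the row indices within each class separately. It is an approximation for illustration only and does not claim to match the exact algorithm of `createDataPartition()`; the outcome vector `y` is synthetic.

```r
set.seed(58789)

# Synthetic, deliberately unbalanced outcome with five classes.
y <- factor(rep(c("A", "B", "C", "D", "E"), times = c(50, 40, 30, 20, 10)))

# Sample 70% of the row indices inside each class.
in_train <- unlist(lapply(split(seq_along(y), y), function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE)

train_counts <- table(y[in_train])
valid_counts <- table(y[-in_train])

# Class proportions in the training part match those of the full data.
print(train_counts / table(y))
```

Because the sampling happens per class, rare classes keep the same share in the training and validation parts as in the full data.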

Bagging

Create bootstrap aggregation (bagging) prediction model

library(adabag) # provides bagging(), boosting(), and their .cv variants
fit_bag <- bagging(classe ~. , data = training)
pred_bag <- predict(fit_bag, validation)
confusionMatrix(as.factor(pred_bag$class), validation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 1549  226   46   72   72
##          2   38  513   36   50  117
##          3   33  263  888  161  215
##          4   52  102   36  623   92
##          5    2   35   20   58  586
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7067          
##                  95% CI : (0.6949, 0.7183)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6273          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9253  0.45040   0.8655   0.6463  0.54159
## Specificity            0.9012  0.94922   0.8617   0.9427  0.97606
## Pos Pred Value         0.7883  0.68037   0.5692   0.6884  0.83595
## Neg Pred Value         0.9681  0.87800   0.9681   0.9315  0.90432
## Prevalence             0.2845  0.19354   0.1743   0.1638  0.18386
## Detection Rate         0.2632  0.08717   0.1509   0.1059  0.09958
## Detection Prevalence   0.3339  0.12812   0.2651   0.1538  0.11912
## Balanced Accuracy      0.9133  0.69981   0.8636   0.7945  0.75882
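The mechanics of bagging can be sketched by hand with `rpart` trees: fit each tree on a bootstrap resample, then let the trees vote. The example below uses the built-in `iris` data and 25 trees, both of which are assumptions made for illustration; it is not the implementation used by `adabag::bagging()`.

```r
library(rpart)
set.seed(1)

n <- nrow(iris)
n_trees <- 25

# Fit one classification tree per bootstrap resample of the rows.
trees <- lapply(seq_len(n_trees), function(b) {
  idx <- sample(n, n, replace = TRUE)
  rpart(Species ~ ., data = iris[idx, ], method = "class")
})

# Collect each tree's predicted class for every row (n x n_trees matrix).
votes <- sapply(trees, function(tr) {
  as.character(predict(tr, newdata = iris, type = "class"))
})

# Majority vote across trees gives the bagged prediction.
bagged <- apply(votes, 1, function(v) names(which.max(table(v))))

# Resubstitution accuracy, which is optimistic because the same rows
# were used for fitting.
accuracy <- mean(bagged == iris$Species)
print(accuracy)
```

The accuracy printed here is measured on the rows used for fitting, so it overstates performance; the report's validation-set figures are the fairer measure.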

Report accuracy metrics

cat("The accuracy of the bagging model is:", 
    confusionMatrix(as.factor(pred_bag$class), validation$classe)$overall[1])

## The accuracy of the bagging model is: 0.706712

cat("The Kappa statistic of the bagging model is:"
  , confusionMatrix(as.factor(pred_bag$class), validation$classe)$overall[2])

## The Kappa statistic of the bagging model is: 0.6272959
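Both metrics can be recomputed directly from the bagging confusion matrix reported above. The sketch below enters those counts by hand and applies the standard definitions: accuracy is the diagonal share, and Cohen's Kappa compares that observed agreement with the agreement expected from the row and column totals alone.

```r
# Counts copied from the bagging confusion matrix
# (rows = predictions, columns = reference classes).
cm <- matrix(c(1549, 226,  46,  72,  72,
                 38, 513,  36,  50, 117,
                 33, 263, 888, 161, 215,
                 52, 102,  36, 623,  92,
                  2,  35,  20,  58, 586),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = 1:5, Reference = 1:5))

n <- sum(cm)

# Observed agreement: share of cases on the diagonal.
observed <- sum(diag(cm)) / n

# Agreement expected by chance, given the marginal totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's Kappa rescales accuracy so that chance-level agreement maps to 0.
kappa_hat <- (observed - expected) / (1 - expected)

print(round(c(accuracy = observed, kappa = kappa_hat), 4))
```

The results agree with the values printed by `confusionMatrix()` (accuracy 0.7067, Kappa 0.6273).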

Perform 5-fold cross validation applied with bagging using 50 trees

library(rpart)
fit_bag_cv <- bagging.cv(classe ~. , data = training, v = 5, mfinal = 50)

Calculate the out-of-sample error estimate from the cross-validation

cat("The out-of-sample error estimate from 5-fold cross-validation with bagging is:",
    fit_bag_cv$error)

## The out-of-sample error estimate from 5-fold cross-validation with bagging is: 0.2751693
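The logic of a k-fold error estimate can be sketched in a few lines: each row is held out exactly once, predicted by a model that never saw it, and the misclassifications are pooled. The example below uses a single `rpart` tree on the built-in `iris` data as a stand-in model (an assumption for illustration); it is not the internals of `bagging.cv()`.

```r
library(rpart)
set.seed(58789)

k <- 5
n <- nrow(iris)

# Randomly assign each row to one of k equally sized folds.
fold <- sample(rep(seq_len(k), length.out = n))

pred <- character(n)
for (i in seq_len(k)) {
  held_out <- fold == i
  # Fit on the other k - 1 folds, predict the held-out fold.
  fit <- rpart(Species ~ ., data = iris[!held_out, ], method = "class")
  pred[held_out] <- as.character(predict(fit, iris[held_out, ], type = "class"))
}

# Pooled misclassification rate over all held-out predictions.
cv_error <- mean(pred != iris$Species)
print(cv_error)
```

Every prediction comes from a model fitted without that row, which is why the pooled rate serves as an out-of-sample error estimate.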

Boosting

Create sequential ensemble (boosting) prediction model

fit_bst <- boosting(classe ~. , data = training)
pred_bst <- predict.boosting(fit_bst, newdata = validation, type = "class")
confusionMatrix(as.factor(pred_bst$class), validation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 1637   92    1   16    4
##          2    9  932   65    7   20
##          3   13   92  937   64   43
##          4   11   19   21  849   30
##          5    4    4    2   28  985
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9074          
##                  95% CI : (0.8997, 0.9147)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.8827          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            0.9779   0.8183   0.9133   0.8807   0.9104
## Specificity            0.9732   0.9787   0.9564   0.9835   0.9921
## Pos Pred Value         0.9354   0.9022   0.8155   0.9129   0.9629
## Neg Pred Value         0.9911   0.9573   0.9812   0.9768   0.9800
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2782   0.1584   0.1592   0.1443   0.1674
## Detection Prevalence   0.2974   0.1755   0.1952   0.1580   0.1738
## Balanced Accuracy      0.9755   0.8985   0.9348   0.9321   0.9512

Report accuracy metrics for the boosting model

cat("The accuracy of the boosting model is:"
  , confusionMatrix(as.factor(pred_bst$class), validation$classe)$overall[1])

## The accuracy of the boosting model is: 0.9073917

cat("The Kappa statistic of the boosting model is:"
  , confusionMatrix(as.factor(pred_bst$class), validation$classe)$overall[2])

## The Kappa statistic of the boosting model is: 0.8827083

Perform 5-fold cross validation applied with boosting using 50 trees

fit_bst_cv <- boosting.cv(classe ~. , data = training, v = 5,  mfinal = 50)

## i:  1 Tue Feb  7 13:04:56 2023 
## i:  2 Tue Feb  7 13:06:31 2023 
## i:  3 Tue Feb  7 13:08:07 2023 
## i:  4 Tue Feb  7 13:09:44 2023 
## i:  5 Tue Feb  7 13:11:18 2023

Report the out-of-sample error estimate from the cross-validation

cat("The out-of-sample error estimate from 5-fold cross-validation with boosting is:",
    fit_bst_cv$error)

## The out-of-sample error estimate from 5-fold cross-validation with boosting is: 0.09288782

Random forest (RF)

Create a RF model using the randomForest package.

library(randomForest)
fit_rf <- randomForest(classe ~. , data = training, ntree=500, 
                       mtry = 6, nodesize = 5, importance = TRUE) 
pred_rf <- predict(fit_rf, validation)
confusionMatrix(pred_rf, validation$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    1    2    3    4    5
##          1 1674    2    0    0    0
##          2    0 1131   10    0    0
##          3    0    6 1016   15    0
##          4    0    0    0  948    4
##          5    0    0    0    1 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9935          
##                  95% CI : (0.9911, 0.9954)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9918          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: 1 Class: 2 Class: 3 Class: 4 Class: 5
## Sensitivity            1.0000   0.9930   0.9903   0.9834   0.9963
## Specificity            0.9995   0.9979   0.9957   0.9992   0.9998
## Pos Pred Value         0.9988   0.9912   0.9797   0.9958   0.9991
## Neg Pred Value         1.0000   0.9983   0.9979   0.9968   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1922   0.1726   0.1611   0.1832
## Detection Prevalence   0.2848   0.1939   0.1762   0.1618   0.1833
## Balanced Accuracy      0.9998   0.9954   0.9930   0.9913   0.9980

Report accuracy metric

cat("The accuracy of the RF model is:"
  , confusionMatrix(pred_rf, validation$classe)$overall[1])

## The accuracy of the RF model is: 0.9935429

cat("The Kappa statistic of the RF model is:"
  , confusionMatrix(pred_rf, validation$classe)$overall[2])

## The Kappa statistic of the RF model is: 0.991832

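Report validation error metric

cat("The validation error of the RF model is:"
  , (1 - confusionMatrix(pred_rf, validation$classe)$overall[1]))

## The validation error of the RF model is: 0.006457094

Note: This figure is one minus the accuracy on the held-out validation set, not the out-of-bag (OOB) error. A RF also estimates its error internally during the run, from the observations left out of each tree's bootstrap sample; in the randomForest package that OOB estimate is stored in fit_rf$err.rate. Because of this internal estimate, a RF does not strictly need cross validation or a separate test set to obtain a reasonable estimate of the test set error.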
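For reference, the out-of-bag estimate can be read from a fitted `randomForest` object. The sketch below fits a small forest on the built-in `iris` data (an assumption, chosen so the example is self-contained) and takes the last row of the `err.rate` matrix, which holds the cumulative OOB error after all trees have been grown.

```r
library(randomForest)
set.seed(58789)

# Small forest on a built-in data set, for illustration only.
fit <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate has one row per tree; the "OOB" column is the running
# out-of-bag misclassification rate. The last row covers the full forest.
oob_error <- fit$err.rate[fit$ntree, "OOB"]
print(oob_error)
```

Applied to the model in this report, the same expression with `fit_rf` would give an OOB figure to set beside the validation error.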

Conclusion

The RF model achieved the best performance, with a validation error rate of about 0.6% (accuracy 99.4%), ahead of boosting (accuracy 90.7%) and bagging (accuracy 70.7%).
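The comparison can be collected in one small table. The sketch below copies the accuracy and Kappa values from the three confusion matrix outputs and the cross-validation errors from the bagging and boosting sections; the data frame and its column names are illustrative. No cross-validation was run for the RF, so that entry is left missing.

```r
# Values copied from the outputs reported in this document.
results <- data.frame(
  model    = c("bagging", "boosting", "random forest"),
  accuracy = c(0.7067, 0.9074, 0.9935),
  kappa    = c(0.6273, 0.8827, 0.9918),
  cv_error = c(0.2752, 0.0929, NA)
)

# Validation error is one minus validation accuracy.
results$validation_error <- 1 - results$accuracy

# Rank the models from lowest to highest validation error.
ranked <- results[order(results$validation_error), ]
best <- ranked$model[1]
print(ranked)
```

For bagging and boosting, the cross-validation error and the validation error land close to each other, which suggests the two estimates are consistent.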

ML: Course Project

FJ Haran

2023-02-06

Executive Summary