1. Introduction

The goal of this project is to predict the manner in which participants performed a barbell lift exercise. The outcome to predict is the variable classe in the Weight Lifting Exercise Dataset. This outcome has five levels (A–E) representing one correct and four incorrect ways to perform the exercise.

Using the training data, I build a machine learning model to predict classe from sensor measurements recorded on the belt, forearm, arm and dumbbell. I also estimate the expected out-of-sample error using cross-validation and a separate validation set, and then apply the final model to 20 test cases.

2. Data Loading

# Load the packages used throughout the analysis
library(caret)          # data partitioning, model training, evaluation
library(ggplot2)        # plotting
library(randomForest)   # backend for caret's method = "rf"; provides varImpPlot()

trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl  <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

if (!file.exists("pml-training.csv")) download.file(trainUrl, "pml-training.csv", method="curl")
if (!file.exists("pml-testing.csv"))  download.file(testUrl, "pml-testing.csv", method="curl")

training <- read.csv("pml-training.csv", na.strings=c("NA", "", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings=c("NA", "", "#DIV/0!"))

dim(training)
## [1] 19622   160
dim(testing)
## [1]  20 160

The training dataset contains many predictor variables and the outcome classe. The test dataset contains the same predictors but without classe; it consists of 20 cases for which I must generate predictions.
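As a quick sanity check (a minimal sketch; it assumes the standard course files, in which the test file replaces classe with a problem_id column), the column sets of the two files can be compared:

# Columns present in one file but not the other; for the standard
# course files these should return "classe" and "problem_id" respectively
setdiff(names(training), names(testing))
setdiff(names(testing), names(training))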

3. Data Cleaning and Preprocessing

The raw data contain non-predictive variables (record index, user name, timestamps) and many variables that are almost entirely missing. These can degrade model performance and increase computation time, so I remove them.

# Remove the first seven columns (record index, user name, timestamps, window indicators)
training <- training[, -(1:7)]
testing  <- testing[,  -(1:7)]

# Remove variables that are >95% NA (computed on training; the same
# columns are dropped from testing to keep the two datasets aligned)
naFraction <- sapply(training, function(x) mean(is.na(x)))
training <- training[, naFraction < 0.95]
testing  <- testing[,  naFraction < 0.95]

# Ensure the outcome is a factor, as required for classification
training$classe <- as.factor(training$classe)

dim(training)
## [1] 19622    53
dim(testing)
## [1] 20 53
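As an optional extra check, not strictly needed for the pipeline above, caret's nearZeroVar can confirm that no near-constant predictors survived the NA filter:

# Indices of near-zero-variance predictors; an empty result means
# every remaining predictor carries usable variation
nearZeroVar(training)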

4. Partitioning into Training and Validation Sets

set.seed(12345)
# Stratified 70/30 split on classe: 70% for model training, 30% held out for validation
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainSet <- training[inTrain, ]
validSet <- training[-inTrain, ]

dim(trainSet)
## [1] 13737    53
dim(validSet)
## [1] 5885   53
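Because createDataPartition samples within each level of the outcome, the class proportions should be nearly identical in the two sets; a quick check:

# Compare class proportions across the split (the two rows should match closely)
round(rbind(
  train = prop.table(table(trainSet$classe)),
  valid = prop.table(table(validSet$classe))
), 3)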

5. Exploratory Analysis

I first look at the distribution of the outcome classe in the training set.

ggplot(trainSet, aes(x = classe)) +
  geom_bar() +
  ggtitle("Distribution of Classe Variable")

Class A (the correct execution) is the most frequent level, at roughly 28% of observations; the four incorrect-execution classes are roughly balanced at 16–19% each.

6. Model Building

I consider two classification algorithms: a decision tree (CART, via rpart) and a random forest. Both are trained with 5-fold cross-validation, configured once through caret's trainControl:

ctrl <- trainControl(method = "cv", number = 5)

6.1 Decision Tree Model

set.seed(12345)
rpartModel <- train(
  classe ~ .,
  data = trainSet,
  method = "rpart",
  trControl = ctrl
)
rpartModel
## CART 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10990, 10989, 10991, 10988 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy  Kappa     
##   0.03458448  0.517724  0.37929136
##   0.06092971  0.418492  0.21205591
##   0.11595972  0.314708  0.04656203
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03458448.
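For interpretability, the selected tree can also be drawn. This is an optional sketch that assumes the rpart.plot package is installed:

# Plot the final CART tree chosen by cross-validation
library(rpart.plot)
rpart.plot(rpartModel$finalModel)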

6.2 Random Forest Model

set.seed(12345)
rfModel <- train(
  classe ~ .,
  data = trainSet,
  method = "rf",
  trControl = ctrl,
  ntree = 100
)
rfModel
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10990, 10990, 10989, 10991, 10988 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9911913  0.9888565
##   27    0.9911915  0.9888568
##   52    0.9830381  0.9785423
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
varImpPlot(rfModel$finalModel, main="Variable Importance")
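The same information is available numerically through caret's varImp, which is convenient for listing the most influential predictors rather than reading them off the plot:

# Scaled variable importance for the random forest (top 20 by default)
varImp(rfModel)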

7. Model Evaluation on Validation Data

I apply both models to the held-out validation set and compare their confusion matrices.

7.1 Decision Tree

pred_rpart <- predict(rpartModel, newdata=validSet)
conf_rpart <- confusionMatrix(pred_rpart, validSet$classe)
conf_rpart
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1525  484  499  423  153
##          B   29  385   37  187  159
##          C  116  270  490  354  289
##          D    0    0    0    0    0
##          E    4    0    0    0  481
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4895          
##                  95% CI : (0.4767, 0.5024)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3324          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9110  0.33802  0.47758   0.0000  0.44455
## Specificity            0.6298  0.91319  0.78823   1.0000  0.99917
## Pos Pred Value         0.4945  0.48306  0.32258      NaN  0.99175
## Neg Pred Value         0.9468  0.85181  0.87723   0.8362  0.88870
## Prevalence             0.2845  0.19354  0.17434   0.1638  0.18386
## Detection Rate         0.2591  0.06542  0.08326   0.0000  0.08173
## Detection Prevalence   0.5240  0.13543  0.25811   0.0000  0.08241
## Balanced Accuracy      0.7704  0.62560  0.63291   0.5000  0.72186
The decision tree performs poorly: validation accuracy is only about 49% (Kappa 0.33), and it never predicts class D.

7.2 Random Forest

pred_rf <- predict(rfModel, newdata=validSet)
conf_rf <- confusionMatrix(pred_rf, validSet$classe)
conf_rf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    7    0    0    0
##          B    1 1129    5    0    0
##          C    0    3 1018    7    3
##          D    0    0    3  957    1
##          E    0    0    0    0 1078
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9949          
##                  95% CI : (0.9927, 0.9966)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9936          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9912   0.9922   0.9927   0.9963
## Specificity            0.9983   0.9987   0.9973   0.9992   1.0000
## Pos Pred Value         0.9958   0.9947   0.9874   0.9958   1.0000
## Neg Pred Value         0.9998   0.9979   0.9984   0.9986   0.9992
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1918   0.1730   0.1626   0.1832
## Detection Prevalence   0.2855   0.1929   0.1752   0.1633   0.1832
## Balanced Accuracy      0.9989   0.9950   0.9948   0.9960   0.9982
rf_accuracy <- conf_rf$overall["Accuracy"]
rf_accuracy
##  Accuracy 
## 0.9949023
rf_oose <- 1 - rf_accuracy  # estimated out-of-sample error
rf_oose
##    Accuracy 
## 0.005097706

On the validation set the random forest reaches 99.49% accuracy, so the expected out-of-sample error is about 0.51%.
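To summarize the comparison, the two validation accuracies can be placed side by side, computed from the confusion matrices above:

# Validation accuracy of both models
data.frame(
  model    = c("Decision tree", "Random forest"),
  accuracy = c(conf_rpart$overall["Accuracy"],
               conf_rf$overall["Accuracy"])
)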

8. Final Model and Test Predictions

To use all available data, I retrain the random forest model on the full cleaned training data.

set.seed(12345)
finalModel <- train(
  classe ~ .,
  data = training,
  method = "rf",
  trControl = ctrl,
  ntree = 100
)
finalModel
## Random Forest 
## 
## 19622 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 15698, 15698, 15697, 15698, 15697 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9933238  0.9915542
##   27    0.9937825  0.9921350
##   52    0.9872591  0.9838807
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
finalPred <- predict(finalModel, newdata=testing)
finalPred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# Write one prediction per file, as required by the course submission format
pml_write_files <- function(x) {
  n <- length(x)
  for (i in 1:n) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}

pml_write_files(finalPred)
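Because retraining the forest takes a few minutes, it can be convenient to persist the fitted model; a minimal sketch using base R serialization (the file name finalModel.rds is arbitrary):

# Save the final model so predictions can be reproduced without retraining
saveRDS(finalModel, "finalModel.rds")
# finalModel <- readRDS("finalModel.rds")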

9. Conclusion

I built and evaluated predictive models to classify the quality of barbell lifting exercises using accelerometer data from multiple sensors. After cleaning the data and removing variables with many missing values, I split the data into training and validation sets and compared a decision tree with a random forest. The random forest achieved much higher accuracy and a low estimated out-of-sample error, so it was selected as the final model and used to generate predictions for the 20 test cases required by the course project.