Executive Summary

The goal of this project is to predict the manner in which participants performed weight-lifting exercises from accelerometer data. The outcome variable is classe, a five-level factor (A to E). Different machine learning models were considered, and Random Forest was selected because of its high accuracy: 99.64% on a held-out validation set, which corresponds to an estimated out-of-sample error of about 0.36%.


Loading Data

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")

Data Cleaning

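Remove columns in which more than 95% of the values are missing, then drop the first seven columns, which hold bookkeeping fields (row index, user name, timestamps and window markers) rather than sensor measurements. The same column selection is applied to the test set so that both data frames stay aligned.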

na_columns <- sapply(training, function(x) mean(is.na(x))) > 0.95

training <- training[, !na_columns]
testing <- testing[, !na_columns]

training <- training[, -(1:7)]
testing <- testing[, -(1:7)]
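
One caveat with the filter above: it only counts values that read.csv parsed as NA. The following is a minimal sketch on made-up data (not the project files) showing how blank cells or spreadsheet error strings can escape that count. Whether the project CSVs contain such cells is an assumption, and the na.strings values used here are illustrative.

```r
# Toy CSV: column b holds blanks and "#DIV/0!", column c is NA in every row.
csv_text <- "a,b,c
1,,NA
2,#DIV/0!,NA
3,,NA
4,0.5,NA"

# Default parsing: only the literal string NA is treated as missing,
# so the blanks in b are read as empty strings, not NA.
raw <- read.csv(text = csv_text)
frac_na_raw <- sapply(raw, function(x) mean(is.na(x)))

# Assumed alternative: also treat "" and "#DIV/0!" as missing values.
clean <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
frac_na_clean <- sapply(clean, function(x) mean(is.na(x)))

print(rbind(default = frac_na_raw, with_na_strings = frac_na_clean))
```

With the default settings column b is kept as a fully populated character column, so a threshold on mean(is.na(x)) would never remove it.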

Cross Validation

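We hold out 30% of the training data as a validation set and fit the model on the remaining 70%. createDataPartition samples within each level of classe, so both subsets keep roughly the same class proportions. Note that this is a single hold-out split rather than k-fold cross validation.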

training$classe <- as.factor(training$classe)

set.seed(12345)

inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)

trainData <- training[inTrain, ]
validationData <- training[-inTrain, ]
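
If k-fold cross validation is wanted in place of one split, caret's trainControl can supply it. The following is a minimal sketch on the built-in iris data, used here as a stand-in for the project data; the choice of 5 folds, ntree = 100 and mtry = 2 is an illustrative assumption, not the setting used in this report.

```r
library(caret)
library(randomForest)

set.seed(12345)

# 5-fold cross validation; each fold is held out once for scoring.
ctrl <- trainControl(method = "cv", number = 5)

# Fix mtry with a one-row tuning grid so only one model is evaluated per fold.
fit_cv <- train(Species ~ ., data = iris,
                method = "rf",
                trControl = ctrl,
                tuneGrid = data.frame(mtry = 2),
                ntree = 100)

# Per-fold accuracy and kappa, plus the average across folds.
print(fit_cv$resample)
print(fit_cv$results)
```

The per-fold figures give a sense of how much the accuracy estimate varies, which a single hold-out split cannot show.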

Building the Model

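A Random Forest model was fitted with the randomForest defaults, using every remaining column as a predictor of classe. The printed summary reports 500 trees, 9 variables tried at each split, and an out-of-bag (OOB) error estimate of 0.57%.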

model_rf <- randomForest(classe ~ ., data = trainData)

model_rf
## 
## Call:
##  randomForest(formula = classe ~ ., data = trainData) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 9
## 
##         OOB estimate of  error rate: 0.57%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3903    3    0    0    0 0.0007680492
## B   14 2640    4    0    0 0.0067720090
## C    0   18 2373    5    0 0.0095993322
## D    0    0   25 2225    2 0.0119893428
## E    0    0    1    6 2518 0.0027722772
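
The figures in this summary can be checked by hand. The sketch below re-enters the OOB confusion matrix printed above and recomputes the error rates. It also lists the predictor counts compatible with the reported 9 variables per split, assuming the classification default of floor(sqrt(p)) was used, as the call suggests.

```r
# OOB confusion matrix as printed above (rows = observed, columns = predicted).
oob_cm <- matrix(c(3903,    3,    0,    0,    0,
                     14, 2640,    4,    0,    0,
                      0,   18, 2373,    5,    0,
                      0,    0,   25, 2225,    2,
                      0,    0,    1,    6, 2518),
                 nrow = 5, byrow = TRUE,
                 dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Overall OOB error: share of off-diagonal counts.
oob_error <- 1 - sum(diag(oob_cm)) / sum(oob_cm)

# Per-class error: misclassified share within each observed class.
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

print(round(100 * oob_error, 2))
print(class_error)

# Predictor counts p for which floor(sqrt(p)) equals the reported value of 9.
p_range <- range(which(floor(sqrt(1:200)) == 9))
print(p_range)
```

Under that assumption the model was fitted on between 81 and 99 predictors, which is worth comparing against ncol(trainData) - 1.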

Prediction and Accuracy

predictions <- predict(model_rf, validationData)

confusionMatrix(predictions, validationData$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B    1 1137    3    0    0
##          C    0    1 1023   14    0
##          D    0    0    0  950    1
##          E    0    0    0    0 1081
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9964          
##                  95% CI : (0.9946, 0.9978)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9955          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9982   0.9971   0.9855   0.9991
## Specificity            0.9998   0.9992   0.9969   0.9998   1.0000
## Pos Pred Value         0.9994   0.9965   0.9855   0.9989   1.0000
## Neg Pred Value         0.9998   0.9996   0.9994   0.9972   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1932   0.1738   0.1614   0.1837
## Detection Prevalence   0.2845   0.1939   0.1764   0.1616   0.1837
## Balanced Accuracy      0.9996   0.9987   0.9970   0.9926   0.9995

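On the validation set the Random Forest model reached 99.64% accuracy (95% CI: 99.46% to 99.78%) with a kappa of 0.9955. The estimated out-of-sample error is therefore about 0.36%, in line with the OOB estimate of 0.57%. Class D is the weakest, with a sensitivity of 98.55%, mostly because 14 D cases were predicted as C.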
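
As a worked check, the headline statistics can be recomputed from the validation confusion matrix printed above. This sketch only re-enters the reported counts; it does not refit anything.

```r
# Validation confusion matrix as printed above (rows = prediction, columns = reference).
cm <- matrix(c(1673,    1,    0,    0,    0,
                  1, 1137,    3,    0,    0,
                  0,    1, 1023,   14,    0,
                  0,    0,    0,  950,    1,
                  0,    0,    0,    0, 1081),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference = LETTERS[1:5]))

# Accuracy is the share of cases on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)

# Estimated out-of-sample error is its complement.
oos_error <- 1 - accuracy

# Sensitivity per class: correct predictions divided by the reference column total.
sensitivity <- diag(cm) / colSums(cm)

print(round(accuracy, 4))
print(round(oos_error, 4))
print(round(sensitivity, 4))
```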


Variable Importance

varImpPlot(model_rf)

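The plot ranks the predictors by mean decrease in Gini impurity, the importance measure that randomForest records by default for classification. Variables at the top contribute most to separating the five classes.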
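
The same ranking can be read as numbers rather than from the plot. The following is a minimal sketch on the built-in iris data, used as a stand-in for the project data; the object names and ntree = 100 are illustrative assumptions.

```r
library(randomForest)

set.seed(12345)

# Small stand-in model; the report's model_rf could be passed to importance() the same way.
fit <- randomForest(Species ~ ., data = iris, ntree = 100)

# importance() returns a matrix; without importance = TRUE at fit time,
# a classification forest provides the MeanDecreaseGini column only.
imp <- importance(fit)

# Sort predictors from most to least important.
imp_sorted <- imp[order(imp[, "MeanDecreaseGini"], decreasing = TRUE), ,
                  drop = FALSE]
print(head(imp_sorted, 10))
```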


Final Prediction on Test Data

final_predictions <- predict(model_rf, testing)

final_predictions
##  [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [16] <NA> <NA> <NA> <NA> <NA>
## Levels: A B C D E

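All 20 predictions came back as NA, so this output cannot be submitted as it stands. A likely cause is that every row of testing has a missing value in at least one predictor the model uses, for example a column that is populated in the training file but empty in the test file. Those columns need to be identified and excluded from both data sets, and the model refitted, before predicting the 20 test cases.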
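
The all-NA output is consistent with how predict treats incomplete rows for a forest fitted through the formula interface: rows with a missing predictor are dropped and reported as NA. The following is a minimal sketch of that behaviour and of a column check, using the built-in iris data as a stand-in; the object names are illustrative.

```r
library(randomForest)

set.seed(12345)

# Stand-in model fitted through the formula interface, as in the report.
fit <- randomForest(Species ~ ., data = iris, ntree = 50)

# Three new rows, with one predictor value deliberately set to missing.
newdata <- iris[1:3, ]
newdata$Petal.Width[2] <- NA

# The incomplete row is skipped by predict and comes back as NA.
pred <- predict(fit, newdata)
print(pred)

# List the predictors that contain missing values in the new data.
predictors <- setdiff(names(newdata), "Species")
na_counts <- colSums(is.na(newdata[, predictors]))
print(na_counts[na_counts > 0])

# For the project data the analogous check would be:
#   names(which(colSums(is.na(testing)) > 0))
```

Any column this check flags in testing that is also among the model's predictors would explain the NA predictions.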


Conclusion

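The Random Forest model performed extremely well at predicting exercise quality on held-out data. Validation on 30% of the training set gave 99.64% accuracy, for an expected out-of-sample error of about 0.36%, and Random Forest was therefore selected as the final model for this project. The predictions for the 20 test cases were all NA, however, and that must be resolved before the results can be submitted.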