The goal of this project is to predict the manner in which participants performed weight lifting exercises from accelerometer data. The outcome variable is classe, a factor with five levels (A to E). Several machine learning models were considered; Random Forest was selected because of its high accuracy and low estimated out-of-sample error on held-out data.
library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
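Drop the columns in which more than 95% of the values are NA, then drop the first seven columns, which are treated here as unnecessary for prediction. The same column selection is applied to the test set so that both data frames stay aligned.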
na_columns <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[, !na_columns]
testing <- testing[, !na_columns]
training <- training[, -(1:7)]
testing <- testing[, -(1:7)]
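The 95% filter above only counts values that read.csv has already parsed as NA. As a minimal sketch on made-up data (the column names, the values, and the "#DIV/0!" token are illustrative assumptions, not taken from the project files), a column that mixes blanks with a non-numeric token is read as character, and its blanks are empty strings rather than NA unless they are listed in na.strings:

```r
# Toy CSV: column 'b' holds blanks plus one spreadsheet error token,
# column 'c' holds the literal string NA in every row.
csv_text <- "a,b,c\n1,,NA\n2,#DIV/0!,NA\n3,,NA\n4,,NA"

# Default parsing: only the string "NA" is treated as missing.
raw <- read.csv(text = csv_text)
na_share_raw <- sapply(raw, function(x) mean(is.na(x)))

# Explicit parsing: blanks and the error token are also treated as missing.
cleaned <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))
na_share_cleaned <- sapply(cleaned, function(x) mean(is.na(x)))

print(na_share_raw)      # 'b' is character, so its blanks are not NA
print(na_share_cleaned)  # 'b' is now entirely NA and would be filtered out
```

If the project files contain such columns, passing the same na.strings argument to both read.csv calls would let the existing 95% filter remove them.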
The training data is split into a 70% training set and a 30% validation set, stratified by classe. The seed is fixed so that the split is reproducible.
training$classe <- as.factor(training$classe)
set.seed(12345)
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainData <- training[inTrain, ]
validationData <- training[-inTrain, ]
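For a factor outcome, createDataPartition samples within each class so that the class proportions are preserved in both parts. The base R sketch below imitates that idea on a made-up outcome vector; it is an illustration of stratified sampling under that assumption, not caret's exact implementation.

```r
set.seed(12345)

# Made-up outcome with unequal class sizes (not the project data).
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class, then pool them.
idx_by_class <- split(seq_along(classe), classe)
in_train <- sort(unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}), use.names = FALSE))

# Class proportions in the two parts should match the full data.
train_share <- prop.table(table(classe[in_train]))
valid_share <- prop.table(table(classe[-in_train]))
print(train_share)
print(valid_share)
```

The same proportion check can be run on trainData$classe and validationData$classe to confirm that the real split is balanced.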
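A Random Forest model was fit on all remaining predictors with the randomForest defaults (500 trees, 9 variables tried at each split).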
model_rf <- randomForest(classe ~ ., data = trainData)
model_rf
##
## Call:
## randomForest(formula = classe ~ ., data = trainData)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 9
##
## OOB estimate of error rate: 0.57%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3903    3    0    0    0 0.0007680492
## B   14 2640    4    0    0 0.0067720090
## C    0   18 2373    5    0 0.0095993322
## D    0    0   25 2225    2 0.0119893428
## E    0    0    1    6 2518 0.0027722772
predictions <- predict(model_rf, validationData)
confusionMatrix(predictions, validationData$classe)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    1    0    0    0
##          B    1 1137    3    0    0
##          C    0    1 1023   14    0
##          D    0    0    0  950    1
##          E    0    0    0    0 1081
##
## Overall Statistics
##
## Accuracy : 0.9964
## 95% CI : (0.9946, 0.9978)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9955
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9982   0.9971   0.9855   0.9991
## Specificity            0.9998   0.9992   0.9969   0.9998   1.0000
## Pos Pred Value         0.9994   0.9965   0.9855   0.9989   1.0000
## Neg Pred Value         0.9998   0.9996   0.9994   0.9972   0.9998
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1932   0.1738   0.1614   0.1837
## Detection Prevalence   0.2845   0.1939   0.1764   0.1616   0.1837
## Balanced Accuracy      0.9996   0.9987   0.9970   0.9926   0.9995
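On the validation set the Random Forest model reaches an accuracy of 0.9964 (95% CI 0.9946 to 0.9978), which corresponds to an estimated out-of-sample error of about 0.36%. This is in line with the out-of-bag error estimate of 0.57% reported by the model.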
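The headline figures can be recomputed by hand from the validation confusion matrix printed above. The sketch below copies those counts into a matrix and derives the accuracy, the estimated out-of-sample error, and the per-class sensitivity; the variable names are illustrative.

```r
# Validation confusion matrix copied from the confusionMatrix output:
# rows are predictions, columns are the reference classes.
cm <- matrix(
  c(1673,    1,    0,   0,    0,
       1, 1137,    3,   0,    0,
       0,    1, 1023,  14,    0,
       0,    0,    0, 950,    1,
       0,    0,    0,   0, 1081),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy  <- sum(diag(cm)) / sum(cm)  # share of correct predictions
oos_error <- 1 - accuracy             # estimated out-of-sample error

# Sensitivity per class: correct predictions over reference totals.
sensitivity <- diag(cm) / colSums(cm)

print(round(c(accuracy = accuracy, oos_error = oos_error), 4))
print(round(sensitivity, 4))
```

Of the 5885 validation rows, 5864 are classified correctly, so the error estimate rests on 21 misclassifications, 14 of which are class D cases predicted as C.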
varImpPlot(model_rf)
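The plot ranks the predictors by importance. Because the model was fit without importance = TRUE, the only measure shown is the mean decrease in Gini impurity (MeanDecreaseGini).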
final_predictions <- predict(model_rf, testing)
final_predictions
## [1] <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
## [16] <NA> <NA> <NA> <NA> <NA>
## Levels: A B C D E
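These should be the predictions for the 20 test cases required for submission, but every value is NA, so the output is not usable as it stands. predict() returns NA for any row that has a missing value in a predictor, which suggests that at least one retained predictor is missing for every test row. A likely cause is that some columns contain blank strings rather than NA in pml-training.csv and therefore pass the 95% NA filter; the mtry of 9 reported by the model (the floor of the square root of the predictor count) implies 81 to 99 predictors, which is consistent with extra columns surviving the filter. Reading both files with na.strings set to cover blanks and spreadsheet error tokens before filtering, then refitting, should resolve this.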
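Before predicting on new data, it can help to check which model predictors contain missing values in that data. The sketch below uses a hypothetical helper, na_predictors, and made-up stand-ins for trainData and testing; none of these names or values come from the project itself.

```r
# Hypothetical helper: share of NA per model predictor in the new data,
# keeping only the predictors that have at least one NA.
na_predictors <- function(predictor_names, newdata) {
  share <- sapply(newdata[, predictor_names, drop = FALSE],
                  function(x) mean(is.na(x)))
  share[share > 0]
}

# Made-up stand-ins (assumed shapes, not the project data).
toy_train <- data.frame(
  roll   = c(1.2, 0.8, 1.1),
  kurt   = c("", "#DIV/0!", ""),  # character column: blanks are not NA
  classe = factor(c("A", "B", "A"))
)
toy_test <- data.frame(roll = c(0.9, 1.0), kurt = c(NA, NA))

predictor_names <- setdiff(names(toy_train), "classe")
problem_cols <- na_predictors(predictor_names, toy_test)
print(problem_cols)

# Rows a model could score: no NA in any predictor.
scorable <- complete.cases(toy_test[, predictor_names, drop = FALSE])
print(scorable)
```

With the real objects, the equivalent call would be na_predictors(setdiff(names(trainData), "classe"), testing); any column it reports is a candidate for removal from both data frames before refitting.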
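The Random Forest model performed extremely well at predicting exercise quality on the held-out validation data (accuracy 0.9964, estimated out-of-sample error about 0.36%). This estimate comes from a single 70/30 hold-out split, supported by the out-of-bag error of 0.57%, rather than from k-fold cross-validation. Random Forest was therefore selected as the final model for this project, although the NA predictions for the 20 test cases must be resolved before submission.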