This project uses accelerometer data from the belt, forearm, arm, and dumbbell of 6 participants to predict how well they performed barbell lifts. The goal is to predict the classe variable (A, B, C, D, or E) using a Random Forest model. Cross-validation is used to tune the model during training, and a held-out validation set is used to estimate out-of-sample error.
library(caret)
library(randomForest)
library(ggplot2)
# Use local files already in sandbox
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
cat("Training dimensions:", dim(training), "\n")
## Training dimensions: 19622 160
cat("Testing dimensions:", dim(testing), "\n")
## Testing dimensions: 20 160
# Remove first 7 columns (ID, name, timestamps - not useful for prediction)
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
# Remove columns with more than 60% NA values
cleanCols <- colSums(is.na(training)) / nrow(training) < 0.60
training <- training[, cleanCols]
testing <- testing[, cleanCols]
# Remove near-zero variance columns
nzv <- nearZeroVar(training)
if (length(nzv) > 0) {
  training <- training[, -nzv]
  testing <- testing[, -nzv]
}
# Make sure classe is a factor
training$classe <- as.factor(training$classe)
cat("Cleaned training dimensions:", dim(training), "\n")
## Cleaned training dimensions: 19622 53
cat("Remaining columns:", ncol(training), "\n")
## Remaining columns: 53
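The missing-value filter used above can be illustrated on a small example. This is a minimal sketch in base R; the data frame `toy` and its column names are hypothetical and do not come from pml-training.csv.

```r
# Toy data frame with hypothetical columns (not from the real data set)
toy <- data.frame(
  sensor_a = c(1.4, 1.5, 1.3, 1.6, 1.2),  # no missing values
  sensor_b = c(NA, NA, NA, NA, 0.7),      # 80% missing
  sensor_c = c(22, NA, 25, 24, 23)        # 20% missing
)

# Fraction of missing values in each column
naFraction <- colSums(is.na(toy)) / nrow(toy)

# Keep columns with less than 60% missing, the same rule as cleanCols
keep <- naFraction < 0.60
toyClean <- toy[, keep, drop = FALSE]

print(round(naFraction, 2))
print(names(toyClean))
```

Only `sensor_b` crosses the 60% threshold, so it is the only column dropped. Applying the same logical vector to both training and testing, as the report does, keeps the two data sets aligned column by column.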
ggplot(training, aes(x = classe, fill = classe)) +
  geom_bar() +
  scale_fill_manual(values = c("#2196F3","#4CAF50","#FF9800","#E91E63","#9C27B0")) +
  labs(title = "Distribution of Exercise Classes",
       x = "Class", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")
Figure 1: Distribution of exercise classes
Class A corresponds to correct execution. Classes B-E are common mistakes.
set.seed(12345)
# 70% training, 30% validation
inTrain <- createDataPartition(training$classe, p = 0.70, list = FALSE)
trainSet <- training[inTrain, ]
validSet <- training[-inTrain, ]
cat("Training set:", dim(trainSet), "\n")
## Training set: 13737 53
cat("Validation set:", dim(validSet), "\n")
## Validation set: 5885 53
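createDataPartition samples within each level of the outcome, so the class proportions are roughly the same in both parts of the split. The sketch below shows that idea in base R on hypothetical labels; it is an illustration of stratified sampling, not caret's actual implementation.

```r
set.seed(12345)

# Hypothetical labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idxByClass <- split(seq_along(classe), classe)
inTrain <- unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE)

trainLabels <- classe[inTrain]
validLabels <- classe[-inTrain]

# Class proportions are preserved in both parts
print(prop.table(table(trainLabels)))
print(prop.table(table(validLabels)))
```

With a plain random sample of rows, a rare class could end up under-represented in the validation set by chance; sampling per class avoids that.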
Random Forest was selected because:
To keep computation manageable in the sandbox environment, the model is trained on a random sample of 3,000 observations from the 13,737-row training set, using 3-fold cross-validation and 50 trees.
set.seed(12345)
# Sample to reduce memory usage in sandbox
smallTrain <- trainSet[sample(nrow(trainSet), 3000), ]
# 3-fold cross validation
control <- trainControl(method = "cv",
                        number = 3,
                        verboseIter = FALSE)
# Train Random Forest
modelRF <- train(classe ~ .,
                 data = smallTrain,
                 method = "rf",
                 trControl = control,
                 ntree = 50)
print(modelRF$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 50, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 50
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 4.8%
## Confusion matrix:
##     A   B   C   D   E class.error
## A 887   2   2   2   1 0.007829978
## B  20 536  16   8   4 0.082191781
## C   1  19 482  10   1 0.060428850
## D   1   4  24 444   0 0.061310782
## E   3  10   7   9 507 0.054104478
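The OOB error rate and the per-class errors in this output follow directly from the confusion matrix. As a check, the sketch below re-enters the counts printed above by hand (rows are the actual classes) and recomputes both; the variable names are illustrative.

```r
# OOB confusion matrix copied from print(modelRF$finalModel); rows = actual class
oobMat <- matrix(
  c(887,   2,   2,   2,   1,
     20, 536,  16,   8,   4,
      1,  19, 482,  10,   1,
      1,   4,  24, 444,   0,
      3,  10,   7,   9, 507),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Per-class error: share of each actual class that was misclassified
classError <- 1 - diag(oobMat) / rowSums(oobMat)

# Overall OOB error: share of all 3000 sampled rows that were misclassified
oobError <- 1 - sum(diag(oobMat)) / sum(oobMat)

print(round(classError, 4))
cat("OOB error:", round(oobError * 100, 2), "%\n")
```

The 144 off-diagonal observations out of 3,000 give the 4.8% reported by randomForest.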
# Predict on full validation set
predRF <- predict(modelRF, newdata = validSet)
# Confusion matrix
confMat <- confusionMatrix(predRF, validSet$classe)
print(confMat)
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1659   53    0    0    0
##          B    4 1058   42    1   11
##          C    7   16  971   35   15
##          D    4    5    8  925   11
##          E    0    7    5    3 1045
##
## Overall Statistics
##
## Accuracy : 0.9614
## 95% CI : (0.9562, 0.9662)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9512
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9910   0.9289   0.9464   0.9595   0.9658
## Specificity            0.9874   0.9878   0.9850   0.9943   0.9969
## Pos Pred Value         0.9690   0.9480   0.9301   0.9706   0.9858
## Neg Pred Value         0.9964   0.9830   0.9886   0.9921   0.9923
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2819   0.1798   0.1650   0.1572   0.1776
## Detection Prevalence   0.2909   0.1896   0.1774   0.1619   0.1801
## Balanced Accuracy      0.9892   0.9583   0.9657   0.9769   0.9813
accuracy <- as.numeric(confMat$overall["Accuracy"])
oosError <- 1 - accuracy
cat("Validation Accuracy:", round(accuracy * 100, 2), "%\n")
## Validation Accuracy: 96.14 %
cat("Expected Out-of-Sample Error:", round(oosError * 100, 2), "%\n")
## Expected Out-of-Sample Error: 3.86 %
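The accuracy and out-of-sample error above can be reproduced from the validation confusion matrix alone. The sketch below re-enters the printed counts by hand (rows are predictions, columns are reference classes). The interval uses binom.test, which to my understanding is how caret obtains the reported 95% CI; treat that detail as an assumption.

```r
# Validation confusion matrix copied from print(confMat)
validMat <- matrix(
  c(1659,   53,   0,   0,    0,
       4, 1058,  42,   1,   11,
       7,   16, 971,  35,   15,
       4,    5,   8, 925,   11,
       0,    7,   5,   3, 1045),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

correct <- sum(diag(validMat))  # correctly classified rows
total <- sum(validMat)          # all validation rows

accuracy <- correct / total
oosError <- 1 - accuracy

# Sensitivity per class: correct predictions divided by the reference total
sensitivity <- diag(validMat) / colSums(validMat)

# Exact binomial 95% interval for the accuracy
ci <- binom.test(correct, total)$conf.int

cat("Accuracy:", round(accuracy, 4), "\n")
cat("Out-of-sample error:", round(oosError * 100, 2), "%\n")
print(round(sensitivity, 4))
print(round(ci, 4))
```

Of the 5,885 validation rows, 5,658 lie on the diagonal, which gives the accuracy of 0.9614 and the error of 3.86% shown in the output.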
varImpPlot(modelRF$finalModel,
           n.var = 15,
           main = "Top 15 Most Important Variables",
           col = "steelblue")
Figure 2: Top 15 most important variables
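The out-of-sample error is estimated on the validation set (the 30% of the training data held out from model fitting): validation accuracy is 96.14% (95% CI: 95.62% to 96.62%), so the expected out-of-sample error is about 3.86%.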
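The 3-fold cross-validation inside training is used to tune the model, which helps guard against overfitting but does not rule it out. Because the validation set played no part in training or tuning, its error rate is an approximately unbiased estimate of performance on new data collected under the same conditions. It is also close to the model's own OOB error estimate of 4.8%.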
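To make the cross-validation step concrete, here is a minimal 3-fold sketch in base R. It is not what caret runs internally: the built-in iris data and a simple nearest-centroid classifier (the hypothetical helper `nearestCentroid`) stand in for the accelerometer data and the Random Forest so that the example runs without extra packages. caret's train additionally repeats this loop for each candidate tuning value and keeps the best one.

```r
set.seed(12345)

k <- 3
features <- as.matrix(iris[, 1:4])
labels <- iris$Species

# Randomly assign each row to one of k folds of equal size
foldId <- sample(rep(seq_len(k), length.out = nrow(features)))

# Stand-in classifier (assumption): predict the class whose mean is nearest
nearestCentroid <- function(trainX, trainY, testX) {
  centroids <- t(sapply(levels(trainY), function(cl) {
    colMeans(trainX[trainY == cl, , drop = FALSE])
  }))
  # Squared distance from every test row to every class centroid
  dists <- sapply(rownames(centroids), function(cl) {
    rowSums(sweep(testX, 2, centroids[cl, ])^2)
  })
  nearest <- max.col(-dists, ties.method = "first")
  factor(colnames(dists)[nearest], levels = levels(trainY))
}

# Each fold is held out once while the model is fit on the other folds
foldAccuracy <- sapply(seq_len(k), function(fold) {
  isTest <- foldId == fold
  pred <- nearestCentroid(features[!isTest, , drop = FALSE],
                          labels[!isTest],
                          features[isTest, , drop = FALSE])
  mean(pred == labels[isTest])
})

cat("Per-fold accuracy:", round(foldAccuracy, 3), "\n")
cat("Cross-validated accuracy:", round(mean(foldAccuracy), 3), "\n")
```

Every row is used for testing exactly once, so the averaged accuracy reflects performance on data the fitted model did not see in that fold.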
# Final predictions on 20 test cases
finalPredictions <- predict(modelRF, newdata = testing)
# Display results
results <- data.frame(
  Problem_ID = 1:20,
  Prediction = as.character(finalPredictions)
)
print(results)
## Problem_ID Prediction
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 C
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 D
## 19 19 A
## 20 20 B
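A Random Forest classifier was built to predict exercise quality from accelerometer measurements. Key findings: after cleaning, 52 predictors remained (53 columns including classe); a model with 50 trees, trained on a 3,000-row sample with 3-fold cross-validation, had an OOB error estimate of 4.8%; and on the 5,885-row validation set it reached 96.14% accuracy, for an expected out-of-sample error of 3.86%.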
The model was then used to generate predictions for the 20 test cases for the quiz submission.