Practical Machine Learning - Prediction Assignment

Overview

This project uses accelerometer data from the belt, forearm, arm, and dumbbell of 6 participants to predict how well they performed barbell lifts. The goal is to predict the classe variable (A, B, C, D, or E) using a Random Forest model. Out-of-sample error is estimated on a held-out validation set, and 3-fold cross validation is used during training to tune the model.

Loading Libraries and Data

library(caret)
library(randomForest)
library(ggplot2)

# Read the local CSV files; treat "NA", empty strings, and "#DIV/0!" as missing
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing  <- read.csv("pml-testing.csv",  na.strings = c("NA", "", "#DIV/0!"))

cat("Training dimensions:", dim(training), "\n")

## Training dimensions: 19622 160

cat("Testing dimensions:", dim(testing), "\n")

## Testing dimensions: 20 160

Data Cleaning

# Remove first 7 columns (ID, name, timestamps - not useful for prediction)
training <- training[, -c(1:7)]
testing  <- testing[,  -c(1:7)]

# Remove columns with more than 60% NA values
cleanCols <- colSums(is.na(training)) / nrow(training) < 0.60
training  <- training[, cleanCols]
testing   <- testing[,  cleanCols]

# Remove near-zero variance columns
nzv <- nearZeroVar(training)
if(length(nzv) > 0){
  training <- training[, -nzv]
  testing  <- testing[,  -nzv]
}

# Make sure classe is a factor
training$classe <- as.factor(training$classe)

cat("Cleaned training dimensions:", dim(training), "\n")

## Cleaned training dimensions: 19622 53

cat("Remaining columns:", ncol(training), "\n")

## Remaining columns: 53
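
The missing-value filter above can be illustrated on a small data frame. This is only a sketch: the data frame toy and its columns are hypothetical and are not taken from the assignment data. It shows that the fraction of NA values per column drives which columns survive the 60% threshold.

```r
# Toy data frame with hypothetical values (not the assignment data)
toy <- data.frame(
  roll   = c(1.2, 1.4, 1.1, 1.3, 1.5),
  sparse = c(NA, NA, NA, NA, 0.7),    # 80% missing
  partly = c(0.1, NA, 0.3, 0.2, NA),  # 40% missing
  classe = c("A", "B", "A", "C", "B")
)

# Fraction of missing values per column; colMeans(is.na(x)) is
# equivalent to colSums(is.na(x)) / nrow(x)
naFrac <- colMeans(is.na(toy))

# Keep only the columns that are less than 60% missing
keep     <- naFrac < 0.60
toyClean <- toy[, keep, drop = FALSE]

print(round(naFrac, 2))
print(names(toyClean))
```

Note that the report applies the same logical mask to the testing data frame, which relies on both files having their columns in the same order.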

Exploratory Data Analysis

ggplot(training, aes(x = classe, fill = classe)) +
  geom_bar() +
  scale_fill_manual(values = c("#2196F3","#4CAF50","#FF9800","#E91E63","#9C27B0")) +
  labs(title = "Distribution of Exercise Classes",
       x = "Class", y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Figure 1: Distribution of exercise classes

Class A corresponds to correct execution of the exercise. Classes B through E each correspond to a common mistake.

Data Splitting for Validation

set.seed(12345)

# 70% training, 30% validation
inTrain  <- createDataPartition(training$classe, p = 0.70, list = FALSE)
trainSet <- training[inTrain, ]
validSet <- training[-inTrain, ]

cat("Training set:", dim(trainSet), "\n")

## Training set: 13737 53

cat("Validation set:", dim(validSet), "\n")

## Validation set: 5885 53
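
For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly the same in both parts of the split. The base R sketch below imitates that idea on hypothetical labels. It is not caret's implementation, and the names labels and inTrainToy are made up for illustration.

```r
set.seed(12345)

# Hypothetical labels with unequal class sizes (not the assignment data)
labels <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idxByClass <- split(seq_along(labels), labels)
inTrainToy <- unlist(lapply(idxByClass, function(idx) {
  idx[sample.int(length(idx), size = round(0.70 * length(idx)))]
}), use.names = FALSE)

trainLabels <- labels[inTrainToy]
validLabels <- labels[-inTrainToy]

# Class proportions are preserved in both parts
print(prop.table(table(trainLabels)))
print(prop.table(table(validLabels)))
```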

Model Building — Random Forest

Random Forest was selected because:

It handles high-dimensional data well
It is robust to outliers and noise
It provides built-in feature importance
It typically achieves high accuracy for classification

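To keep computation manageable in the sandbox environment, the model is trained on a random sample of 3000 observations drawn from the 13737-row training set, with 3-fold cross validation and 50 trees. Training on the full set with more trees would likely improve accuracy at a higher computational cost.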

set.seed(12345)

# Sample to reduce memory usage in sandbox
smallTrain <- trainSet[sample(nrow(trainSet), 3000), ]

# 3-fold cross validation
control <- trainControl(method  = "cv",
                        number  = 3,
                        verboseIter = FALSE)

# Train Random Forest
modelRF <- train(classe ~ .,
                 data      = smallTrain,
                 method    = "rf",
                 trControl = control,
                 ntree     = 50)

print(modelRF$finalModel)

## 
## Call:
##  randomForest(x = x, y = y, ntree = 50, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 50
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 4.8%
## Confusion matrix:
##     A   B   C   D   E class.error
## A 887   2   2   2   1 0.007829978
## B  20 536  16   8   4 0.082191781
## C   1  19 482  10   1 0.060428850
## D   1   4  24 444   0 0.061310782
## E   3  10   7   9 507 0.054104478
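
The OOB error rate and the class.error column can be recomputed from the confusion matrix printed above, where rows are the actual classes and columns are the OOB predictions. The sketch below re-enters the printed counts by hand; the variable names are chosen for illustration only.

```r
# OOB confusion matrix as printed above (rows = actual, columns = predicted)
oob <- matrix(c(887,   2,   2,   2,   1,
                 20, 536,  16,   8,   4,
                  1,  19, 482,  10,   1,
                  1,   4,  24, 444,   0,
                  3,  10,   7,   9, 507),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each actual class that was misclassified
classError <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all off-diagonal counts
oobError <- 1 - sum(diag(oob)) / sum(oob)

print(round(classError, 4))
cat("Rows used:", sum(oob), "\n")
cat("OOB error:", round(oobError * 100, 2), "%\n")
```

The counts sum to 3000, matching the size of the training sample, and 144 misclassified rows out of 3000 give the 4.8% shown in the model output.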

Model Evaluation

# Predict on full validation set
predRF <- predict(modelRF, newdata = validSet)

# Confusion matrix
confMat <- confusionMatrix(predRF, validSet$classe)
print(confMat)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1659   53    0    0    0
##          B    4 1058   42    1   11
##          C    7   16  971   35   15
##          D    4    5    8  925   11
##          E    0    7    5    3 1045
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9614          
##                  95% CI : (0.9562, 0.9662)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9512          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9910   0.9289   0.9464   0.9595   0.9658
## Specificity            0.9874   0.9878   0.9850   0.9943   0.9969
## Pos Pred Value         0.9690   0.9480   0.9301   0.9706   0.9858
## Neg Pred Value         0.9964   0.9830   0.9886   0.9921   0.9923
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2819   0.1798   0.1650   0.1572   0.1776
## Detection Prevalence   0.2909   0.1896   0.1774   0.1619   0.1801
## Balanced Accuracy      0.9892   0.9583   0.9657   0.9769   0.9813

accuracy <- as.numeric(confMat$overall["Accuracy"])
oosError <- 1 - accuracy

cat("Validation Accuracy:", round(accuracy * 100, 2), "%\n")

## Validation Accuracy: 96.14 %

cat("Expected Out-of-Sample Error:", round(oosError * 100, 2), "%\n")

## Expected Out-of-Sample Error: 3.86 %
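
The headline statistics can be checked by hand from the validation confusion matrix. The sketch below re-enters the printed counts (rows are predictions, columns are reference classes) and recomputes accuracy, Cohen's kappa, per-class sensitivity, and positive predictive value. The variable names are illustrative, and the formulas are the standard textbook definitions rather than a copy of caret's internals.

```r
# Validation confusion matrix as printed above
# (rows = prediction, columns = reference)
cm <- matrix(c(1659,   53,    0,    0,    0,
                  4, 1058,   42,    1,   11,
                  7,   16,  971,   35,   15,
                  4,    5,    8,  925,   11,
                  0,    7,    5,    3, 1045),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5]))

n   <- sum(cm)
acc <- sum(diag(cm)) / n                 # observed agreement

# Per-class rates
sensitivity <- diag(cm) / colSums(cm)    # correct / all truly in class
ppv         <- diag(cm) / rowSums(cm)    # correct / all predicted as class

# Cohen's kappa: agreement corrected for chance agreement
pe        <- sum(rowSums(cm) * colSums(cm)) / n^2
kappaStat <- (acc - pe) / (1 - pe)

cat("Accuracy:", round(acc, 4), " Kappa:", round(kappaStat, 4), "\n")
print(round(rbind(sensitivity, ppv), 4))
```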

Variable Importance

varImpPlot(modelRF$finalModel,
           n.var = 15,
           main  = "Top 15 Most Important Variables",
           col   = "steelblue")

Figure 2: Top 15 most important variables

Expected Out-of-Sample Error Discussion

The out-of-sample error is estimated using the validation set (30% of training data held out from model training):

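Model: Random Forest with 50 trees, trained on a 3000-row sample of the training set
Cross Validation: 3-fold CV on the 3000-row sample, used to select mtry (27)
Validation Accuracy: 96.14% (95% CI: 95.62% to 96.62%)
Expected Out-of-Sample Error: 3.86%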

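The 3-fold cross validation is used inside train() to choose the tuning parameter mtry; on its own it does not guarantee that the model is free of overfitting. The held-out validation set was not used for training or tuning, so its 3.86% error rate serves as the estimate of out-of-sample error, and it is of similar magnitude to the forest's OOB estimate of 4.8%. Because the validation rows come from the same six participants as the training rows, the estimate may be optimistic for new participants.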
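
The uncertainty around the error estimate can be made explicit with a confidence interval. The sketch below assumes that the interval reported by confusionMatrix is an exact binomial (Clopper-Pearson) interval; that is an assumption about caret, not something the report states. It uses base R's binom.test on the counts from the validation confusion matrix.

```r
# Counts taken from the diagonal and total of the validation confusion matrix
nCorrect <- 1659 + 1058 + 971 + 925 + 1045
nTotal   <- 5885

# Exact (Clopper-Pearson) 95% interval for accuracy
accCI <- as.numeric(binom.test(nCorrect, nTotal)$conf.int)

# Flip it to obtain an interval for the error rate
errCI <- rev(1 - accCI)

cat("Accuracy 95% CI:", round(accCI, 4), "\n")
cat("Error    95% CI:", round(errCI, 4), "\n")
```

Taking the printed accuracy interval of (0.9562, 0.9662) at face value, the corresponding interval for the out-of-sample error is 3.38% to 4.38%.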

Predicting 20 Test Cases

# Final predictions on 20 test cases
finalPredictions <- predict(modelRF, newdata = testing)

# Display results
results <- data.frame(
  Problem_ID  = 1:20,
  Prediction  = as.character(finalPredictions)
)

print(results)

##    Problem_ID Prediction
## 1           1          B
## 2           2          A
## 3           3          B
## 4           4          A
## 5           5          A
## 6           6          E
## 7           7          D
## 8           8          C
## 9           9          A
## 10         10          A
## 11         11          B
## 12         12          C
## 13         13          B
## 14         14          A
## 15         15          E
## 16         16          E
## 17         17          A
## 18         18          D
## 19         19          A
## 20         20          B
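
The validation accuracy gives a rough idea of how many of these 20 predictions to expect to be correct. The sketch below assumes that each test case is classified correctly with probability equal to the validation accuracy and that errors are independent across cases. Both are simplifying assumptions, not results from the report.

```r
acc    <- 0.9614  # validation accuracy reported above
nCases <- 20

# Expected number of correct predictions
expectedCorrect <- nCases * acc

# Probability that all 20 are correct, and that at least 19 are correct
pAllCorrect <- acc^nCases
pAtLeast19  <- pbinom(18, size = nCases, prob = acc, lower.tail = FALSE)

cat("Expected correct:", round(expectedCorrect, 2), "of", nCases, "\n")
cat("P(all 20 correct):", round(pAllCorrect, 3), "\n")
cat("P(at least 19 correct):", round(pAtLeast19, 3), "\n")
```

Under these assumptions about 19 of the 20 predictions are expected to be correct, and the chance of a perfect score is a little under one half.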

Conclusion

A Random Forest classifier was built to predict exercise quality from accelerometer measurements. Key findings:

The model achieved 96.14% accuracy on the validation set
Expected out-of-sample error is 3.86%
The most important predictors were roll_belt, yaw_belt, and pitch_forearm
3-fold cross validation was used to tune mtry, and a 30% hold-out set was used to estimate out-of-sample error

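The model produces a prediction for each of the 20 test cases for the quiz submission. The report does not check these predictions against true labels, so their accuracy is not verified here.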