In this project, we aim to predict the manner in which participants
performed barbell lifts using data collected from accelerometers on
their belt, forearm, arm, and dumbbell. The target variable is `classe`, which has five levels (A-E), each corresponding to a particular style of exercise execution.
library(caret)    # nearZeroVar, createDataPartition, confusionMatrix
library(ggplot2)  # confusion matrix heat map
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
# Remove near zero variance predictors
nzv <- nearZeroVar(training)
training <- training[, -nzv]
# Remove columns with >95% NA values
na_counts <- colSums(is.na(training))
training <- training[, which(na_counts / nrow(training) < 0.95)]
# Remove identification and timestamp columns (still the first five columns
# after the filters above); the testing set keeps its remaining columns,
# since predict() uses only the predictors the model was trained on
training <- training[, -(1:5)]
testing <- testing[, -(1:5)]
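As a quick sanity check, the cleaned training set should now contain `classe` plus the 53 predictors reported in the model summary below:
# Sanity check: cleaned data should hold classe plus 53 predictor columns
dim(training)
sum(colnames(training) != "classe")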
Partition the training data into 70% for model training and 30% for validation.
set.seed(12345)
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train_set <- training[inTrain, ]
valid_set <- training[-inTrain, ]
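createDataPartition() samples within each level of `classe`, so the class proportions in both splits should closely mirror the full training set:
# Verify that stratified splitting preserved the class balance
round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(valid_set$classe)), 3)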
To avoid long computation during knitting, we pre-trained the Random Forest model and saved it as `model_rf.rds`. Here, we load the saved model.
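For reference, here is a minimal sketch of the offline training step that could produce `model_rf.rds`, assuming caret's `train()` with the 5-fold cross-validation shown in the printed summary below:
# Run once, offline: fit a random forest with 5-fold CV and cache it to disk
set.seed(12345)
ctrl <- trainControl(method = "cv", number = 5)
model_rf <- train(classe ~ ., data = train_set, method = "rf",
                  trControl = ctrl)
saveRDS(model_rf, "model_rf.rds")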
model_rf <- readRDS("model_rf.rds")
model_rf
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10989, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9940308 0.9924493
## 27 0.9970882 0.9963168
## 53 0.9943947 0.9929095
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Evaluate model performance on the validation set.
# Predict on validation set
pred_valid <- predict(model_rf, newdata = valid_set)
# Align factor levels of prediction and validation labels to those in the original training set
train_levels <- levels(model_rf$finalModel$y) # levels from the training data used in the model
pred_valid <- factor(pred_valid, levels = train_levels)
valid_set$classe <- factor(valid_set$classe, levels = train_levels)
# Sanity check: compare the predicted and actual class distributions
print(table(pred_valid))
## pred_valid
## A B C D E
## 1676 1137 1029 963 1080
print(table(valid_set$classe))
##
## A B C D E
## 1674 1139 1026 964 1082
# Compute the confusion matrix on the validation set
conf_matrix <- confusionMatrix(pred_valid, valid_set$classe)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1137 0 0 0
## C 0 0 1026 3 0
## D 0 0 0 961 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9976, 0.9995)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 1.0000 0.9969 0.9982
## Specificity 0.9995 1.0000 0.9994 0.9996 1.0000
## Pos Pred Value 0.9988 1.0000 0.9971 0.9979 1.0000
## Neg Pred Value 1.0000 0.9996 1.0000 0.9994 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1743 0.1633 0.1835
## Detection Prevalence 0.2848 0.1932 0.1749 0.1636 0.1835
## Balanced Accuracy 0.9998 0.9991 0.9997 0.9982 0.9991
Estimate the out-of-sample error as one minus the validation-set accuracy:
1 - conf_matrix$overall['Accuracy']
## Accuracy
## 0.001189465
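Equivalently, this is the misclassification rate on the validation set:
# Direct computation: fraction of validation cases predicted incorrectly
mean(pred_valid != valid_set$classe)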
Apply the trained model to the 20 provided test cases.
predictions <- predict(model_rf, newdata = testing)
predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
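To prepare the answers for submission, each prediction can be written to its own text file; the `write_submission_files()` helper and file-name pattern below are illustrative, not part of the original analysis:
# Hypothetical helper: write one prediction per file (problem_id_1.txt, ...)
write_submission_files <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = sprintf("problem_id_%d.txt", i),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_submission_files(predictions)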
The Random Forest model achieved 99.88% accuracy on the validation set, for an estimated out-of-sample error of about 0.12%, indicating good generalization to unseen data. The predictions for the 20 test cases were generated and are ready for submission.
Finally, we visualize the confusion matrix as a heat map:
# Heat map of the validation-set confusion matrix
cm_df <- as.data.frame(conf_matrix$table)
ggplot(cm_df, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal()