In this project, we aim to predict the manner in which participants
performed barbell lifts using data collected from accelerometers on
their belt, forearm, arm, and dumbbell. The target variable is `classe`, which has five levels (A-E), each corresponding to a particular style of exercise execution.
library(caret)    # nearZeroVar, createDataPartition, confusionMatrix
library(ggplot2)  # confusion matrix heat map
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
# Remove near zero variance predictors
nzv <- nearZeroVar(training)
training <- training[, -nzv]
# Remove columns with >95% NA values
na_counts <- colSums(is.na(training))
training <- training[, which(na_counts / nrow(training) < 0.95)]
# Remove identification and timestamp columns (still the first five columns
# after the filters above); the testing set keeps its remaining columns,
# since predict() uses only the predictors the model was trained on
training <- training[, -(1:5)]
testing <- testing[, -(1:5)]
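As a quick sanity check, the cleaned training set should now contain `classe` plus the 53 predictors reported in the model summary below:
# Sanity check: cleaned data should hold classe plus 53 predictor columns
dim(training)
sum(colnames(training) != "classe")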
Partition the training data into 70% for model training and 30% for validation.
set.seed(12345)
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train_set <- training[inTrain, ]
valid_set <- training[-inTrain, ]
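createDataPartition() samples within each level of `classe`, so the class proportions in both splits should closely mirror the full training set:
# Verify that stratified splitting preserved the class balance
round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(valid_set$classe)), 3)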
To avoid long computation during knitting, we pre-trained the Random Forest model and saved it as `model_rf.rds`. Here, we load the saved model.
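For reference, here is a minimal sketch of the offline training step that could produce `model_rf.rds`, assuming caret's `train()` with the 5-fold cross-validation shown in the printed summary below:
# Run once, offline: fit a random forest with 5-fold CV and cache it to disk
set.seed(12345)
ctrl <- trainControl(method = "cv", number = 5)
model_rf <- train(classe ~ ., data = train_set, method = "rf",
                  trControl = ctrl)
saveRDS(model_rf, "model_rf.rds")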
model_rf <- readRDS("model_rf.rds")
model_rf
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10990, 10989, 10990, 10989, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9940308 0.9924493
## 27 0.9970882 0.9963168
## 53 0.9943947 0.9929095
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
Evaluate model performance on the validation set.
# Predict on validation set
pred_valid <- predict(model_rf, newdata = valid_set)
# Align factor levels of prediction and validation labels to those in the original training set
train_levels <- levels(model_rf$finalModel$y) # levels from the training data used in the model
pred_valid <- factor(pred_valid, levels = train_levels)
valid_set$classe <- factor(valid_set$classe, levels = train_levels)
# Sanity check: compare the predicted and actual class distributions
print(table(pred_valid))
## pred_valid
## A B C D E
## 1676 1137 1029 963 1080
print(table(valid_set$classe))
##
## A B C D E
## 1674 1139 1026 964 1082
# Compute the confusion matrix on the validation set
conf_matrix <- confusionMatrix(pred_valid, valid_set$classe)
conf_matrix
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 2 0 0 0
## B 0 1137 0 0 0
## C 0 0 1026 3 0
## D 0 0 0 961 2
## E 0 0 0 0 1080
##
## Overall Statistics
##
## Accuracy : 0.9988
## 95% CI : (0.9976, 0.9995)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 1.0000 0.9969 0.9982
## Specificity 0.9995 1.0000 0.9994 0.9996 1.0000
## Pos Pred Value 0.9988 1.0000 0.9971 0.9979 1.0000
## Neg Pred Value 1.0000 0.9996 1.0000 0.9994 0.9996
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1743 0.1633 0.1835
## Detection Prevalence 0.2848 0.1932 0.1749 0.1636 0.1835
## Balanced Accuracy 0.9998 0.9991 0.9997 0.9982 0.9991
Estimate the out-of-sample error as one minus the validation-set accuracy:
1 - conf_matrix$overall['Accuracy']
## Accuracy
## 0.001189465
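Equivalently, this is the misclassification rate on the validation set:
# Direct computation: fraction of validation cases predicted incorrectly
mean(pred_valid != valid_set$classe)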
Apply the trained model to the 20 provided test cases.
predictions <- predict(model_rf, newdata = testing)
predictions
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
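To prepare the answers for submission, each prediction can be written to its own text file; the `write_submission_files()` helper and file-name pattern below are illustrative, not part of the original analysis:
# Hypothetical helper: write one prediction per file (problem_id_1.txt, ...)
write_submission_files <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = sprintf("problem_id_%d.txt", i),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_submission_files(predictions)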
The Random Forest model achieved 99.88% accuracy on the validation set, for an estimated out-of-sample error of about 0.12%, indicating good generalization to unseen data. The predictions for the 20 test cases were generated and are ready for submission.
Finally, we visualize the confusion matrix as a heat map:
# Heat map of the validation-set confusion matrix
cm_df <- as.data.frame(conf_matrix$table)
ggplot(cm_df, aes(Prediction, Reference, fill = Freq)) +
  geom_tile() +
  geom_text(aes(label = Freq), vjust = 1) +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme_minimal()