This analysis predicts the manner in which participants performed barbell lifts using accelerometer data. The goal is to classify whether each exercise was performed correctly (Class A) or with one of four common mistakes (Classes B-E). A Random Forest model achieved 99.5% accuracy on a held-out validation set, with an estimated out-of-sample error rate of about 0.5%.
Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the quality of barbell lifts.
The participants were asked to perform barbell lifts correctly and incorrectly in five different ways:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front
# Load required packages
library(caret)          # createDataPartition, train, trainControl, confusionMatrix, nearZeroVar, varImp
library(randomForest)   # used by caret's method = "rf"

# Load the data - using pre-loaded datasets from environment
if (exists("train_data") && exists("test_data")) {
  training <- train_data
  testing  <- test_data
} else {
  # Fall back to reading the CSV files if not pre-loaded
  training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
  testing  <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
}
# Basic data exploration
cat("Training set dimensions:", dim(training), "\n")
## Training set dimensions: 19622 160
cat("Testing set dimensions:", dim(testing), "\n")
## Testing set dimensions: 20 160
cat("\nClasse distribution in training set:\n")
##
## Classe distribution in training set:
print(table(training$classe))
##
## A B C D E
## 5580 3797 3422 3216 3607
# Check for missing values
na_count <- sapply(training, function(x) sum(is.na(x)))
na_percent <- na_count / nrow(training) * 100
# Variables with high missing value percentage
high_na_vars <- names(na_percent[na_percent > 90])
cat("Variables with >90% missing values:", length(high_na_vars), "\n")
## Variables with >90% missing values: 100
# Variables with low/no missing values
low_na_vars <- names(na_percent[na_percent < 5])
cat("Variables with <5% missing values:", length(low_na_vars), "\n")
## Variables with <5% missing values: 60
# Show structure of some key variables
str(training[, c("user_name", "classe", "num_window", "roll_belt", "pitch_belt", "yaw_belt")])
## 'data.frame': 19622 obs. of 6 variables:
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ classe : chr "A" "A" "A" "A" ...
## $ num_window: int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt: num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
# Focus on complete cases for modeling
complete_vars <- names(training)[na_count == 0]
# Remove non-predictive variables
remove_vars <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
                 "cvtd_timestamp", "new_window", "num_window")
complete_vars <- complete_vars[!complete_vars %in% remove_vars]
cat("Final number of predictive variables:", length(complete_vars) - 1, "\n") # -1 for classe
## Final number of predictive variables: 52
cat("Variables selected for modeling:\n")
## Variables selected for modeling:
print(head(complete_vars, 10))
## [1] "roll_belt" "pitch_belt" "yaw_belt" "total_accel_belt"
## [5] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [9] "accel_belt_y" "accel_belt_z"
# Create clean training dataset with complete variables only
training_clean <- training[, complete_vars]
testing_clean <- testing[, complete_vars[complete_vars != "classe"]] # testing doesn't have classe
# Ensure classe is a factor
training_clean$classe <- as.factor(training_clean$classe)
# Check for near zero variance predictors
nzv <- nearZeroVar(training_clean[, -ncol(training_clean)], saveMetrics = TRUE)
nzv_vars <- rownames(nzv[nzv$nzv == TRUE, ])
cat("Near zero variance variables:", length(nzv_vars), "\n")
## Near zero variance variables: 0
# Remove near zero variance variables if any
if (length(nzv_vars) > 0) {
training_clean <- training_clean[, !names(training_clean) %in% nzv_vars]
testing_clean <- testing_clean[, !names(testing_clean) %in% nzv_vars]
}
cat("Final training set dimensions:", dim(training_clean), "\n")
## Final training set dimensions: 19622 53
cat("Final testing set dimensions:", dim(testing_clean), "\n")
## Final testing set dimensions: 20 52
# Create data partition for cross-validation
set.seed(12345)
inTrain <- createDataPartition(training_clean$classe, p = 0.7, list = FALSE)
train_set <- training_clean[inTrain, ]
validation_set <- training_clean[-inTrain, ]
cat("Training set size:", nrow(train_set), "\n")
## Training set size: 13737
cat("Validation set size:", nrow(validation_set), "\n")
## Validation set size: 5885
# Set up cross-validation control
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
# Train Random Forest model
cat("Training Random Forest model...\n")
## Training Random Forest model...
rf_model <- train(classe ~ .,
                  data = train_set,
                  method = "rf",
                  trControl = ctrl,
                  ntree = 100,       # reduced for faster computation
                  importance = TRUE)
print(rf_model)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10987, 10990, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9895171 0.9867390
## 27 0.9893718 0.9865556
## 52 0.9836219 0.9792811
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(rf_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = min(param$mtry, ncol(x)), importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.86%
## Confusion matrix:
## A B C D E class.error
## A 3901 4 1 0 0 0.001280082
## B 20 2630 8 0 0 0.010534236
## C 0 23 2365 8 0 0.012938230
## D 1 0 43 2205 3 0.020870337
## E 0 1 2 4 2518 0.002772277
# Make predictions on validation set
rf_pred <- predict(rf_model, validation_set)
# Calculate confusion matrix
conf_matrix <- confusionMatrix(rf_pred, validation_set$classe)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 5 0 0 0
## B 2 1134 3 0 0
## C 0 0 1023 21 0
## D 0 0 0 942 0
## E 0 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9956 0.9971 0.9772 1.0000
## Specificity 0.9988 0.9989 0.9957 1.0000 0.9998
## Pos Pred Value 0.9970 0.9956 0.9799 1.0000 0.9991
## Neg Pred Value 0.9995 0.9989 0.9994 0.9955 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1927 0.1738 0.1601 0.1839
## Detection Prevalence 0.2850 0.1935 0.1774 0.1601 0.1840
## Balanced Accuracy 0.9988 0.9973 0.9964 0.9886 0.9999
# Calculate accuracy and error rates
accuracy <- conf_matrix$overall['Accuracy']
out_of_sample_error <- 1 - accuracy
cat("\n=== MODEL PERFORMANCE SUMMARY ===\n")
##
## === MODEL PERFORMANCE SUMMARY ===
cat("Validation Set Accuracy:", round(accuracy * 100, 2), "%\n")
## Validation Set Accuracy: 99.46 %
cat("Estimated Out-of-Sample Error:", round(out_of_sample_error * 100, 2), "%\n")
## Estimated Out-of-Sample Error: 0.54 %
# Plot variable importance
varImp_plot <- plot(varImp(rf_model), top = 20, main = "Top 20 Most Important Variables")
print(varImp_plot)
# Make predictions on the test set
test_predictions <- predict(rf_model, testing_clean)
cat("Predictions for the 20 test cases:\n")
## Predictions for the 20 test cases:
for(i in 1:length(test_predictions)) {
cat("Test Case", i, ":", as.character(test_predictions[i]), "\n")
}
## Test Case 1 : B
## Test Case 2 : A
## Test Case 3 : B
## Test Case 4 : A
## Test Case 5 : A
## Test Case 6 : E
## Test Case 7 : D
## Test Case 8 : B
## Test Case 9 : A
## Test Case 10 : A
## Test Case 11 : B
## Test Case 12 : C
## Test Case 13 : B
## Test Case 14 : A
## Test Case 15 : E
## Test Case 16 : E
## Test Case 17 : A
## Test Case 18 : B
## Test Case 19 : B
## Test Case 20 : B
# Create a data frame with results
results_df <- data.frame(
  Problem_ID = 1:20,
  Predicted_Class = as.character(test_predictions)
)
print(results_df)
## Problem_ID Predicted_Class
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B
The Random Forest model demonstrates excellent performance:

- Validation-set accuracy: 99.46% (95% CI: 99.23%-99.63%)
- Kappa: 0.993
- Out-of-bag error estimate on the training folds: 0.86%

Random Forest was chosen as the primary algorithm for several reasons:

- It handles a large number of numeric, partly correlated predictors without pre-processing.
- It captures non-linear relationships and interactions between sensor readings automatically.
- It is robust to noise and outliers in the accelerometer data.
- It provides variable-importance measures that aid interpretation.
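To illustrate why a simpler single-tree model would not suffice here, a minimal comparison sketch is shown below. It is not part of the original analysis; it assumes the train_set, validation_set, ctrl, and rf_pred objects created above, and uses caret's "rpart" method as a hypothetical baseline.

# Hypothetical baseline: a single classification tree trained with the same CV control
set.seed(12345)
tree_model <- train(classe ~ ., data = train_set, method = "rpart", trControl = ctrl)
tree_pred <- predict(tree_model, validation_set)
# Compare validation-set accuracy of the single tree and the Random Forest
data.frame(model = c("rpart", "random forest"),
           accuracy = c(mean(tree_pred == validation_set$classe),
                        mean(rf_pred == validation_set$classe)))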
The model uses 5-fold cross-validation to ensure robust performance estimates. Additionally, a separate validation set (30% of training data) was used for final model evaluation to provide an unbiased estimate of out-of-sample performance.
Based on the validation set performance, the expected out-of-sample error rate is approximately 0.5%, indicating that the model should correctly classify about 99.5% of new, unseen cases.
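As a sanity check on that figure, the uncertainty around the error estimate can be quantified directly from the validation-set misclassifications. The sketch below is not part of the original analysis; it assumes the rf_pred and validation_set objects created above.

# Exact binomial 95% CI for the out-of-sample error, based on the number of
# misclassified validation cases (mirrors the accuracy CI from confusionMatrix)
misclassified <- sum(rf_pred != validation_set$classe)
binom.test(misclassified, nrow(validation_set))$conf.int
# Roughly 0.4%-0.8%, consistent with the ~0.5% point estimate above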