This report builds a predictive model for the “Weight Lifting Exercise (WLE)” dataset.
Activity trackers (such as Fitbit) collect large amounts of movement data. This project uses accelerometer data from 6 participants performing barbell lifts. The goal is to classify how well each lift was performed: correctly (class A) or incorrectly in one of four common ways (classes B through E), based on the sensor data.
Training Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
Testing Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Original Source: http://groupware.les.inf.puc-rio.br/har
The goal is to predict the classe variable, which describes the exercise quality. This report explains the model building process, including cross-validation and out-of-sample error estimation. The final model is used to predict 20 test cases.
This section loads the required R libraries and sets a random seed for reproducible results.
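A minimal setup sketch, assuming the packages implied by the functions used later in this report (the seed value below is an arbitrary choice):
library(caret)         # nearZeroVar, createDataPartition, train, confusionMatrix
library(rpart)         # decision tree model
library(rpart.plot)    # prp() for plotting the tree
library(corrplot)      # corrplot() for the correlation matrix
library(randomForest)  # backend used by caret for method = "rf"
set.seed(12345)        # arbitrary fixed seed for reproducibility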
The code below downloads the training and testing datasets and loads them into R.
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(trainFile)) {
download.file(trainUrl, destfile = trainFile, method = "curl")
}
if (!file.exists(testFile)) {
download.file(testUrl, destfile = testFile, method = "curl")
}
rm(trainUrl)
rm(testUrl)
trainRaw <- read.csv(trainFile)
testRaw <- read.csv(testFile)
print(paste("Training data dimensions:", dim(trainRaw)[1], "rows,", dim(trainRaw)[2], "cols"))
## [1] "Training data dimensions: 19622 rows, 160 cols"
print(paste("Testing data dimensions:", dim(testRaw)[1], "rows,", dim(testRaw)[2], "cols"))
## [1] "Testing data dimensions: 20 rows, 160 cols"
rm(trainFile)
rm(testFile)
The data are cleaned by removing variables that carry no useful predictive information.
3.1. Remove Near Zero Variance (NZV) Variables
First, we remove columns with near-zero variance (i.e., columns that are mostly constant).
NZV <- nearZeroVar(trainRaw, saveMetrics = TRUE)
training01 <- trainRaw[, !NZV$nzv]
testing01 <- testRaw[, !NZV$nzv]
print(paste("Dimensions after NZV removal:", dim(training01)[1], "rows,", dim(training01)[2], "cols"))
## [1] "Dimensions after NZV removal: 19622 rows, 100 cols"
rm(trainRaw)
rm(testRaw)
rm(NZV)
3.2. Remove Metadata/Identifier Columns
Next, we remove metadata columns (like user names, timestamps, and row IDs) that are not predictive sensor data.
regex <- grepl("^X|timestamp|user_name", names(training01))
training <- training01[, !regex]
testing <- testing01[, !regex]
rm(regex)
rm(training01)
rm(testing01)
print(paste("Dimensions after metadata removal:", dim(training)[1], "rows,", dim(training)[2], "cols"))
## [1] "Dimensions after metadata removal: 19622 rows, 95 cols"
3.3. Remove Columns with Missing Values (NAs)
Finally, we remove all columns that contain any NA (missing) values.
cond <- (colSums(is.na(training)) == 0)
training <- training[, cond]
testing <- testing[, cond]
rm(cond)
print(paste("Final training dimensions:", dim(training)[1], "rows,", dim(training)[2], "cols"))
## [1] "Final training dimensions: 19622 rows, 54 cols"
print(paste("Final testing dimensions:", dim(testing)[1], "rows,", dim(testing)[2], "cols"))
## [1] "Final testing dimensions: 20 rows, 54 cols"
3.4. Convert Outcome Variable to Factor
We must convert the classe variable from a character to a factor for the classification models. This ensures the levels are consistent for confusionMatrix().
training$classe <- as.factor(training$classe)
print("Converted 'classe' variable to factor.")
## [1] "Converted 'classe' variable to factor."
A correlation matrix is plotted to visualize relationships between the remaining predictors.
# Correlation matrix of all predictors (the last column, classe, is excluded)
corrplot(cor(training[, -length(names(training))]), method = "color", tl.cex = 0.5)
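As an optional check (a sketch, not part of the original cleaning pipeline), caret's findCorrelation() can list the predictors behind the strongest correlations visible in the plot; the 0.8 cutoff is an arbitrary illustrative choice.
# Optional sketch: predictors involved in pairwise |correlation| > 0.8
corMat   <- cor(training[, -length(names(training))])
highCorr <- findCorrelation(corMat, cutoff = 0.8)
names(training)[highCorr]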
The clean training data is split into a 70% training set (for building the models) and a 30% validation set (used to estimate out-of-sample error).
inTrain <- createDataPartition(training$classe, p = 0.70, list = FALSE)
validation <- training[-inTrain, ]
training <- training[inTrain, ]
rm(inTrain)
print(paste("Pure Training Set:", nrow(training), "rows"))
## [1] "Pure Training Set: 13737 rows"
print(paste("Validation Set:", nrow(validation), "rows"))
## [1] "Validation Set: 5885 rows"
We will train and compare two classification models.
6.1. Model 1: Decision Tree (rpart)
The first model is a simple Decision Tree.
# Fit a single classification tree on the training split
modelTree <- rpart(classe ~ ., data = training, method = "class")
# Plot the fitted tree
prp(modelTree)
# Estimate performance on the validation set
predictTree <- predict(modelTree, validation, type = "class")
cm_tree <- confusionMatrix(validation$classe, predictTree)
print("--- Decision Tree Results ---")
## [1] "--- Decision Tree Results ---"
print(cm_tree)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1492 37 10 84 51
## B 270 551 120 134 64
## C 55 32 818 49 72
## D 116 17 117 655 59
## E 84 89 61 140 708
##
## Overall Statistics
##
## Accuracy : 0.7178
## 95% CI : (0.7061, 0.7292)
## No Information Rate : 0.3427
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6409
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.7397 0.75895 0.7265 0.6168 0.7421
## Specificity 0.9529 0.88602 0.9563 0.9359 0.9242
## Pos Pred Value 0.8913 0.48376 0.7973 0.6795 0.6543
## Neg Pred Value 0.8753 0.96313 0.9366 0.9173 0.9488
## Prevalence 0.3427 0.12336 0.1913 0.1805 0.1621
## Detection Rate 0.2535 0.09363 0.1390 0.1113 0.1203
## Detection Prevalence 0.2845 0.19354 0.1743 0.1638 0.1839
## Balanced Accuracy 0.8463 0.82249 0.8414 0.7763 0.8331
ose_tree <- 1 - as.numeric(cm_tree$overall[1])
print(paste("Decision Tree OOS Error:", round(ose_tree, 4)))
## [1] "Decision Tree OOS Error: 0.2822"
rm(predictTree)
rm(modelTree)
rm(cm_tree)
6.2. Model 2: Random Forest
The second model is a Random Forest, which is typically more accurate. We use 5-fold cross-validation to select the mtry tuning parameter.
# This step may take some time
modelRF <- train(classe ~ .,
                 data = training,
                 method = "rf",
                 trControl = trainControl(method = "cv", 5),
                 ntree = 250)
print(modelRF)
## Random Forest
##
## 13737 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10988, 10990, 10991, 10990, 10989
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9933033 0.9915283
## 27 0.9971614 0.9964095
## 53 0.9938853 0.9922657
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
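As an optional aside (a sketch, not part of the original tuning), caret's varImp() shows which predictors the fitted forest relies on most:
# Optional sketch: variable importance of the fitted random forest
print(varImp(modelRF))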
# Estimate performance on the validation set
predictRF <- predict(modelRF, validation)
cm_rf <- confusionMatrix(validation$classe, predictRF)
print("--- Random Forest Results ---")
## [1] "--- Random Forest Results ---"
print(cm_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 1 1137 1 0 0
## C 0 1 1025 0 0
## D 0 0 0 964 0
## E 0 0 0 2 1080
##
## Overall Statistics
##
## Accuracy : 0.9992
## 95% CI : (0.998, 0.9997)
## No Information Rate : 0.2846
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9989
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9991 0.9990 0.9979 1.0000
## Specificity 1.0000 0.9996 0.9998 1.0000 0.9996
## Pos Pred Value 1.0000 0.9982 0.9990 1.0000 0.9982
## Neg Pred Value 0.9998 0.9998 0.9998 0.9996 1.0000
## Prevalence 0.2846 0.1934 0.1743 0.1641 0.1835
## Detection Rate 0.2845 0.1932 0.1742 0.1638 0.1835
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 0.9997 0.9993 0.9994 0.9990 0.9998
ose_rf <- 1 - as.numeric(cm_rf$overall[1])
print(paste("Random Forest OOS Error:", round(ose_rf, 4)))
## [1] "Random Forest OOS Error: 8e-04"
rm(predictRF)
rm(cm_rf)
The Random Forest model is far more accurate, with an estimated out-of-sample error of about 0.08% versus roughly 28% for the decision tree. We will use this model for the final predictions.
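A minimal sketch that contrasts the two error estimates computed above (the ose_tree and ose_rf objects are still in the workspace):
# Compare estimated out-of-sample error rates of the two models
print(data.frame(Model     = c("Decision Tree", "Random Forest"),
                 OOS.Error = c(ose_tree, ose_rf)))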
The trained Random Forest model is used to predict the 20 cases in the official test set. The non-predictive problem_id column is removed from the test set before prediction.
# Note: The 'testing' set from cleaning still has 'problem_id'
# We predict on the testing set, removing the last column ('problem_id')
final_predictions <- predict(modelRF, testing[, -length(names(testing))])
print("Final Predictions on 20 Test Cases:")
## [1] "Final Predictions on 20 Test Cases:"
print(final_predictions)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E