Introduction

This report builds a predictive model for the “Weight Lifting Exercise (WLE)” dataset.

Background

Activity trackers (such as Fitbit) collect large amounts of movement data. This project uses accelerometer data from 6 participants performing barbell lifts correctly and incorrectly in 5 different ways. The goal is to classify how each lift was performed (class A is the correct execution; classes B-E are common mistakes) based on this sensor data.

Data Sources

Training Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

Testing Data: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

Original Source: http://groupware.les.inf.puc-rio.br/har

Intended Results

The goal is to predict the classe variable, which describes the exercise quality. This report explains the model building process, including cross-validation and out-of-sample error estimation. The final model is used to predict 20 test cases.

  1. Reproducibility and Setup

This section loads the required R libraries and sets a random seed for reproducible results.
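
The setup code is not shown above, so the sketch below reconstructs it from the functions used later in the report (nearZeroVar, createDataPartition, train, and confusionMatrix from caret; rpart; prp from rpart.plot; corrplot); the package list is inferred and the seed value is illustrative.

library(caret)        # nearZeroVar(), createDataPartition(), train(), confusionMatrix()
library(rpart)        # rpart() decision trees
library(rpart.plot)   # prp() for plotting the fitted tree
library(corrplot)     # corrplot() for the correlation matrix
library(randomForest) # backend used by caret for method = "rf"

# Illustrative seed; any fixed value makes the partition and resampling reproducible
set.seed(12345)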

  2. Getting and Reading Data

The code below downloads the training and testing datasets and loads them into R.

trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "./data/pml-training.csv"
testFile  <- "./data/pml-testing.csv"

if (!file.exists("./data")) {
  dir.create("./data")
}
if (!file.exists(trainFile)) {
  download.file(trainUrl, destfile = trainFile, method = "curl")
}
if (!file.exists(testFile)) {
  download.file(testUrl, destfile = testFile, method = "curl")
}
rm(trainUrl, testUrl)


trainRaw <- read.csv(trainFile)
testRaw <- read.csv(testFile)
print(paste("Training data dimensions:", dim(trainRaw)[1], "rows,", dim(trainRaw)[2], "cols"))
## [1] "Training data dimensions: 19622 rows, 160 cols"
print(paste("Testing data dimensions:", dim(testRaw)[1], "rows,", dim(testRaw)[2], "cols"))
## [1] "Testing data dimensions: 20 rows, 160 cols"
rm(trainFile, testFile)

  3. Cleaning Data

The data is cleaned by removing variables with little or no predictive value: near-zero-variance columns, metadata/identifier columns, and columns containing missing values.

3.1. Remove Near Zero Variance (NZV) Variables

First, we remove columns with near-zero variance (i.e., columns that are mostly constant).

NZV <- nearZeroVar(trainRaw, saveMetrics = TRUE)
training01 <- trainRaw[, !NZV$nzv]
testing01 <- testRaw[, !NZV$nzv]
print(paste("Dimensions after NZV removal:", dim(training01)[1], "rows,", dim(training01)[2], "cols"))
## [1] "Dimensions after NZV removal: 19622 rows, 100 cols"
rm(trainRaw, testRaw, NZV)

3.2. Remove Metadata/Identifier Columns

Next, we remove metadata columns (like user names, timestamps, and row IDs) that are not predictive sensor data.

regex <- grepl("^X|timestamp|user_name", names(training01))
training <- training01[, !regex]
testing <- testing01[, !regex]
rm(regex, training01, testing01)
print(paste("Dimensions after metadata removal:", dim(training)[1], "rows,", dim(training)[2], "cols"))
## [1] "Dimensions after metadata removal: 19622 rows, 95 cols"

3.3. Remove Columns with Missing Values (NAs)

Finally, we remove all columns that contain any NA (missing) values.

cond <- (colSums(is.na(training)) == 0)
training <- training[, cond]
testing <- testing[, cond]
rm(cond)
print(paste("Final training dimensions:", dim(training)[1], "rows,", dim(training)[2], "cols"))
## [1] "Final training dimensions: 19622 rows, 54 cols"
print(paste("Final testing dimensions:", dim(testing)[1], "rows,", dim(testing)[2], "cols"))
## [1] "Final testing dimensions: 20 rows, 54 cols"

3.4. Convert Outcome Variable to Factor

We must convert the classe variable from a character to a factor for the classification models. This ensures the levels are consistent for confusionMatrix().

training$classe <- as.factor(training$classe)
print("Converted 'classe' variable to factor.")
## [1] "Converted 'classe' variable to factor."

  4. Correlation Matrix

A correlation matrix of the remaining predictors (excluding the classe outcome, which is the last column) is plotted to visualize their pairwise relationships.

# Correlations among the predictors only (classe, the last column, is dropped)
corrplot(cor(training[, -length(names(training))]), method = "color", tl.cex = 0.5)

  5. Partitioning Training Set

The clean training data is split into a 70% training set (for building the model) and a 30% validation set (for testing it).

# Stratified 70/30 split on the classe outcome
inTrain <- createDataPartition(training$classe, p = 0.70, list = FALSE)
validation <- training[-inTrain, ]  # 30% held out for validation
training <- training[inTrain, ]     # 70% used to build the models
rm(inTrain)

print(paste("Pure Training Set:", nrow(training), "rows"))
## [1] "Pure Training Set: 13737 rows"
print(paste("Validation Set:", nrow(validation), "rows"))
## [1] "Validation Set: 5885 rows"

  6. Data Modelling

We train and compare two classification models: a Decision Tree and a Random Forest.

6.1. Model 1: Decision Tree (rpart)

The first model is a simple Decision Tree.

modelTree <- rpart(classe ~ ., data = training, method = "class")
prp(modelTree)

# Estimate performance on the validation set
predictTree <- predict(modelTree, validation, type = "class")
cm_tree <- confusionMatrix(validation$classe, predictTree)
print("--- Decision Tree Results ---")
## [1] "--- Decision Tree Results ---"
print(cm_tree)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1492   37   10   84   51
##          B  270  551  120  134   64
##          C   55   32  818   49   72
##          D  116   17  117  655   59
##          E   84   89   61  140  708
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7178          
##                  95% CI : (0.7061, 0.7292)
##     No Information Rate : 0.3427          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6409          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7397  0.75895   0.7265   0.6168   0.7421
## Specificity            0.9529  0.88602   0.9563   0.9359   0.9242
## Pos Pred Value         0.8913  0.48376   0.7973   0.6795   0.6543
## Neg Pred Value         0.8753  0.96313   0.9366   0.9173   0.9488
## Prevalence             0.3427  0.12336   0.1913   0.1805   0.1621
## Detection Rate         0.2535  0.09363   0.1390   0.1113   0.1203
## Detection Prevalence   0.2845  0.19354   0.1743   0.1638   0.1839
## Balanced Accuracy      0.8463  0.82249   0.8414   0.7763   0.8331
# Estimated out-of-sample error = 1 - validation accuracy
ose_tree <- 1 - as.numeric(cm_tree$overall[1])
print(paste("Decision Tree OOS Error:", round(ose_tree, 4)))
## [1] "Decision Tree OOS Error: 0.2822"
rm(predictTree, modelTree, cm_tree)

6.2. Model 2: Random Forest

The second model is a Random Forest, which typically achieves higher accuracy at the cost of longer training time. We use 5-fold cross-validation to select the mtry tuning parameter.

# This step may take some time
modelRF <- train(classe ~ ., 
                 data = training, 
                 method = "rf", 
                 trControl = trainControl(method = "cv", 5), 
                 ntree = 250)
print(modelRF)
## Random Forest 
## 
## 13737 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 10988, 10990, 10991, 10990, 10989 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9933033  0.9915283
##   27    0.9971614  0.9964095
##   53    0.9938853  0.9922657
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
# Estimate performance on the validation set
predictRF <- predict(modelRF, validation)
cm_rf <- confusionMatrix(validation$classe, predictRF)
print("--- Random Forest Results ---")
## [1] "--- Random Forest Results ---"
print(cm_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    0    0    0    0
##          B    1 1137    1    0    0
##          C    0    1 1025    0    0
##          D    0    0    0  964    0
##          E    0    0    0    2 1080
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9992         
##                  95% CI : (0.998, 0.9997)
##     No Information Rate : 0.2846         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9989         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9991   0.9990   0.9979   1.0000
## Specificity            1.0000   0.9996   0.9998   1.0000   0.9996
## Pos Pred Value         1.0000   0.9982   0.9990   1.0000   0.9982
## Neg Pred Value         0.9998   0.9998   0.9998   0.9996   1.0000
## Prevalence             0.2846   0.1934   0.1743   0.1641   0.1835
## Detection Rate         0.2845   0.1932   0.1742   0.1638   0.1835
## Detection Prevalence   0.2845   0.1935   0.1743   0.1638   0.1839
## Balanced Accuracy      0.9997   0.9993   0.9994   0.9990   0.9998
# Estimated out-of-sample error = 1 - validation accuracy
ose_rf <- 1 - as.numeric(cm_rf$overall[1])
print(paste("Random Forest OOS Error:", round(ose_rf, 4)))
## [1] "Random Forest OOS Error: 8e-04"
rm(predictRF, cm_rf)

With a validation accuracy of 99.92% versus 71.78% for the Decision Tree, the Random Forest has a far lower estimated out-of-sample error (about 0.08% versus 28.22%). We will use this model for the final predictions.
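
As a quick side-by-side summary, the two error estimates computed above (ose_tree and ose_rf, both still in the workspace) can be tabulated; the snippet below is a small illustrative sketch.

# Illustrative comparison of the validation-set error estimates computed above
model_comparison <- data.frame(
  Model           = c("Decision Tree", "Random Forest"),
  ValidationError = c(round(ose_tree, 4), round(ose_rf, 4))
)
print(model_comparison)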

  7. Final Predictions on Test Set

The trained Random Forest model is used to predict the 20 cases in the official test set. The non-predictive problem_id column is removed from the test set before prediction.

# Note: The 'testing' set from cleaning still has 'problem_id'
# We predict on the testing set, removing the last column ('problem_id')
final_predictions <- predict(modelRF, testing[, -length(names(testing))])

print("Final Predictions on 20 Test Cases:")
## [1] "Final Predictions on 20 Test Cases:"
print(final_predictions)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
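
If the individual predictions need to be written out (for example, one text file per test case), a helper along the following lines could be used; the output directory and file-naming scheme are assumptions for illustration.

# Illustrative helper: write each prediction to its own text file.
# The directory name and "problem_id_N.txt" naming scheme are assumptions.
write_prediction_files <- function(predictions, out_dir = "./predictions") {
  if (!dir.exists(out_dir)) {
    dir.create(out_dir)
  }
  for (i in seq_along(predictions)) {
    out_file <- file.path(out_dir, paste0("problem_id_", i, ".txt"))
    write.table(as.character(predictions[i]), file = out_file,
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}

write_prediction_files(final_predictions)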