1. Introduction

The goal of this project is to build a classification model that predicts the manner (the classe variable) in which participants performed a barbell lift. The classe variable has five levels (A, B, C, D, E), where A indicates correct form and B through E indicate common execution mistakes. We will use sensor data from the participants to train a Random Forest model.

2. Data Loading

First, we load the training and testing datasets from their URLs, treating the common missing-value indicators "", "NA", and "#DIV/0!" as NA.

# Packages used throughout: dplyr for piping/select(), caret for the modeling functions
suppressMessages({
  library(dplyr)
  library(caret)
})

training_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testing_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

# Load data, treating common missing value indicators as NA
training_raw <- readr::read_csv(training_url, na = c("", "NA", "#DIV/0!"))
## New names:
## • `` -> `...1`
## Rows: 19622 Columns: 160
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): user_name, cvtd_timestamp, new_window, classe
## dbl (150): ...1, raw_timestamp_part_1, raw_timestamp_part_2, num_window, rol...
## lgl   (6): kurtosis_yaw_belt, skewness_yaw_belt, kurtosis_yaw_dumbbell, skew...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
testing_raw <- readr::read_csv(testing_url, na = c("", "NA", "#DIV/0!"))
## New names:
## • `` -> `...1`
## Rows: 20 Columns: 160
## ── Column specification ────────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): user_name, cvtd_timestamp, new_window
## dbl  (57): ...1, raw_timestamp_part_1, raw_timestamp_part_2, num_window, rol...
## lgl (100): kurtosis_roll_belt, kurtosis_picth_belt, kurtosis_yaw_belt, skewn...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(paste("Raw training data loaded:", nrow(training_raw), "rows,", ncol(training_raw), "cols"))
## [1] "Raw training data loaded: 19622 rows, 160 cols"
print(paste("Raw testing data loaded:", nrow(testing_raw), "rows,", ncol(testing_raw), "cols"))
## [1] "Raw testing data loaded: 20 rows, 160 cols"
3. Data Preprocessing

The raw data requires cleaning before modeling. We will perform three main steps:

1. Remove columns that are mostly NA (over 95% missing).
2. Remove metadata and identifier columns (like user_name and the timestamps) that are not predictive.
3. Remove Near-Zero Variance (NZV) columns that are constant or almost constant.

print("Cleaning data...")
## [1] "Cleaning data..."
# Step 1: Remove columns with a high percentage of NAs (> 95%)
na_cols <- which(colSums(is.na(training_raw)) / nrow(training_raw) > 0.95)
training_clean <- training_raw[, -na_cols]
testing_clean <- testing_raw[, -na_cols]
print(paste("Removed", length(na_cols), "high-NA columns."))
## [1] "Removed 100 high-NA columns."
# Step 2: Remove metadata/identifier columns
# (read_csv() names the unnamed row-index column `...1`; read.csv would call it `X1`)
meta_cols <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window")
training_clean <- training_clean %>%
  select(-matches("^(X1|\\.\\.\\.1)$"), -all_of(meta_cols))
testing_clean <- testing_clean %>%
  select(-matches("^(X1|\\.\\.\\.1)$"), -all_of(meta_cols))
print(paste("Removed", length(meta_cols) + 1, "metadata/ID columns."))
## [1] "Removed 7 metadata/ID columns."
# Step 3: Remove Near-Zero Variance (NZV) predictors
nzv <- nearZeroVar(training_clean, saveMetrics = TRUE)
nzv_cols <- rownames(nzv)[nzv$nzv == TRUE]
training_clean <- training_clean %>%
  select(-all_of(nzv_cols))
testing_clean <- testing_clean %>%
  select(-all_of(nzv_cols))
print(paste("Removed", length(nzv_cols), "near-zero variance columns."))
## [1] "Removed 0 near-zero variance columns."
# Finally, convert the target variable 'classe' to a factor
training_clean$classe <- as.factor(training_clean$classe)

# Final check
if(any(is.na(training_clean))) {
  print("Warning: NAs still present after cleaning.")
} else {
  print("Data is clean. No NAs remaining.")
}
## [1] "Data is clean. No NAs remaining."
print(paste("Final training data shape:", nrow(training_clean), "rows,", ncol(training_clean), "cols"))
## [1] "Final training data shape: 19622 rows, 54 cols"
print(paste("Final testing data shape:", nrow(testing_clean), "rows,", ncol(testing_clean), "cols"))
## [1] "Final testing data shape: 20 rows, 54 cols"
4. Model Building and Validation

Data Splitting

To reliably estimate our model’s performance, we split the full training dataset into a local training set (75%) and a local validation set (25%). We will train on the 75% and test on the 25% to estimate the out-of-sample error.

print("Splitting data into local training and validation sets...")
## [1] "Splitting data into local training and validation sets..."
inTrain <- createDataPartition(y = training_clean$classe, p = 0.75, list = FALSE)
local_train <- training_clean[inTrain, ]
local_validation <- training_clean[-inTrain, ]

print(paste("Local training set:", nrow(local_train), "rows"))
## [1] "Local training set: 14718 rows"
print(paste("Local validation set:", nrow(local_validation), "rows"))
## [1] "Local validation set: 4904 rows"

Model Training

We will train a Random Forest model (via the ranger engine) using 5-fold cross-validation. The caret package will handle the cross-validation and hyperparameter tuning: mtry is tuned over a small grid, while splitrule and min.node.size are held fixed.

print("Training Random Forest model with 5-fold CV...")
## [1] "Training Random Forest model with 5-fold CV..."
train_control <- trainControl(
  method = "cv",       # Cross-validation
  number = 5,          # 5 folds
  verboseIter = TRUE   # Show progress
)

tune_grid <- expand.grid(
  .mtry = c(5, 10, 15),
  .splitrule = "gini",
  .min.node.size = 1
)

rf_model <- train(
  classe ~ .,
  data = local_train,
  method = "ranger",
  trControl = train_control,
  tuneGrid = tune_grid,
  importance = 'impurity'
)
## + Fold1: mtry= 5, splitrule=gini, min.node.size=1 
## - Fold1: mtry= 5, splitrule=gini, min.node.size=1 
## + Fold1: mtry=10, splitrule=gini, min.node.size=1 
## - Fold1: mtry=10, splitrule=gini, min.node.size=1 
## + Fold1: mtry=15, splitrule=gini, min.node.size=1 
## - Fold1: mtry=15, splitrule=gini, min.node.size=1 
## + Fold2: mtry= 5, splitrule=gini, min.node.size=1 
## - Fold2: mtry= 5, splitrule=gini, min.node.size=1 
## + Fold2: mtry=10, splitrule=gini, min.node.size=1 
## - Fold2: mtry=10, splitrule=gini, min.node.size=1 
## + Fold2: mtry=15, splitrule=gini, min.node.size=1 
## - Fold2: mtry=15, splitrule=gini, min.node.size=1 
## + Fold3: mtry= 5, splitrule=gini, min.node.size=1 
## - Fold3: mtry= 5, splitrule=gini, min.node.size=1 
## + Fold3: mtry=10, splitrule=gini, min.node.size=1 
## - Fold3: mtry=10, splitrule=gini, min.node.size=1 
## + Fold3: mtry=15, splitrule=gini, min.node.size=1 
## - Fold3: mtry=15, splitrule=gini, min.node.size=1 
## + Fold4: mtry= 5, splitrule=gini, min.node.size=1 
## - Fold4: mtry= 5, splitrule=gini, min.node.size=1 
## + Fold4: mtry=10, splitrule=gini, min.node.size=1 
## - Fold4: mtry=10, splitrule=gini, min.node.size=1 
## + Fold4: mtry=15, splitrule=gini, min.node.size=1 
## - Fold4: mtry=15, splitrule=gini, min.node.size=1 
## + Fold5: mtry= 5, splitrule=gini, min.node.size=1 
## - Fold5: mtry= 5, splitrule=gini, min.node.size=1 
## + Fold5: mtry=10, splitrule=gini, min.node.size=1 
## - Fold5: mtry=10, splitrule=gini, min.node.size=1 
## + Fold5: mtry=15, splitrule=gini, min.node.size=1 
## - Fold5: mtry=15, splitrule=gini, min.node.size=1 
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 10, splitrule = gini, min.node.size = 1 on full training set
print("Model training complete.")
## [1] "Model training complete."

Here are the results of the model training and tuning:

print(rf_model)
## Random Forest 
## 
## 14718 samples
##    53 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold) 
## Summary of sample sizes: 11775, 11775, 11775, 11774, 11773 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    5    0.9996602  0.9995703
##   10    1.0000000  1.0000000
##   15    1.0000000  1.0000000
## 
## Tuning parameter 'splitrule' was held constant at a value of gini
## 
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 10, splitrule = gini
##  and min.node.size = 1.
plot(rf_model)
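
Beyond the accuracy plot, two numbers from the fitted object are worth recording: the tuning combination caret selected, and the out-of-bag prediction error of the final ranger forest, which offers a further internal error estimate. A quick sketch:

# Hyperparameters chosen by cross-validation
rf_model$bestTune

# Out-of-bag prediction error of the final ranger fit
rf_model$finalModel$prediction.error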

5. Model Evaluation

Now, we estimate the out-of-sample error by using our trained model (rf_model) to predict on the local_validation set (the 25% hold-out data).

print("Evaluating model on local validation set...")
## [1] "Evaluating model on local validation set..."
validation_pred <- predict(rf_model, newdata = local_validation)

# Create a confusion matrix to see the results
cm <- confusionMatrix(validation_pred, local_validation$classe)

print("--- Validation Results ---")
## [1] "--- Validation Results ---"
print(cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    0  949    0    0    0
##          C    0    0  855    0    0
##          D    0    0    0  804    0
##          E    0    0    0    0  901
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9992, 1)
##     No Information Rate : 0.2845     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
## Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
## Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000
# Out-of-sample error = 1 - Accuracy
oos_error <- 1 - cm$overall["Accuracy"]
print(paste("Estimated Out-of-Sample Error:", round(oos_error * 100, 3), "%"))
## [1] "Estimated Out-of-Sample Error: 0 %"

We can also inspect the most important variables as determined by the model.

print("Top 20 most important variables:")
## [1] "Top 20 most important variables:"
var_imp <- varImp(rf_model, scale = FALSE)
print(plot(var_imp, top = 20))
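
The same importance information can also be printed as a table rather than plotted. A short sketch; with impurity importance, varImp() reports a single Overall score per predictor:

# Top 20 predictors by impurity importance, as a table
imp_df <- var_imp$importance
head(imp_df[order(-imp_df$Overall), , drop = FALSE], 20)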

6. Final Predictions

Finally, we use our validated model to make predictions on the official pml-testing.csv dataset.

print("Generating final predictions on 'pml-testing.csv'...")
## [1] "Generating final predictions on 'pml-testing.csv'..."
final_predictions <- predict(rf_model, newdata = testing_clean)

print("Final Predictions:")
## [1] "Final Predictions:"
print(final_predictions)
##  [1] A A A A A A A A A A A A A A A A A A A A
## Levels: A B C D E
# Helper function for writing one submission file per prediction
pml_write_files <- function(x) {
  n <- length(x)
  for (i in seq_len(n)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}

pml_write_files(final_predictions)
print("Submission files (e.g., 'problem_id_1.txt') created in working directory.")
## [1] "Submission files (e.g., 'problem_id_1.txt') created in working directory."
print("--- Project Complete ---")
## [1] "--- Project Complete ---"