The goal of this project is to build a classification model that predicts the manner (the classe variable) in which participants performed a barbell lift. The classe variable has five levels: A indicates the lift was performed correctly, while B through E correspond to common mistakes in form. We will use on-body and dumbbell sensor data from the participants to train a Random Forest model.
First, we load the training and testing datasets from their URLs, treating the common missing-value indicators "", "NA", and "#DIV/0!" as NA.
# Load the packages used throughout: dplyr for data manipulation,
# caret for partitioning, training, and evaluation (readr is called via readr::)
library(dplyr)
library(caret)

training_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testing_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
# Load data, treating common missing value indicators as NA
training_raw <- readr::read_csv(training_url, na = c("", "NA", "#DIV/0!"))
## New names:
## • `` -> `...1`
## Rows: 19622 Columns: 160
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr   (4): user_name, cvtd_timestamp, new_window, classe
## dbl (150): ...1, raw_timestamp_part_1, raw_timestamp_part_2, num_window, rol...
## lgl   (6): kurtosis_yaw_belt, skewness_yaw_belt, kurtosis_yaw_dumbbell, skew...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
testing_raw <- readr::read_csv(testing_url, na = c("", "NA", "#DIV/0!"))
## New names:
## • `` -> `...1`
## Rows: 20 Columns: 160
## ── Column specification ────────────────────────────────────────────────
## Delimiter: ","
## chr   (3): user_name, cvtd_timestamp, new_window
## dbl  (57): ...1, raw_timestamp_part_1, raw_timestamp_part_2, num_window, rol...
## lgl (100): kurtosis_roll_belt, kurtosis_picth_belt, kurtosis_yaw_belt, skewn...
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
print(paste("Raw training data loaded:", nrow(training_raw), "rows,", ncol(training_raw), "cols"))
## [1] "Raw training data loaded: 19622 rows, 160 cols"
print(paste("Raw testing data loaded:", nrow(testing_raw), "rows,", ncol(testing_raw), "cols"))
## [1] "Raw testing data loaded: 20 rows, 160 cols"
The raw data requires cleaning before modeling. We will perform three main steps:
1. Remove columns that are mostly NA (over 95% missing).
2. Remove metadata and identifier columns (such as user_name, the timestamps, and the row index) that are not predictive.
3. Remove near-zero variance (NZV) columns that are constant or almost constant.
print("Cleaning data...")
## [1] "Cleaning data..."
# Step 2a: Remove columns with a high percentage of NAs (> 95%)
na_cols <- which(colSums(is.na(training_raw)) / nrow(training_raw) > 0.95)
training_clean <- training_raw[, -na_cols]
testing_clean <- testing_raw[, -na_cols]
print(paste("Removed", length(na_cols), "high-NA columns."))
## [1] "Removed 100 high-NA columns."
# Step 2b: Remove metadata/identifier columns
# Note: read_csv names the unnamed row-index column `...1` (read.csv would
# call it X1), so we match both names to ensure the row index is dropped.
meta_cols <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "new_window", "num_window")
training_clean <- training_clean %>%
  select(-any_of(c("...1", "X1")), -all_of(meta_cols))
testing_clean <- testing_clean %>%
  select(-any_of(c("...1", "X1")), -all_of(meta_cols))
print(paste("Removed", length(meta_cols) + 1, "metadata/ID columns."))
## [1] "Removed 7 metadata/ID columns."
# Step 2c: Remove Near-Zero Variance (NZV) predictors
nzv <- nearZeroVar(training_clean, saveMetrics = TRUE)
nzv_cols <- rownames(nzv)[nzv$nzv == TRUE]
training_clean <- training_clean %>%
  select(-all_of(nzv_cols))
testing_clean <- testing_clean %>%
  select(-all_of(nzv_cols))
print(paste("Removed", length(nzv_cols), "near-zero variance columns."))
## [1] "Removed 0 near-zero variance columns."
# Step 2d: Convert target variable 'classe' to a factor
training_clean$classe <- as.factor(training_clean$classe)
# Final check
if (any(is.na(training_clean))) {
  print("Warning: NAs still present after cleaning.")
} else {
  print("Data is clean. No NAs remaining.")
}
## [1] "Data is clean. No NAs remaining."
print(paste("Final training data shape:", nrow(training_clean), "rows,", ncol(training_clean), "cols"))
## [1] "Final training data shape: 19622 rows, 54 cols"
print(paste("Final testing data shape:", nrow(testing_clean), "rows,", ncol(testing_clean), "cols"))
## [1] "Final testing data shape: 20 rows, 54 cols"
Data Splitting
To reliably estimate our model’s performance, we split the full training dataset into a local training set (75%) and a local validation set (25%). We will train on the 75% and test on the 25% to estimate the out-of-sample error.
print("Splitting data into local training and validation sets...")
## [1] "Splitting data into local training and validation sets..."
set.seed(1234)  # arbitrary seed, added so the partition is reproducible
inTrain <- createDataPartition(y = training_clean$classe, p = 0.75, list = FALSE)
local_train <- training_clean[inTrain, ]
local_validation <- training_clean[-inTrain, ]
print(paste("Local training set:", nrow(local_train), "rows"))
## [1] "Local training set: 14718 rows"
print(paste("Local validation set:", nrow(local_validation), "rows"))
## [1] "Local validation set: 4904 rows"
Model Training
We will train a Random Forest model (via the ranger implementation) using 5-fold cross-validation. The caret package handles the resampling and the hyperparameter tuning (mtry, the number of predictors sampled at each split).
print("Training Random Forest model with 5-fold CV...")
## [1] "Training Random Forest model with 5-fold CV..."
train_control <- trainControl(
  method = "cv",      # cross-validation
  number = 5,         # 5 folds
  verboseIter = TRUE  # show progress
)
tune_grid <- expand.grid(
  .mtry = c(5, 10, 15),
  .splitrule = "gini",
  .min.node.size = 1
)
rf_model <- train(
  classe ~ .,
  data = local_train,
  method = "ranger",
  trControl = train_control,
  tuneGrid = tune_grid,
  importance = "impurity"  # impurity-based variable importance from ranger
)
## + Fold1: mtry= 5, splitrule=gini, min.node.size=1
## - Fold1: mtry= 5, splitrule=gini, min.node.size=1
## + Fold1: mtry=10, splitrule=gini, min.node.size=1
## - Fold1: mtry=10, splitrule=gini, min.node.size=1
## + Fold1: mtry=15, splitrule=gini, min.node.size=1
## - Fold1: mtry=15, splitrule=gini, min.node.size=1
## + Fold2: mtry= 5, splitrule=gini, min.node.size=1
## - Fold2: mtry= 5, splitrule=gini, min.node.size=1
## + Fold2: mtry=10, splitrule=gini, min.node.size=1
## - Fold2: mtry=10, splitrule=gini, min.node.size=1
## + Fold2: mtry=15, splitrule=gini, min.node.size=1
## - Fold2: mtry=15, splitrule=gini, min.node.size=1
## + Fold3: mtry= 5, splitrule=gini, min.node.size=1
## - Fold3: mtry= 5, splitrule=gini, min.node.size=1
## + Fold3: mtry=10, splitrule=gini, min.node.size=1
## - Fold3: mtry=10, splitrule=gini, min.node.size=1
## + Fold3: mtry=15, splitrule=gini, min.node.size=1
## - Fold3: mtry=15, splitrule=gini, min.node.size=1
## + Fold4: mtry= 5, splitrule=gini, min.node.size=1
## - Fold4: mtry= 5, splitrule=gini, min.node.size=1
## + Fold4: mtry=10, splitrule=gini, min.node.size=1
## - Fold4: mtry=10, splitrule=gini, min.node.size=1
## + Fold4: mtry=15, splitrule=gini, min.node.size=1
## - Fold4: mtry=15, splitrule=gini, min.node.size=1
## + Fold5: mtry= 5, splitrule=gini, min.node.size=1
## - Fold5: mtry= 5, splitrule=gini, min.node.size=1
## + Fold5: mtry=10, splitrule=gini, min.node.size=1
## - Fold5: mtry=10, splitrule=gini, min.node.size=1
## + Fold5: mtry=15, splitrule=gini, min.node.size=1
## - Fold5: mtry=15, splitrule=gini, min.node.size=1
## Aggregating results
## Selecting tuning parameters
## Fitting mtry = 10, splitrule = gini, min.node.size = 1 on full training set
print("Model training complete.")
## [1] "Model training complete."
Here are the results of the model training and tuning:
print(rf_model)
## Random Forest
##
## 14718 samples
## 53 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 11775, 11775, 11775, 11774, 11773
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 5 0.9996602 0.9995703
## 10 1.0000000 1.0000000
## 15 1.0000000 1.0000000
##
## Tuning parameter 'splitrule' was held constant at a value of gini
##
## Tuning parameter 'min.node.size' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were mtry = 10, splitrule = gini
## and min.node.size = 1.
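The selected hyperparameters can also be extracted programmatically rather than read off the printout; a small illustration (output omitted):

# The winning hyperparameter combination and its cross-validated accuracy
rf_model$bestTune
max(rf_model$results$Accuracy)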
plot(rf_model)
The resulting plot shows cross-validated accuracy as a function of mtry.
Now, we estimate the out-of-sample error by using our trained model (rf_model) to predict on the local_validation set (the 25% hold-out data).
print("Evaluating model on local validation set...")
## [1] "Evaluating model on local validation set..."
validation_pred <- predict(rf_model, newdata = local_validation)
# Create a confusion matrix to see the results
cm <- confusionMatrix(validation_pred, local_validation$classe)
print("--- Validation Results ---")
## [1] "--- Validation Results ---"
print(cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 0 949 0 0 0
## C 0 0 855 0 0
## D 0 0 0 804 0
## E 0 0 0 0 901
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9992, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Out-of-sample error = 1 - Accuracy
oos_error <- 1 - cm$overall["Accuracy"]
print(paste("Estimated Out-of-Sample Error:", round(oos_error * 100, 3), "%"))
## [1] "Estimated Out-of-Sample Error: 0 %"
We can also inspect the most important variables as determined by the model.
print("Top 20 most important variables:")
## [1] "Top 20 most important variables:"
var_imp <- varImp(rf_model, scale = FALSE)
print(plot(var_imp, top = 20))
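If a numeric ranking is preferred to the plot, the raw scores can be pulled out of the varImp object; a small illustration (output omitted):

# Top ten predictors by impurity importance, sorted in decreasing order
imp <- var_imp$importance
head(imp[order(-imp$Overall), , drop = FALSE], 10)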
Finally, we use our validated model to make predictions on the official pml-testing.csv dataset.
print("Generating final predictions on 'pml-testing.csv'...")
## [1] "Generating final predictions on 'pml-testing.csv'..."
final_predictions <- predict(rf_model, newdata = testing_clean)
print("Final Predictions:")
## [1] "Final Predictions:"
print(final_predictions)
## [1] A A A A A A A A A A A A A A A A A A A A
## Levels: A B C D E
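For readability, each prediction can be paired with its problem_id, a column that survives cleaning in the testing set (it is neither high-NA, on our metadata list, nor near-zero variance); a small illustration (output omitted):

# Pair each prediction with its problem ID from the testing data
data.frame(problem_id = testing_clean$problem_id, prediction = final_predictions)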
# Helper function that writes one submission file per prediction
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    filename <- paste0("problem_id_", i, ".txt")
    write.table(x[i], file = filename, quote = FALSE,
                row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(final_predictions)
print("Submission files (e.g., 'problem_id_1.txt') created in working directory.")
## [1] "Submission files (e.g., 'problem_id_1.txt') created in working directory."
print("--- Project Complete ---")
## [1] "--- Project Complete ---"