Introduction

This is the Practical Machine Learning project report for Coursera’s Data Science Specialization, offered by Johns Hopkins University.

The goal of this project is to predict the manner in which participants performed barbell lifts, encoded as the classe variable in the training set.

We train four models (Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine) using 5‑fold cross‑validation on the training set. We then evaluate each model on a validation subset randomly selected from the original training data to estimate accuracy and the out‑of‑sample error rate.

The classe variable has five levels (A–E) representing different forms of execution quality (perfect vs common mistakes). We use accelerometer and orientation features from multiple sensors to build the predictive model.

In this report, we describe:

- How the model was built.
- How cross‑validation was used.
- What the expected out‑of‑sample error is.
- Why we chose Random Forest over simpler models.

Finally, we use the trained model to predict 20 test cases for the Coursera quiz.

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Data: The training data for this project are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv

The test data are available here:

https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv

The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

1. Environment Initialization & Parallel Setup

To shorten training time for the ensemble models, we register a parallel backend with doParallel, using all but one of the available CPU cores.

# Package dependency checklist (including visualization engines)
required_packages <- c(
  "caret", "randomForest", "rpart", "rpart.plot", 
  "corrplot", "parallel", "doParallel", "gbm",
  "kernlab"  # backend used by caret for method = "svmLinear"
)

# Install packages that are not already installed
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)

# Load required libraries
library(caret)          # Training and model utilities
library(randomForest)   # Random Forest implementation
library(rpart)          # Decision tree (rpart)
library(rpart.plot)     # Tree plotting
library(corrplot)       # Correlation matrix visualization
library(parallel)       # Detect CPU cores
library(doParallel)     # Parallel backend for caret
library(gbm)            # Gradient Boosting Machine

# Set seed for deterministic reproducibility
set.seed(12345)

# Configure parallel computation (leave 1 core free for OS)
cluster <- makeCluster(detectCores() - 1)
registerDoParallel(cluster)
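
The worker count above assumes that detectCores() returns at least 2. As a minimal sketch, a small helper can guard against single-core machines and an unknown core count; pick_workers is a hypothetical name introduced here for illustration and is not part of the original analysis.

```r
library(parallel)  # ships with base R

# Hypothetical helper: choose a worker count, leaving one core for the OS.
# detectCores() may return NA on some platforms, so fall back to 1 worker.
pick_workers <- function(n_cores = detectCores()) {
  if (is.na(n_cores)) return(1L)
  max(1L, as.integer(n_cores) - 1L)
}

pick_workers(8)   # 7 workers on an 8-core machine
pick_workers(1)   # still 1 worker on a single-core machine
pick_workers(NA)  # 1 worker when the core count is unknown
```

The result could then be passed to makeCluster() in place of detectCores() - 1.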

2. Data Ingestion & Dimensional Analysis

The training and test files are pulled directly from Coursera’s cloud storage.
Missing values, #DIV/0!, and blanks are treated as NA.

train_data_url <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_data_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_data_file <- "./data/pml-training.csv"
test_data_file  <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
  dir.create("./data")
}
if (!file.exists(train_data_file)) {
  download.file(train_data_url, destfile = train_data_file, method = "curl")
}
if (!file.exists(test_data_file)) {
  download.file(test_data_url, destfile = test_data_file, method = "curl")
}
rm(train_data_url)
rm(test_data_url)
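# Read raw CSV files, treating "NA", "#DIV/0!", and blank cells as missing
training_raw <- read.csv(train_data_file, na.strings = c("NA", "#DIV/0!", ""))
testing_raw  <- read.csv(test_data_file,  na.strings = c("NA", "#DIV/0!", ""))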

# Inspect dimensions of the raw training and testing data
dim(training_raw)
## [1] 19622   160
dim(testing_raw)
## [1]  20 160

The raw training dataset has 19,622 observations across 160 columns; the test file has 20 observations with the same number of columns.

Our job is to use the relevant sensor features to predict classe.
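
To illustrate why the missing-value markers matter, here is a small sketch on a toy CSV. The three column names occur in the real dataset, but the values are made up for illustration.

```r
# Toy CSV (made-up values) containing the three kinds of missing marker
csv_text <- 'roll_belt,kurtosis_roll_belt,max_roll_belt
1.41,,NA
1.42,#DIV/0!,NA
1.48,-1.2,3.5'

# Without na.strings, the "#DIV/0!" cell forces the column to be read as text
raw <- read.csv(text = csv_text)
class(raw$kurtosis_roll_belt)

# With na.strings, all three markers become NA and the column stays numeric
clean <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))
class(clean$kurtosis_roll_belt)
colSums(is.na(clean))
```

Declaring the markers at read time lets a later is.na() check see every missing cell, rather than leaving blanks and "#DIV/0!" hidden inside text columns.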


3. Data Cleansing & Feature Engineering

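Many of the 160 raw columns are mostly missing or are metadata fields (such as row indices, user names, and timestamps) rather than sensor measurements, so we filter them out before training.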
We apply three filters:

# 3.1 Drop columns containing any missing data
na_counts <- colSums(is.na(training_raw))
good_columns <- na_counts == 0
train_clean <- training_raw[, good_columns]

# 3.2 Remove metadata columns (first 7)
train_clean <- train_clean[, -c(1:7)]

# 3.3 Remove near-zero variance predictors
nzv_metrics <- nearZeroVar(train_clean, saveMetrics = TRUE)
train_clean <- train_clean[, !nzv_metrics$nzv]

# Ensure target variable 'classe' is a factor
train_clean$classe <- as.factor(train_clean$classe)

# Final dimensionality of cleaned dataset
dim(train_clean)
## [1] 19622    53

After cleaning, the dataset is reduced to 53 columns: 52 accelerometer and orientation predictors plus the classe outcome, ready for modeling.
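
The filtering logic can be sketched on a toy data frame. The values are made up, and the distinct-value check is a simplified stand-in for caret's nearZeroVar, which uses frequency-ratio and percent-unique criteria instead.

```r
# Toy data frame (made-up values): one complete column, one mostly-missing
# column, and one constant column
toy <- data.frame(
  roll_belt     = c(1.41, 1.42, 1.48, 1.45, 1.43),
  max_roll_belt = c(NA, NA, NA, NA, 3.5),
  constant_flag = c(0, 0, 0, 0, 0)
)

# Fraction of missing values per column
na_fraction <- colMeans(is.na(toy))

# Number of distinct non-missing values per column (1 means zero variance)
n_distinct <- vapply(toy, function(x) length(unique(x[!is.na(x)])), integer(1))

# Keep columns that are complete and actually vary
keep <- na_fraction == 0 & n_distinct > 1
names(toy)[keep]
```

Only the complete, varying column survives both checks, which mirrors how the filters above shrink the 160 raw columns.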

3.1 Feature Correlation

# Compute correlation matrix for the first 20 numeric predictors
corr_matrix <- cor(train_clean[, 1:20])

# Plot correlation matrix as a heatmap
corrplot(
  corr_matrix,
  method = "color",
  type = "lower",
  tl.cex = 0.6,
  tl.col = "black",
  main = "Figure 2: Correlation Matrix of the First 20 Predictors",
  mar = c(0, 0, 2, 0)  # leave room for the title
)

Figure 2: Correlation Matrix of the First 20 Predictors

Random Forests are robust to multicollinearity, so we keep these features and let the model decide which combinations are most predictive.
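
For readers who want to list strongly correlated pairs rather than read them off a heatmap, the following base R sketch uses toy data. The 0.9 cutoff and the variable names a, b, and c are illustrative assumptions and are not taken from the analysis.

```r
set.seed(1)  # arbitrary seed for the toy data

# Toy predictors: b is almost a copy of a, c is independent noise
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.05), c = rnorm(200))

corr_toy <- cor(toy)

# Keep each pair once (upper triangle) and flag |r| above a 0.9 cutoff
high <- which(abs(corr_toy) > 0.9 & upper.tri(corr_toy), arr.ind = TRUE)
data.frame(
  var1 = rownames(corr_toy)[high[, "row"]],
  var2 = colnames(corr_toy)[high[, "col"]],
  r    = corr_toy[high]
)
```

caret also provides findCorrelation() for suggesting columns to drop at a given cutoff. We do not apply it here because the predictors are kept as they are.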


4. Cross‑Validation Strategy

To estimate out‑of‑sample error without using the unlabeled test file (pml-testing.csv), we split the cleaned training data into a 75% training subset (local_train) and a 25% validation subset (local_validation):

# Stratified partition split (preserves class proportions)
in_train  <- createDataPartition(train_clean$classe, p = 0.75, list = FALSE)
local_train      <- train_clean[in_train, ]
local_validation <- train_clean[-in_train, ]

# Define common cross‑validation control (5‑fold CV, parallel enabled)
fit_control <- trainControl(
  method = "cv",
  number = 5,
  allowParallel = TRUE,
  verboseIter = FALSE
)

All four models are trained with the same 5‑fold CV control (fit_control) on local_train. caret uses the resampled accuracy to select each model's tuning parameters, so local_validation is never touched during training or tuning.

This two‑level strategy (train/validation split + internal CV) is our justification for why we trust the out‑of‑sample error estimate.
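
The two levels can be sketched in base R on a toy outcome. This is only an approximation of what createDataPartition and trainControl do internally; the class sizes and the seed are made up.

```r
set.seed(42)  # arbitrary seed for the toy example

# Toy outcome with the same five labels as classe (made-up class sizes)
y <- factor(rep(c("A", "B", "C", "D", "E"), times = c(280, 190, 170, 160, 200)))

# Stratified 75/25 split: sample 75% of the row indices within each class
train_idx <- unlist(lapply(split(seq_along(y), y), function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}), use.names = FALSE)
y_train <- y[train_idx]
y_valid <- y[-train_idx]

# Class proportions are nearly identical in both subsets
round(rbind(train = prop.table(table(y_train)),
            valid = prop.table(table(y_valid))), 3)

# Assign each training row to one of 5 folds; every row is held out exactly once
fold_id <- sample(rep(1:5, length.out = length(y_train)))
table(fold_id)
```

Sampling within each class keeps the class mix of the validation subset close to that of the training subset. The fold labels show how each training row serves as held-out data once during cross-validation.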


5. Model Building and Choice Justification

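We compare four approaches: a single decision tree (rpart), gradient boosted trees (gbm), a linear support vector machine (svmLinear), and a random forest (rf).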

5.1 Decision Tree (rpart) Performance

# Train Decision tree model
model_tree <- train(
  classe ~ .,
  data = local_train,
  method = "rpart",
  trControl = fit_control
)

# Predict on local validation and compute confusion matrix
pred_tree <- predict(model_tree, newdata = local_validation)
tree_cm <- confusionMatrix(pred_tree, local_validation$classe)
tree_accuracy <- tree_cm$overall["Accuracy"]

# Print confusion matrix and accuracy
print(tree_cm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1252  396  434  343  114
##          B   30  317   24  151  132
##          C   90  236  397  310  229
##          D    0    0    0    0    0
##          E   23    0    0    0  426
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4878          
##                  95% CI : (0.4737, 0.5019)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3306          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8975  0.33404  0.46433   0.0000  0.47281
## Specificity            0.6332  0.91479  0.78637   1.0000  0.99425
## Pos Pred Value         0.4931  0.48471  0.31458      NaN  0.94878
## Neg Pred Value         0.9395  0.85129  0.87424   0.8361  0.89338
## Prevalence             0.2845  0.19352  0.17435   0.1639  0.18373
## Detection Rate         0.2553  0.06464  0.08095   0.0000  0.08687
## Detection Prevalence   0.5177  0.13336  0.25734   0.0000  0.09156
## Balanced Accuracy      0.7654  0.62441  0.62535   0.5000  0.73353
tree_accuracy
##  Accuracy 
## 0.4877651
# Plotting the model
plot(model_tree)

The tree’s accuracy is low (about 49%), and it never predicts class D, so it generalizes poorly on this sensor‑data problem.

5.1.1 Tree Visualization

# Plot the final rpart tree structure
rpart.plot(
  model_tree$finalModel,
  main = "Figure 3: Decision Tree Structure (rpart)",
  type = 2,
  fallen.leaves = TRUE,
  cex = 0.6
)
Figure 3: Decision Tree Structure (rpart)

The tree is easy to interpret but reaches only about 49% accuracy on local validation: the small tree that caret selects here is too coarse to capture the non‑linear, multi‑sensor patterns that separate the five classes.

5.2 Gradient Boosted Trees (GBM) Performance

# Train GBM model
mod_gbm <- train(
  classe ~ .,
  data = local_train,
  method = "gbm",
  trControl = fit_control,
  tuneLength = 5,
  verbose = FALSE
)
# Evaluate GBM Model
pred_gbm <- predict(mod_gbm, local_validation)
cmgbm <- confusionMatrix(pred_gbm, local_validation$classe)
print(cmgbm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1392    2    0    0    0
##          B    3  945    2    0    4
##          C    0    2  848   14    2
##          D    0    0    5  785    3
##          E    0    0    0    5  892
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9914          
##                  95% CI : (0.9884, 0.9938)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9892          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9978   0.9958   0.9918   0.9764   0.9900
## Specificity            0.9994   0.9977   0.9956   0.9980   0.9988
## Pos Pred Value         0.9986   0.9906   0.9792   0.9899   0.9944
## Neg Pred Value         0.9991   0.9990   0.9983   0.9954   0.9978
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2838   0.1927   0.1729   0.1601   0.1819
## Detection Prevalence   0.2843   0.1945   0.1766   0.1617   0.1829
## Balanced Accuracy      0.9986   0.9968   0.9937   0.9872   0.9944
# Plot GBM results
plot(mod_gbm)

Gradient Boosting builds trees sequentially, correcting errors from previous trees.
On the validation set it reaches 99.1% accuracy, but it is more sensitive to tuning and slower to train than Random Forest.
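
The sequential idea can be illustrated with a minimal base R sketch. It boosts one-split regression stumps under squared error on made-up one-dimensional data, which is a simplification of the multi-class model that gbm fits. The helper names fit_stump and predict_stump and the shrinkage value are illustrative assumptions.

```r
set.seed(7)  # arbitrary seed for the toy data

# Toy 1-D regression problem (made-up data): a noisy sine curve
x <- sort(runif(200, 0, 2 * pi))
y <- sin(x) + rnorm(200, sd = 0.2)

# Fit a one-split regression stump to (x, r) by minimising squared error
fit_stump <- function(x, r) {
  cuts <- head(sort(unique(x)), -1)   # candidate thresholds (exclude max)
  sse <- vapply(cuts, function(s) {
    left <- r[x <= s]; right <- r[x > s]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  s <- cuts[which.min(sse)]
  list(split = s, left = mean(r[x <= s]), right = mean(r[x > s]))
}
predict_stump <- function(stump, x) ifelse(x <= stump$split, stump$left, stump$right)

# Boosting loop: each stump is fitted to the residuals of the current ensemble
shrinkage <- 0.1            # learning rate (illustrative value)
pred <- rep(mean(y), length(y))
mse <- numeric(50)
for (m in 1:50) {
  stump <- fit_stump(x, y - pred)
  pred <- pred + shrinkage * predict_stump(stump, x)
  mse[m] <- mean((y - pred)^2)
}

round(mse[c(1, 10, 50)], 3)   # training error shrinks as stumps are added
```

Each pass fits a new stump to what the current ensemble still gets wrong, and the shrinkage factor limits how much any single stump can change the prediction.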

5.3 Support Vector Machine (SVM) Performance

# Train a linear SVM model
mod_svm <- train(
  classe ~ .,
  data = local_train,
  method = "svmLinear",
  trControl = fit_control,
  tuneLength = 5,
  verbose = FALSE
)
# Evaluate SVM Model
pred_svm <- predict(mod_svm, local_validation)
cmsvm <- confusionMatrix(pred_svm, local_validation$classe)
print(cmsvm)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1291  129   75   48   46
##          B   24  669   79   35  115
##          C   36   59  666   83   55
##          D   33   22   25  601   51
##          E   11   70   10   37  634
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7873          
##                  95% CI : (0.7756, 0.7987)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7296          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9254   0.7050   0.7789   0.7475   0.7037
## Specificity            0.9151   0.9360   0.9425   0.9680   0.9680
## Pos Pred Value         0.8125   0.7256   0.7408   0.8210   0.8320
## Neg Pred Value         0.9686   0.9297   0.9528   0.9513   0.9355
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2633   0.1364   0.1358   0.1226   0.1293
## Detection Prevalence   0.3240   0.1880   0.1833   0.1493   0.1554
## Balanced Accuracy      0.9203   0.8205   0.8607   0.8578   0.8358

Linear SVM tries to separate classes with hyperplanes; it is much less flexible than Random Forest on complex, non‑linear sensor data and reaches only 78.7% accuracy here.

5.4 Random Forest (rf) Performance

# Train Random Forest model
model_rf <- train(
  classe ~ .,
  data = local_train,
  method = "rf",
  trControl = fit_control,
  ntree = 150
)

# Predict on local validation and compute confusion matrix
pred_rf <- predict(model_rf, newdata = local_validation)
conf_matrix_rf <- confusionMatrix(pred_rf, local_validation$classe)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    3    0    0    0
##          B    1  945    6    0    0
##          C    0    1  849   15    1
##          D    0    0    0  785    1
##          E    0    0    0    4  899
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9935          
##                  95% CI : (0.9908, 0.9955)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9917          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9993   0.9958   0.9930   0.9764   0.9978
## Specificity            0.9991   0.9982   0.9958   0.9998   0.9990
## Pos Pred Value         0.9979   0.9926   0.9804   0.9987   0.9956
## Neg Pred Value         0.9997   0.9990   0.9985   0.9954   0.9995
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2843   0.1927   0.1731   0.1601   0.1833
## Detection Prevalence   0.2849   0.1941   0.1766   0.1603   0.1841
## Balanced Accuracy      0.9992   0.9970   0.9944   0.9881   0.9984
# Extract accuracy and out‑of‑sample error
accuracy <- conf_matrix_rf$overall["Accuracy"]
out_of_sample_error <- 1 - accuracy

# Plot model
plot(model_rf)

We chose Random Forest because:

  • It combines many trees (150 here) grown on bootstrap samples (bagging), reducing variance and overfitting.
  • At each split, it uses a random subset of features, which makes the ensemble robust to noisy sensor data.
  • It handles non‑linear relationships and multicollinearity very well, which is ideal for accelerometer and gyroscope readings.

6. Accuracies and Out‑of‑Sample Errors

We evaluate all models on the local validation set and compare their Expected Out‑of‑Sample Error Rates in a single table.

##      Accuracy OOS_Error
## Tree    0.488     0.512
## RF      0.993     0.007
## GBM     0.991     0.009
## SVM     0.787     0.213

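This table shows that:

  • Random Forest has the lowest estimated out‑of‑sample error (0.7%), with GBM close behind (0.9%); their 95% confidence intervals overlap.
  • Linear SVM (21.3% error) and the single decision tree (51.2% error) are far less accurate.

This comparison supports choosing Random Forest for this project, with GBM as a close alternative.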
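
As a cross-check, the table can be rebuilt from the confusion matrices printed in Section 5. The counts below are the diagonal sums and the grand total of those matrices; the use of binom.test for the intervals is a sketch and not necessarily how the original table was produced.

```r
# Correct predictions per model = diagonal sums of the confusion matrices above
n_valid <- 4904   # total rows in local_validation (sum of any confusion matrix)
correct <- c(Tree = 2392, RF = 4872, GBM = 4862, SVM = 3861)

accuracy <- correct / n_valid
results <- data.frame(
  Accuracy  = round(accuracy, 3),
  OOS_Error = round(1 - accuracy, 3)
)
results

# Exact binomial 95% intervals for the two strongest models
rf_ci  <- binom.test(correct[["RF"]],  n_valid)$conf.int
gbm_ci <- binom.test(correct[["GBM"]], n_valid)$conf.int
round(rbind(RF = rf_ci, GBM = gbm_ci), 4)
```

The Random Forest misclassifies 32 of the 4,904 validation rows, so its estimated out-of-sample error is about 0.65%.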


7. Predicting 20 Test Cases

We now use the final Random Forest model to predict the 20 unlabeled test cases.

# Match test columns to training columns (exclude 'classe')
clean_test_columns <- names(train_clean)[names(train_clean) != "classe"]
final_test_set <- testing_raw[, clean_test_columns]

# Align data types between test and local_train columns
for (col in names(final_test_set)) {
    class(final_test_set[[col]]) <- class(local_train[[col]])
}

# Run final predictions on the cleaned test set
final_quiz_predictions <- predict(model_rf, newdata = final_test_set)

# Create submission‑style table (Problem_ID and predicted class)
data.frame(
  Problem_ID     = testing_raw$problem_id,
  Predicted_Class = final_quiz_predictions
)
##    Problem_ID Predicted_Class
## 1           1               B
## 2           2               A
## 3           3               B
## 4           4               A
## 5           5               A
## 6           6               E
## 7           7               D
## 8           8               B
## 9           9               A
## 10         10               A
## 11         11               B
## 12         12               C
## 13         13               B
## 14         14               A
## 15         15               E
## 16         16               E
## 17         17               A
## 18         18               B
## 19         19               B
## 20         20               B

These 20 predicted classes (A–E) are the answers for the Coursera 20‑question quiz.


8. Cleanup Environment

# De‑allocate parallel cluster and revert to sequential backend
stopCluster(cluster)
registerDoSEQ()

# Print confirmation message
cat("Process complete. Parallel cluster closed successfully.\n")
## Process complete. Parallel cluster closed successfully.

Conclusion