This is the Practical Machine Learning project report for the Data Science Specialization offered by Johns Hopkins University on Coursera.
The goal of this project is to predict the manner in which
participants performed barbell lifts, encoded as the classe
variable in the training set.
We train four models (Decision Tree, Random Forest, Gradient Boosted Trees, and Support Vector Machine) using 5‑fold cross‑validation on the training set. We then evaluate each model on a validation subset randomly selected from the original training data to estimate accuracy and out‑of‑sample error rate.
The classe variable has five levels (A–E) representing
different forms of execution quality (perfect vs common mistakes). We
use accelerometer and orientation features from multiple sensors to
build the predictive model.
In this report, we describe:
- How the model was built.
- How cross‑validation was used.
- What the expected out‑of‑sample error is.
- Why we chose Random Forest over simpler models.
Finally, we use the trained model to predict 20 test cases for the Coursera quiz.
Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
Data: The training data for this project are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here:
https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har . If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
To shorten training time for the ensemble models, we initialize a
parallel cluster across the available CPU cores using doParallel,
leaving one core free for the operating system.
# Package dependency checklist (including visualization engines)
required_packages <- c(
"caret", "randomForest", "rpart", "rpart.plot",
"corrplot", "parallel", "doParallel", "gbm",
"kernlab" # backend used by caret's "svmLinear" method
)
# Install packages that are not already installed
new_packages <- required_packages[!(required_packages %in% installed.packages()[,"Package"])]
if(length(new_packages)) install.packages(new_packages)
# Load required libraries
library(caret) # Training and model utilities
library(randomForest) # Random Forest implementation
library(rpart) # Decision tree (rpart)
library(rpart.plot) # Tree plotting
library(corrplot) # Correlation matrix visualization
library(parallel) # Detect CPU cores
library(doParallel) # Parallel backend for caret
library(gbm) # Gradient Boosting Machine
# Set seed for deterministic reproducibility
set.seed(12345)
# Configure parallel computation (leave 1 core free for OS)
cluster <- makeCluster(max(1, detectCores() - 1, na.rm = TRUE)) # never fewer than 1 worker
registerDoParallel(cluster)
The training and test files are pulled directly from Coursera’s cloud
storage.
Missing values, #DIV/0!, and blanks are treated as
NA.
train_data_url <-"https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
test_data_url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train_data_file <- "./data/pml-training.csv"
test_data_file <- "./data/pml-testing.csv"
if (!file.exists("./data")) {
dir.create("./data")
}
if (!file.exists(train_data_file)) {
download.file(train_data_url, destfile = train_data_file, method = "curl")
}
if (!file.exists(test_data_file)) {
download.file(test_data_url, destfile = test_data_file, method = "curl")
}
rm(train_data_url)
rm(test_data_url)
# Read raw CSV files
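# Treat "NA", "#DIV/0!", and blank fields as missing values
training_raw <- read.csv(train_data_file, na.strings = c("NA", "#DIV/0!", ""))
testing_raw <- read.csv(test_data_file, na.strings = c("NA", "#DIV/0!", ""))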
# Inspect dimensions of the raw training and testing data
dim(training_raw)
## [1] 19622 160
dim(testing_raw)
## [1] 20 160
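The NA handling described above can be illustrated on a small inline CSV. This is a minimal sketch: the column names echo the dataset, but the values are made up for illustration.

```r
# Toy CSV (hypothetical values) mixing the three "missing" markers
csv_text <- "roll_belt,kurtosis_roll_belt,classe
1.41,,A
1.42,#DIV/0!,A
1.48,NA,B
1.50,-1.2,B"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

# Blank, "#DIV/0!", and "NA" all become NA, so the column stays numeric
colSums(is.na(toy))
str(toy)
```

Without `na.strings`, the `#DIV/0!` entry would force the whole column to be read as text rather than numbers.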
The raw training dataset has 19,622 observations across 160 columns, including metadata (the first seven columns), sensor measurements, and the target classe.
Our job is to use the relevant sensor features to predict
classe.
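Columns with missing values, metadata, and near‑constant predictors add
noise and training time without helping prediction.
We apply three filters:
- Drop columns that contain any missing values.
- Remove the first seven metadata columns.
- Remove near‑zero‑variance predictors.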
# 3.1 Drop columns containing any missing data
na_counts <- colSums(is.na(training_raw))
good_columns <- na_counts == 0
train_clean <- training_raw[, good_columns]
# 3.2 Remove metadata columns (first 7)
train_clean <- train_clean[, -c(1:7)]
# 3.3 Remove near-zero variance predictors
nzv_metrics <- nearZeroVar(train_clean, saveMetrics = TRUE)
train_clean <- train_clean[, !nzv_metrics$nzv]
# Ensure target variable 'classe' is a factor
train_clean$classe <- as.factor(train_clean$classe)
# Final dimensionality of cleaned dataset
dim(train_clean)
## [1] 19622 53
After cleaning, the dataset is reduced to 53 columns: 52 accelerometer and orientation predictors plus the classe target, ready for modeling.
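To make the missing-value and near-zero-variance filters concrete, here is a minimal base-R sketch on a toy data frame. The helper `nzv_sketch` is hypothetical and only approximates the rule behind caret's `nearZeroVar` (a high ratio between the two most common values combined with few distinct values); the cutoffs 95/5 and 10 are assumed defaults.

```r
# Hypothetical helper: approximates the near-zero-variance rule in base R
nzv_sketch <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(TRUE)  # constant column: zero variance
  freq_ratio <- as.numeric(counts[1] / counts[2])
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique < unique_cut
}

toy <- data.frame(
  mostly_constant = c(rep(0, 99), 1),  # 99:1 ratio, 2% unique values
  informative     = seq_len(100) / 10, # every value distinct
  has_missing     = c(NA, seq_len(99)) # contains one NA
)

# Filter 1: keep only columns without missing values
no_missing <- toy[, colSums(is.na(toy)) == 0, drop = FALSE]

# Filter 2: flag near-constant predictors among the remaining columns
flagged <- vapply(no_missing, nzv_sketch, logical(1))
names(no_missing)[!flagged]
```

Only `informative` survives both filters in this toy example.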
# Compute correlation matrix for first 20 numeric predictors
corr_matrix <- cor(train_clean[, 1:20])
# Plot correlation matrix as a heatmap
corrplot(
corr_matrix,
method = "color",
type = "lower",
tl.cex = 0.6,
tl.col = "black",
main = "\nFigure 2: Correlation Matrix of the First 20 Predictors"
)
Figure 2: Correlation Matrix of the First 20 Predictors
Random Forests are robust to multicollinearity, so we keep these features and let the model decide which combinations are most predictive.
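If we wanted to list the strongly correlated pairs instead of reading them off the heatmap, a base-R sketch could look like this. The toy data and the 0.9 threshold are assumptions for illustration, not values taken from the analysis.

```r
set.seed(1)
x1 <- rnorm(200)
toy <- data.frame(
  x1 = x1,
  x2 = x1 + rnorm(200, sd = 0.05), # nearly a copy of x1
  x3 = rnorm(200)                  # independent of x1
)

corr <- cor(toy)

# Look only at the upper triangle so each pair is reported once
high <- which(abs(corr) > 0.9 & upper.tri(corr), arr.ind = TRUE)

data.frame(
  var1 = rownames(corr)[high[, "row"]],
  var2 = colnames(corr)[high[, "col"]],
  r    = round(corr[high], 3)
)
```

On the real data, the same pattern would apply to `cor(train_clean[, 1:20])`.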
To estimate out‑of‑sample error without using the unlabeled test file
(pml-testing.csv), we split the cleaned training data
into a local training set (75%) and a local validation set (25%).
# Stratified partition split (preserves class proportions)
in_train <- createDataPartition(train_clean$classe, p = 0.75, list = FALSE)
local_train <- train_clean[in_train, ]
local_validation <- train_clean[-in_train, ]
# Define common cross‑validation control (5‑fold CV, parallel enabled)
fit_control <- trainControl(
method = "cv",
number = 5,
allowParallel = TRUE,
verboseIter = FALSE
)
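The idea behind the stratified split above can be sketched in base R by sampling 75% of the rows within each class. This is an illustration on hypothetical labels, not a reimplementation of createDataPartition.

```r
set.seed(12345)

# Hypothetical labels with unequal class sizes
classe <- factor(rep(c("A", "B", "C"), times = c(400, 200, 100)))

# Sample 75% of the row indices separately within each class
idx_by_class <- split(seq_along(classe), classe)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  sample(idx, size = round(0.75 * length(idx)))
}))

# Class proportions are the same in both parts
round(prop.table(table(classe[in_train])), 3)
round(prop.table(table(classe[-in_train])), 3)
```

Because sampling happens per class, a rare class cannot end up missing from the validation part by chance.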
Within each model's training step, caret also runs 5‑fold CV
(fit_control) on the local training set to choose tuning parameters,
so the validation set is never used for tuning.
This two‑level strategy (train/validation split + internal CV) is why we trust the out‑of‑sample error estimate.
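As a rough picture of what 5-fold CV does inside each training call, the sketch below assigns rows to folds and loops over them. It uses plain random fold assignment on a hypothetical 20-row dataset; caret's own fold construction may differ (for example, by balancing classes).

```r
set.seed(12345)
n <- 20

# Randomly assign each row to one of 5 folds of equal size
fold_id <- sample(rep(1:5, length.out = n))

for (k in 1:5) {
  held_out <- which(fold_id == k)  # rows used to score the model
  fit_rows <- which(fold_id != k)  # rows used to fit the model
  # A model would be fit on fit_rows and evaluated on held_out here
  stopifnot(length(intersect(held_out, fit_rows)) == 0)
}

table(fold_id)
```

Every row is held out exactly once, so each candidate tuning value is scored on data it was not fit on.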
We compare four approaches:
- Decision Tree (rpart), as a baseline.
- Random Forest (rf), as our chosen final model.
- Gradient Boosted Trees (gbm).
- Linear Support Vector Machine (svmLinear).
# Train Decision tree model
model_tree <- train(
classe ~ .,
data = local_train,
method = "rpart",
trControl = fit_control
)
# Predict on local validation and compute confusion matrix
pred_tree <- predict(model_tree, newdata = local_validation)
tree_cm <- confusionMatrix(pred_tree, local_validation$classe)
tree_accuracy <- tree_cm$overall["Accuracy"]
# Print confusion matrix and accuracy
print(tree_cm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1252 396 434 343 114
## B 30 317 24 151 132
## C 90 236 397 310 229
## D 0 0 0 0 0
## E 23 0 0 0 426
##
## Overall Statistics
##
## Accuracy : 0.4878
## 95% CI : (0.4737, 0.5019)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3306
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8975 0.33404 0.46433 0.0000 0.47281
## Specificity 0.6332 0.91479 0.78637 1.0000 0.99425
## Pos Pred Value 0.4931 0.48471 0.31458 NaN 0.94878
## Neg Pred Value 0.9395 0.85129 0.87424 0.8361 0.89338
## Prevalence 0.2845 0.19352 0.17435 0.1639 0.18373
## Detection Rate 0.2553 0.06464 0.08095 0.0000 0.08687
## Detection Prevalence 0.5177 0.13336 0.25734 0.0000 0.09156
## Balanced Accuracy 0.7654 0.62441 0.62535 0.5000 0.73353
tree_accuracy
## Accuracy
## 0.4877651
# Plotting the model
plot(model_tree)
The tree’s accuracy is low (about 49%) and it never predicts class D, so it serves only as a baseline.
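The accuracy and per-class sensitivity reported above follow directly from the confusion matrix. The sketch below re-enters the decision-tree counts by hand and recomputes both with base R.

```r
# Counts copied from the decision-tree confusion matrix (rows = predictions)
cm <- matrix(
  c(1252, 396, 434, 343, 114,
      30, 317,  24, 151, 132,
      90, 236, 397, 310, 229,
       0,   0,   0,   0,   0,
      23,   0,   0,   0, 426),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Accuracy: correctly classified cases (diagonal) over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: correct predictions per class over true cases of that class
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)
round(sensitivity, 4)
```

The empty row for class D is what drives its sensitivity to zero.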
# Plot the final rpart tree structure
rpart.plot(
model_tree$finalModel,
main = "Figure 3: Decision Tree Structure (rpart)",
type = 2,
fallen.leaves = TRUE,
cex = 0.6
)
Figure 3: Decision Tree Structure (rpart)
The tree is easy to interpret but reaches only about 49% accuracy on local validation: a single small tree is too coarse to capture the non‑linear, multi‑sensor patterns that separate the five classes.
# Train GBM model
mod_gbm <- train(
classe ~ .,
data = local_train,
method = "gbm",
trControl = fit_control,
tuneLength = 5,
verbose = FALSE
)
# Evaluate GBM Model
pred_gbm <- predict(mod_gbm, local_validation)
cmgbm <- confusionMatrix(pred_gbm, local_validation$classe)
print(cmgbm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1392 2 0 0 0
## B 3 945 2 0 4
## C 0 2 848 14 2
## D 0 0 5 785 3
## E 0 0 0 5 892
##
## Overall Statistics
##
## Accuracy : 0.9914
## 95% CI : (0.9884, 0.9938)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9892
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9978 0.9958 0.9918 0.9764 0.9900
## Specificity 0.9994 0.9977 0.9956 0.9980 0.9988
## Pos Pred Value 0.9986 0.9906 0.9792 0.9899 0.9944
## Neg Pred Value 0.9991 0.9990 0.9983 0.9954 0.9978
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2838 0.1927 0.1729 0.1601 0.1819
## Detection Prevalence 0.2843 0.1945 0.1766 0.1617 0.1829
## Balanced Accuracy 0.9986 0.9968 0.9937 0.9872 0.9944
# Plot GBM results
plot(mod_gbm)
Gradient Boosting builds trees sequentially, each one correcting errors
from the previous trees.
It reaches 99.1% validation accuracy here, but it is more sensitive to
tuning and slower to train than Random Forest.
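The "sequential correction" idea can be sketched with a toy regression example: each small tree is fit to the residuals of the current ensemble. This is an illustration under simplifying assumptions (squared-error loss, made-up data, a hypothetical learning rate of 0.3), not the multi-class procedure that gbm runs in this report.

```r
library(rpart) # recommended package shipped with standard R installations

set.seed(42)
x <- seq(0, 2 * pi, length.out = 200)
toy <- data.frame(x = x, y = sin(x) + rnorm(200, sd = 0.1))

learning_rate <- 0.3
prediction <- rep(mean(toy$y), nrow(toy)) # start from a constant model
mse <- numeric(20)

for (m in 1:20) {
  # Residuals are the errors the current ensemble still makes
  toy$residual <- toy$y - prediction
  stump <- rpart(residual ~ x, data = toy,
                 control = rpart.control(maxdepth = 1))
  # Add a damped correction from the new tree
  prediction <- prediction + learning_rate * predict(stump, toy)
  mse[m] <- mean((toy$y - prediction)^2)
}

round(mse[c(1, 5, 20)], 4)
```

The training error shrinks as trees are added, which is also why boosting needs tuning (number of trees, depth, learning rate) to avoid fitting noise.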
# Train a linear SVM model
mod_svm <- train(
classe ~ .,
data = local_train,
method = "svmLinear",
trControl = fit_control,
tuneLength = 5,
verbose = FALSE
)
# Evaluate SVM Model
pred_svm <- predict(mod_svm, local_validation)
cmsvm <- confusionMatrix(pred_svm, local_validation$classe)
print(cmsvm)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1291 129 75 48 46
## B 24 669 79 35 115
## C 36 59 666 83 55
## D 33 22 25 601 51
## E 11 70 10 37 634
##
## Overall Statistics
##
## Accuracy : 0.7873
## 95% CI : (0.7756, 0.7987)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7296
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9254 0.7050 0.7789 0.7475 0.7037
## Specificity 0.9151 0.9360 0.9425 0.9680 0.9680
## Pos Pred Value 0.8125 0.7256 0.7408 0.8210 0.8320
## Neg Pred Value 0.9686 0.9297 0.9528 0.9513 0.9355
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2633 0.1364 0.1358 0.1226 0.1293
## Detection Prevalence 0.3240 0.1880 0.1833 0.1493 0.1554
## Balanced Accuracy 0.9203 0.8205 0.8607 0.8578 0.8358
A linear SVM separates classes with hyperplanes; it is much less flexible than Random Forest on complex, non‑linear sensor data and reaches only 78.7% validation accuracy here.
# Train Random Forest model
model_rf <- train(
classe ~ .,
data = local_train,
method = "rf",
trControl = fit_control,
ntree = 150
)
# Predict on local validation and compute confusion matrix
pred_rf <- predict(model_rf, newdata = local_validation)
conf_matrix_rf <- confusionMatrix(pred_rf, local_validation$classe)
print(conf_matrix_rf)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 3 0 0 0
## B 1 945 6 0 0
## C 0 1 849 15 1
## D 0 0 0 785 1
## E 0 0 0 4 899
##
## Overall Statistics
##
## Accuracy : 0.9935
## 95% CI : (0.9908, 0.9955)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9917
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9993 0.9958 0.9930 0.9764 0.9978
## Specificity 0.9991 0.9982 0.9958 0.9998 0.9990
## Pos Pred Value 0.9979 0.9926 0.9804 0.9987 0.9956
## Neg Pred Value 0.9997 0.9990 0.9985 0.9954 0.9995
## Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Detection Rate 0.2843 0.1927 0.1731 0.1601 0.1833
## Detection Prevalence 0.2849 0.1941 0.1766 0.1603 0.1841
## Balanced Accuracy 0.9992 0.9970 0.9944 0.9881 0.9984
# Extract accuracy and out‑of‑sample error
accuracy <- conf_matrix_rf$overall["Accuracy"]
out_of_sample_error <- 1 - accuracy
# Plot model
plot(model_rf)
We chose Random Forest because it achieved the highest validation accuracy of the four models (99.35%).
We evaluate all models on the local validation set and compare their expected out‑of‑sample error rates in a single table.
## Accuracy OOS_Error
## Tree 0.488 0.512
## RF 0.993 0.007
## GBM 0.991 0.009
## SVM 0.787 0.213
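The table above can be rebuilt from the diagonals of the four confusion matrices. The sketch below enters the correct-prediction counts by hand; in the report's own session the same accuracies could be read from the stored objects instead, for example `tree_cm$overall["Accuracy"]`.

```r
# Correct predictions (sum of each confusion-matrix diagonal) out of
# 4,904 validation cases, copied from the outputs in this report
n_validation <- 4904
correct <- c(Tree = 2392, RF = 4872, GBM = 4862, SVM = 3861)

accuracy <- correct / n_validation

comparison <- data.frame(
  Accuracy  = round(accuracy, 3),
  OOS_Error = round(1 - accuracy, 3) # expected out-of-sample error
)
comparison
```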
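This table shows that Random Forest has the lowest estimated out‑of‑sample error (0.7%), closely followed by GBM (0.9%), while the linear SVM (21.3%) and the single decision tree (51.2%) are far behind.
This comparison supports Random Forest as the final model, although its 95% confidence interval overlaps GBM's, so the margin between the two is small.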
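One way to gauge how close Random Forest and GBM are is to compare confidence intervals for their validation accuracies. The sketch below assumes exact binomial intervals, which is our understanding of what the confusion-matrix output reports; the counts are copied from the two matrices.

```r
n_validation <- 4904

# Exact binomial 95% intervals for the share of correct predictions
rf_ci  <- binom.test(4872, n_validation)$conf.int
gbm_ci <- binom.test(4862, n_validation)$conf.int

round(rbind(RF = rf_ci, GBM = gbm_ci), 4)

# TRUE when the lower end of the RF interval lies below the upper end of GBM's
rf_ci[1] < gbm_ci[2]
```

When the intervals overlap, the ranking of the two models could plausibly change with a different validation split.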
We now use the final Random Forest model to predict the 20 unlabeled test cases.
# Match test columns to training columns (exclude 'classe')
clean_test_columns <- names(train_clean)[names(train_clean) != "classe"]
final_test_set <- testing_raw[, clean_test_columns]
# Align data types between test and local_train columns
for (col in names(final_test_set)) {
class(final_test_set[[col]]) <- class(local_train[[col]])
}
# Run final predictions on the cleaned test set
final_quiz_predictions <- predict(model_rf, newdata = final_test_set)
# Create submission‑style table (Problem_ID and predicted class)
data.frame(
Problem_ID = testing_raw$problem_id,
Predicted_Class = final_quiz_predictions
)
## Problem_ID Predicted_Class
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B
These 20 predicted classes (A–E) are the
answers for the Coursera 20‑question quiz.
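Before predicting, it is worth checking that the test set really contains every predictor the model was trained on. The sketch below shows the alignment step on two toy data frames; all column values are hypothetical.

```r
train_toy <- data.frame(
  roll_belt  = c(1.4, 1.5),
  pitch_belt = c(8.1, 8.2),
  classe     = factor(c("A", "B"))
)
test_toy <- data.frame(
  problem_id = 1:2,
  pitch_belt = c(8L, 9L),  # stored as integer, unlike the training column
  roll_belt  = c(1.3, 1.6)
)

# Predictor names are the training columns minus the target
predictors <- setdiff(names(train_toy), "classe")

# Fail early if the test set lacks any predictor
missing_cols <- setdiff(predictors, names(test_toy))
stopifnot(length(missing_cols) == 0)

# Select and order the test columns to match training
aligned <- test_toy[, predictors, drop = FALSE]

# Coerce to double wherever the training column is double
for (col in predictors) {
  if (is.double(train_toy[[col]])) aligned[[col]] <- as.double(aligned[[col]])
}
str(aligned)
```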
# De‑allocate parallel cluster and revert to sequential backend
stopCluster(cluster)
registerDoSEQ()
# Print confirmation message
cat("Process complete. Parallel cluster closed successfully.\n")
## Process complete. Parallel cluster closed successfully.