Executive Summary

This project develops a machine learning model to classify the quality of unilateral dumbbell biceps curls based on accelerometer data. Using a dataset of 19,622 observations, we compared classification approaches and found that a Random Forest model provided the highest predictive accuracy (99.47% on a held-out validation set). This report details the data preprocessing steps, feature selection rationale, and model validation results.

classes <- c("A", "B", "C", "D", "E")
description <- c("Exactly according to the specification", "Throwing the elbows to the front", "Lifting the dumbbell only halfway", "Lowering the dumbbell only halfway", "Throwing the hips to the front")
status <- c("Correct", "Incorrect", "Incorrect", "Incorrect", "Incorrect")

# Movement class types
movement_classes <- data.frame(Class = classes, Description = description, Status = status)
knitr::kable(movement_classes)
|Class |Description                            |Status    |
|:-----|:--------------------------------------|:---------|
|A     |Exactly according to the specification |Correct   |
|B     |Throwing the elbows to the front       |Incorrect |
|C     |Lifting the dumbbell only halfway      |Incorrect |
|D     |Lowering the dumbbell only halfway     |Incorrect |
|E     |Throwing the hips to the front         |Incorrect |

Goal: Predict the classe variable using measurements from sensors on the belt, forearm, arm, and dumbbell.

1. Data Loading and Initial Exploration

1.1 Load Required Libraries

# Data manipulation and visualization
library(tidyverse)

# Modeling and table output
library(caret)
library(randomForest)
library(knitr)  # provides kable()

# Set seed for reproducibility
set.seed(42)

1.2 Examine Dataset Structure

# Load datasets
training_raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing_raw <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))

# Check dimensions
dim_info <- data.frame(Dataset = c("Training", "Testing"), 
                       Rows = c(nrow(training_raw), nrow(testing_raw)),
                       Cols = c(ncol(training_raw), ncol(testing_raw)))
kable(dim_info)
|Dataset  |  Rows| Cols|
|:--------|-----:|----:|
|Training | 19622|  160|
|Testing  |    20|  160|

1.3 Visualize Class Distribution

# Create bar plot of classe distribution
ggplot(training_raw, aes(x = classe, fill = classe)) +
  geom_bar(alpha = 0.7) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Activity Classes",
       subtitle = "Training Dataset",
       x = "Class",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The classes are reasonably balanced, which is favorable for classification. Class A (correct execution) is the most frequent, at roughly 28% of observations, while the error classes (B-E) each account for about 16-19%.

2. Data Preprocessing

To help the model generalize, we applied three cleaning steps:

  1. Metadata Removal: The row index, user_name, timestamp, and window variables were removed so that the model cannot overfit to specific individuals or time periods.
  2. Sparsity Filtering: Features with more than 95% missing values were discarded.
  3. Near-Zero Variance (NZV): Features with almost no variation, which carry little predictive power, were removed.

2.1 Remove Non-Predictive Features

The row index (X) and the timestamp, window, and user identification variables are not sensor measurements, so they are excluded from the predictors.

cols_to_remove <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", 
                    "cvtd_timestamp", "new_window", "num_window")
train_clean <- training_raw %>% select(-all_of(cols_to_remove))

2.2 Handle Missing Values

Variables with excessive missing data (>95%) provide little predictive value and introduce noise.

# Fraction of missing values per column, from 0 to 1
# (na.strings in read.csv already mapped "NA", "#DIV/0!", and "" to NA)
missing_pct <- colMeans(is.na(train_clean))

# Visualize missing data
missing_df <- data.frame(
  column = names(missing_pct),
  missing_pct = missing_pct * 100
) %>% 
  arrange(desc(missing_pct)) %>%
  filter(missing_pct > 0)

# Plot top 30 columns with missing data
ggplot(head(missing_df, 30), aes(x = reorder(column, missing_pct), y = missing_pct)) +
  geom_col(fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Missing Data by Feature (Top 30)",
       x = "Feature",
       y = "Missing Percentage (%)") +
  theme_minimal()

# Remove columns with >95% missing values (missing_pct is a fraction, so the cutoff is 0.95)
high_missing_cols <- names(missing_pct[missing_pct > 0.95])
train_clean <- train_clean %>% select(-all_of(high_missing_cols))

2.3 Remove Near-Zero Variance Predictors

Variables with near-zero variance provide minimal discriminatory power and can cause numerical instability.

# Identify near-zero variance predictors
nzv <- nearZeroVar(train_clean, saveMetrics = TRUE)
nzv_cols <- rownames(nzv)[nzv$nzv]

if (length(nzv_cols) > 0) {
  train_clean <- train_clean %>% select(-all_of(nzv_cols))
}
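
For intuition, the two statistics behind the near-zero-variance flag can be computed by hand. The sketch below uses a toy data frame with hypothetical column names and assumes caret's default cutoffs (freqCut = 95/5, uniqueCut = 10). It illustrates the idea only and is not a replacement for nearZeroVar().

```r
# Toy data: one informative column and one almost-constant column (hypothetical names)
toy <- data.frame(
  sensor_a = (1:100) / 10,       # 100 distinct values
  flag_b   = c(rep(0, 99), 1)    # 99 zeros and a single 1
)

# Frequency ratio: count of the most common value over the second most common
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) < 2) return(Inf)  # constant column: treated as degenerate in this sketch
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct values relative to the number of rows
pct_unique <- function(x) 100 * length(unique(x)) / length(x)

metrics <- data.frame(
  freqRatio     = sapply(toy, freq_ratio),
  percentUnique = sapply(toy, pct_unique)
)

# Assumed cutoffs, mirroring caret's defaults (freqCut = 95/5, uniqueCut = 10)
metrics$nzv <- metrics$freqRatio > 95 / 5 & metrics$percentUnique <= 10
print(metrics)
```

Only flag_b is flagged here: its dominant value is 99 times as frequent as the runner-up and it has just two distinct values in 100 rows.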

2.4 Align Test Set With Training Features

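# Keep only the columns retained in train_clean; any_of() silently skips
# names that are absent from the test file (such as classe)
testing_clean <- testing_raw %>% select(any_of(names(train_clean)))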

3. Data Splitting for Model Validation

We hold out 25% of the training data as a validation set to assess model performance before predicting the final test cases.

# Data Splitting (75/25)
train_clean$classe <- as.factor(train_clean$classe)

inTrain <- createDataPartition(train_clean$classe, p = 0.75, list = FALSE)
train_set <- train_clean[inTrain, ]
valid_set <- train_clean[-inTrain, ]
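
createDataPartition samples within each level of the outcome, so both parts of the split keep roughly the same class proportions. A minimal base-R sketch of that idea follows, using synthetic labels with made-up class counts. It is an illustration under those assumptions, not caret's exact implementation.

```r
set.seed(42)

# Synthetic labels with unequal class sizes (made-up counts, for illustration only)
labels <- factor(rep(c("A", "B", "C"), times = c(500, 300, 200)))

# Sample 75% of the row indices within each class
idx_by_class <- split(seq_along(labels), labels)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}), use.names = FALSE)

# Class proportions in the full data and in each part of the split
proportions <- round(rbind(
  full  = prop.table(table(labels)),
  train = prop.table(table(labels[train_idx])),
  valid = prop.table(table(labels[-train_idx]))
), 3)
print(proportions)
```

Because sampling happens per class, the three rows of the printed table agree, which is the property a stratified 75/25 split is meant to guarantee.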

4. Model Training: Random Forest

4.1 Random Forest Theory and Assumptions

Algorithm: Random Forest builds multiple decision trees using bootstrap samples and random feature subsets, then aggregates predictions through majority voting.
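
The two ingredients named above, bootstrap sampling and majority voting, can be sketched in a few lines of base R. The tree votes below are hypothetical and no trees are actually grown; the per-split feature sampling is also omitted.

```r
set.seed(42)
n <- 1000

# A bootstrap sample draws n row indices with replacement;
# rows that are never drawn are "out-of-bag" (OOB) for that tree
boot_idx <- sample.int(n, size = n, replace = TRUE)
oob_frac <- mean(!(seq_len(n) %in% boot_idx))
print(oob_frac)  # close to exp(-1), roughly 0.37, for large n

# Hypothetical votes from five trees (columns) for three observations (rows)
votes <- matrix(c("A", "A", "B", "A", "C",
                  "B", "B", "B", "D", "B",
                  "E", "C", "E", "E", "C"),
                nrow = 3, byrow = TRUE)

# Majority vote: the most frequent class in each row
# (in this sketch, ties would go to the alphabetically first class)
majority <- apply(votes, 1, function(v) names(which.max(table(v))))
print(majority)  # "A" "B" "E"
```

The out-of-bag rows are what make the OOB error estimate possible: each observation can be scored using only the trees that never saw it.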

Assumptions:

  • Features need not be independent of one another (no independence assumption is required)
  • The ensemble relies on averaging many decorrelated trees to reduce overfitting
  • No distributional assumptions on features

Advantages:

  • Handles non-linear relationships
  • Robust to outliers
  • Provides feature importance measures
  • Minimal hyperparameter tuning required

4.2 Fit the Model

# Train Random Forest with 100 trees
rf_model <- randomForest(classe ~ ., data = train_set, ntree = 100)
print(rf_model)
## 
## Call:
##  randomForest(formula = classe ~ ., data = train_set, ntree = 100) 
##                Type of random forest: classification
##                      Number of trees: 100
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.57%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4181    2    0    2    0 0.0009557945
## B   19 2820    9    0    0 0.0098314607
## C    0   17 2546    4    0 0.0081807557
## D    0    0   21 2390    1 0.0091210614
## E    0    0    2    7 2697 0.0033259424
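
As a check on the printed summary, the per-class errors and the overall OOB error rate can be recomputed from the confusion matrix alone. The sketch below re-enters the counts by hand; rows are the actual classes and columns the predicted ones.

```r
# OOB confusion matrix copied from the model summary (rows = actual, columns = predicted)
oob <- matrix(c(4181,    2,    0,    2,    0,
                  19, 2820,    9,    0,    0,
                   0,   17, 2546,    4,    0,
                   0,    0,   21, 2390,    1,
                   0,    0,    2,    7, 2697),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5]))

# Per-class error: share of each row that falls off the diagonal
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal counts over the total
oob_error <- 1 - sum(diag(oob)) / sum(oob)

print(round(class_error, 4))
print(round(100 * oob_error, 2))  # 0.57, matching the reported OOB estimate
```

In total, 84 of the 14,718 training observations are misclassified out-of-bag, and most of the confusion is between neighboring classes (for example B predicted as A, or D predicted as C).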

5. Evaluation

5.1 Feature Importance

The following plot shows the top 20 predictors ranked by mean decrease in Gini impurity, the importance measure randomForest reports by default.

varImpPlot(rf_model, main = "Top Features by Gini Importance", n.var = 20)

5.2 Validation Set Performance

We apply the model to the validation set to estimate the out-of-sample error.

# Predictions
rf_pred <- predict(rf_model, valid_set)
conf_mat <- confusionMatrix(rf_pred, valid_set$classe)

# Display accuracy and kappa (unname() prevents duplicated row labels in the table)
accuracy_results <- data.frame(Metric = c("Accuracy", "Kappa"),
                               Value = unname(conf_mat$overall[c("Accuracy", "Kappa")]))
kable(accuracy_results)
|Metric   |     Value|
|:--------|---------:|
|Accuracy | 0.9946982|
|Kappa    | 0.9932930|

Interpretation: With a validation accuracy of 99.47%, the estimated out-of-sample error is 0.53%, in line with the 0.57% OOB estimate obtained during training.
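
The error figure follows directly from the accuracy, and a confidence interval gives a sense of its precision. The sketch below is a rough check rather than part of the original analysis: the validation size is derived as 19,622 total rows minus the 14,718 training rows in the OOB matrix, and the number of correct predictions is inferred from the reported accuracy.

```r
n_valid   <- 19622 - 14718               # rows held out for validation (derived, not printed in the report)
n_correct <- round(0.9946982 * n_valid)  # correct predictions implied by the reported accuracy

accuracy  <- n_correct / n_valid
oos_error <- 1 - accuracy

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_valid)$conf.int

print(round(100 * c(accuracy = accuracy, error = oos_error), 2))
print(round(100 * as.numeric(ci), 2))
```

confusionMatrix() reports a comparable interval in conf_mat$overall (AccuracyLower and AccuracyUpper), so in practice the values can be read from there instead.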

6. Final Test Predictions

We apply the final model to the 20 provided test cases.

test_final_preds <- predict(rf_model, testing_clean)
final_results <- data.frame(Problem_ID = 1:20, Predicted_Classe = test_final_preds)
kable(t(final_results)) 
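|                 |1  |2  |3  |4  |5  |6  |7  |8  |9  |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |20 |
|:----------------|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|:--|
|Problem_ID       |1  |2  |3  |4  |5  |6  |7  |8  |9  |10 |11 |12 |13 |14 |15 |16 |17 |18 |19 |20 |
|Predicted_Classe |B  |A  |B  |A  |A  |E  |D  |B  |A  |A  |B  |C  |B  |A  |E  |E  |A  |B  |B  |B  |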

7. Conclusion

The Random Forest model is highly effective for this classification task, reaching 99.47% accuracy on the held-out validation set. The primary drivers of movement quality prediction appear to be the sensors located on the belt and dumbbell, suggesting that these locations capture the most significant deviations in weightlifting form.