This project develops a machine learning model to classify the quality of unilateral dumbbell biceps curls based on accelerometer data. Using a dataset of 19,622 observations, we compared classification approaches and found that a Random Forest model provided the highest predictive accuracy (>99%). This report details the data preprocessing steps, feature selection rationale, and model validation results.
# Load required packages
library(knitr)
library(ggplot2)
library(dplyr)
library(caret)
library(randomForest)

# Movement class types
classes <- c("A", "B", "C", "D", "E")
description <- c("Exactly according to the specification", "Throwing the elbows to the front", "Lifting the dumbbell only halfway", "Lowering the dumbbell only halfway", "Throwing the hips to the front")
status <- c("Correct", "Incorrect", "Incorrect", "Incorrect", "Incorrect")

movement_classes <- data.frame(Class = classes, Description = description, Status = status)
kable(movement_classes)

| Class | Description | Status |
|---|---|---|
| A | Exactly according to the specification | Correct |
| B | Throwing the elbows to the front | Incorrect |
| C | Lifting the dumbbell only halfway | Incorrect |
| D | Lowering the dumbbell only halfway | Incorrect |
| E | Throwing the hips to the front | Incorrect |
Goal: Predict the `classe` variable using measurements from sensors on the belt, forearm, arm, and dumbbell.
# Load datasets
training_raw <- read.csv("pml-training.csv", na.strings = c("NA", "#DIV/0!", ""))
testing_raw <- read.csv("pml-testing.csv", na.strings = c("NA", "#DIV/0!", ""))
# Check dimensions
dim_info <- data.frame(Dataset = c("Training", "Testing"),
                       Rows = c(nrow(training_raw), nrow(testing_raw)),
                       Cols = c(ncol(training_raw), ncol(testing_raw)))
kable(dim_info)

| Dataset | Rows | Cols |
|---|---|---|
| Training | 19622 | 160 |
| Testing | 20 | 160 |
# Create bar plot of classe distribution
ggplot(training_raw, aes(x = classe, fill = classe)) +
  geom_bar(alpha = 0.7) +
  geom_text(stat = 'count', aes(label = after_stat(count)), vjust = -0.5) +
  labs(title = "Distribution of Activity Classes",
       subtitle = "Training Dataset",
       x = "Class",
       y = "Count") +
  theme_minimal() +
  theme(legend.position = "none")

Interpretation: The classes are relatively balanced, which is favorable for classification. Class A (correct execution) is slightly more frequent than the error classes (B-E).
To improve the model's ability to generalize, we performed three cleaning steps.

First, timestamp, window, and user identification variables are not sensor measurements and therefore should not be used as predictors.
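A minimal sketch of this step, assuming the standard pml-training.csv layout in which the first seven columns hold the identification data (the column names below are taken from that layout):

# Drop identification, timestamp, and window columns (not sensor measurements)
id_cols <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
             "cvtd_timestamp", "new_window", "num_window")
train_clean <- training_raw[, !(names(training_raw) %in% id_cols)]
testing_clean <- testing_raw[, !(names(testing_raw) %in% id_cols)]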
Second, variables with excessive missing data (more than 95% missing) provide little predictive value and introduce noise.
# Calculate missing value percentage for each column
# Empty strings and "#DIV/0!" were already converted to NA via na.strings at
# load time, so a plain NA check captures all missing values
missing_pct <- colMeans(is.na(train_clean))
# Visualize missing data
missing_df <- data.frame(
  column = names(missing_pct),
  missing_pct = missing_pct * 100
) %>%
  arrange(desc(missing_pct)) %>%
  filter(missing_pct > 0)
# Plot top 30 columns with missing data
ggplot(head(missing_df, 30), aes(x = reorder(column, missing_pct), y = missing_pct)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.7) +
  coord_flip() +
  labs(title = "Missing Data by Feature (Top 30)",
       x = "Feature",
       y = "Missing Percentage (%)") +
  theme_minimal()

Third, variables with near-zero variance provide minimal discriminatory power and can cause numerical instability.
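A sketch of how the second and third filters could be applied, using the 95% threshold from the text and caret's nearZeroVar(); applying the same column selection to the test set is an assumption about how the final predictions were produced:

# Drop columns with more than 95% missing values
train_clean <- train_clean[, missing_pct <= 0.95]

# Drop near-zero-variance predictors flagged by caret
nzv <- nearZeroVar(train_clean)
if (length(nzv) > 0) train_clean <- train_clean[, -nzv]

# Keep the same predictor columns in the test set (which has no classe column)
testing_clean <- testing_clean[, intersect(names(train_clean), names(testing_clean))]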
Now we create a validation set from the training data to assess model performance before final testing.
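A sketch of the partition, assuming caret's createDataPartition() with a 75/25 split (consistent with the 14,718 training rows implied by the model's confusion matrix below); the seed value is arbitrary:

set.seed(12345)  # arbitrary seed, for reproducibility only

# Stratified 75/25 split on the outcome variable
in_train <- createDataPartition(train_clean$classe, p = 0.75, list = FALSE)
train_set <- train_clean[in_train, ]
valid_set <- train_clean[-in_train, ]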
Algorithm: Random Forest builds multiple decision trees using bootstrap samples and random feature subsets, then aggregates predictions through majority voting.
Assumptions: Random Forest is largely non-parametric; it makes no distributional assumptions about the predictors and requires no feature scaling, though it does assume the training observations are representative of future data.

Advantages: robustness to overfitting through ensemble averaging, the ability to capture non-linear relationships and feature interactions, built-in variable importance measures, and strong out-of-the-box performance on high-dimensional data.
# Train Random Forest with 100 trees
rf_model <- randomForest(classe ~ ., data = train_set, ntree = 100)
print(rf_model)

##
## Call:
## randomForest(formula = classe ~ ., data = train_set, ntree = 100)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.57%
## Confusion matrix:
## A B C D E class.error
## A 4181 2 0 2 0 0.0009557945
## B 19 2820 9 0 0 0.0098314607
## C 0 17 2546 4 0 0.0081807557
## D 0 0 21 2390 1 0.0091210614
## E 0 0 2 7 2697 0.0033259424
The following plot shows the top 20 predictors contributing most to the model’s accuracy.
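The plot can be produced with the randomForest package's built-in importance plot; a minimal sketch (the title text is illustrative):

# Plot the 20 most influential predictors (mean decrease in Gini impurity)
varImpPlot(rf_model, n.var = 20, main = "Top 20 Variable Importance")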
We apply the model to the validation set to estimate the out-of-sample error.
# Predictions
rf_pred <- predict(rf_model, valid_set)
conf_mat <- confusionMatrix(rf_pred, valid_set$classe)
# Display Accuracy
accuracy_results <- data.frame(Metric = c("Accuracy", "Kappa"),
                               Value = c(conf_mat$overall['Accuracy'], conf_mat$overall['Kappa']))
kable(accuracy_results, row.names = FALSE)

| Metric | Value |
|---|---|
| Accuracy | 0.9946982 |
| Kappa | 0.9932930 |
Interpretation: With an accuracy of 99.47%, the estimated out-of-sample error is 0.53%.
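As a one-line check, the error estimate is simply the complement of the validation accuracy:

# Estimated out-of-sample error = 1 - validation accuracy (~0.0053, i.e. 0.53%)
oos_error <- 1 - conf_mat$overall['Accuracy']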
Finally, we apply the model to the 20 test cases provided in the original study.
test_final_preds <- predict(rf_model, testing_clean)
final_results <- data.frame(Problem_ID = 1:20, Predicted_Classe = test_final_preds)
kable(t(final_results))

|  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Problem_ID | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 |
| Predicted_Classe | B | A | B | A | A | E | D | B | A | A | B | C | B | A | E | E | A | B | B | B |
The Random Forest model is highly effective for this classification task. The primary drivers of movement quality prediction are sensors located on the belt and dumbbell, suggesting these areas capture the most significant deviations in weightlifting form.