This analysis predicts the manner in which participants performed barbell lifts using accelerometer data. The goal is to classify whether each exercise was performed correctly (Class A) or with one of four common mistakes (Classes B-E). A Random Forest model achieved 99.5% accuracy on a held-out validation set, with an estimated out-of-sample error rate of about 0.5%.
Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.
One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the quality of barbell lifts.
The participants were asked to perform barbell lifts correctly and incorrectly in five different ways:

- Class A: exactly according to the specification
- Class B: throwing the elbows to the front
- Class C: lifting the dumbbell only halfway
- Class D: lowering the dumbbell only halfway
- Class E: throwing the hips to the front
# Load required packages
library(caret)          # createDataPartition, train, trainControl, confusionMatrix, nearZeroVar, varImp
library(randomForest)   # used by caret's method = "rf"

# Load the data - using pre-loaded datasets from environment
if (exists("train_data") && exists("test_data")) {
  training <- train_data
  testing  <- test_data
} else {
  # Fall back to reading the CSV files if not pre-loaded
  training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
  testing  <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
}
# Basic data exploration
cat("Training set dimensions:", dim(training), "\n")
## Training set dimensions: 19622 160
cat("Testing set dimensions:", dim(testing), "\n")
## Testing set dimensions: 20 160
cat("\nClasse distribution in training set:\n")
##
## Classe distribution in training set:
print(table(training$classe))
##
## A B C D E
## 5580 3797 3422 3216 3607
# Check for missing values
na_count <- sapply(training, function(x) sum(is.na(x)))
na_percent <- na_count / nrow(training) * 100
# Variables with high missing value percentage
high_na_vars <- names(na_percent[na_percent > 90])
cat("Variables with >90% missing values:", length(high_na_vars), "\n")
## Variables with >90% missing values: 100
# Variables with low/no missing values
low_na_vars <- names(na_percent[na_percent < 5])
cat("Variables with <5% missing values:", length(low_na_vars), "\n")
## Variables with <5% missing values: 60
# Show structure of some key variables
str(training[, c("user_name", "classe", "num_window", "roll_belt", "pitch_belt", "yaw_belt")])
## 'data.frame': 19622 obs. of 6 variables:
## $ user_name : chr "carlitos" "carlitos" "carlitos" "carlitos" ...
## $ classe : chr "A" "A" "A" "A" ...
## $ num_window: int 11 11 11 12 12 12 12 12 12 12 ...
## $ roll_belt : num 1.41 1.41 1.42 1.48 1.48 1.45 1.42 1.42 1.43 1.45 ...
## $ pitch_belt: num 8.07 8.07 8.07 8.05 8.07 8.06 8.09 8.13 8.16 8.17 ...
## $ yaw_belt : num -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 -94.4 ...
# Focus on complete cases for modeling
complete_vars <- names(training)[na_count == 0]
# Remove non-predictive variables
remove_vars <- c("X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
                 "cvtd_timestamp", "new_window", "num_window")
complete_vars <- complete_vars[!complete_vars %in% remove_vars]
cat("Final number of predictive variables:", length(complete_vars) - 1, "\n") # -1 for classe
## Final number of predictive variables: 52
cat("Variables selected for modeling:\n")
## Variables selected for modeling:
print(head(complete_vars, 10))
## [1] "roll_belt" "pitch_belt" "yaw_belt" "total_accel_belt"
## [5] "gyros_belt_x" "gyros_belt_y" "gyros_belt_z" "accel_belt_x"
## [9] "accel_belt_y" "accel_belt_z"
# Create clean training dataset with complete variables only
training_clean <- training[, complete_vars]
testing_clean <- testing[, complete_vars[complete_vars != "classe"]] # testing doesn't have classe
# Ensure classe is a factor
training_clean$classe <- as.factor(training_clean$classe)
# Check for near zero variance predictors
nzv <- nearZeroVar(training_clean[, -ncol(training_clean)], saveMetrics = TRUE)
nzv_vars <- rownames(nzv[nzv$nzv == TRUE, ])
cat("Near zero variance variables:", length(nzv_vars), "\n")
## Near zero variance variables: 0
# Remove near zero variance variables if any
if (length(nzv_vars) > 0) {
training_clean <- training_clean[, !names(training_clean) %in% nzv_vars]
testing_clean <- testing_clean[, !names(testing_clean) %in% nzv_vars]
}
cat("Final training set dimensions:", dim(training_clean), "\n")
## Final training set dimensions: 19622 53
cat("Final testing set dimensions:", dim(testing_clean), "\n")
## Final testing set dimensions: 20 52
# Create data partition for cross-validation
set.seed(12345)
inTrain <- createDataPartition(training_clean$classe, p = 0.7, list = FALSE)
train_set <- training_clean[inTrain, ]
validation_set <- training_clean[-inTrain, ]
cat("Training set size:", nrow(train_set), "\n")
## Training set size: 13737
cat("Validation set size:", nrow(validation_set), "\n")
## Validation set size: 5885
# Set up cross-validation control
ctrl <- trainControl(method = "cv", number = 5, verboseIter = FALSE)
# Train Random Forest model
cat("Training Random Forest model...\n")
## Training Random Forest model...
rf_model <- train(classe ~ .,
                  data = train_set,
                  method = "rf",
                  trControl = ctrl,
                  ntree = 100,       # reduced for faster computation
                  importance = TRUE)
print(rf_model)
## Random Forest
##
## 13737 samples
## 52 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 10987, 10990, 10990, 10991, 10990
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9895171 0.9867390
## 27 0.9893718 0.9865556
## 52 0.9836219 0.9792811
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
print(rf_model$finalModel)
##
## Call:
## randomForest(x = x, y = y, ntree = 100, mtry = min(param$mtry, ncol(x)), importance = TRUE)
## Type of random forest: classification
## Number of trees: 100
## No. of variables tried at each split: 2
##
## OOB estimate of error rate: 0.86%
## Confusion matrix:
## A B C D E class.error
## A 3901 4 1 0 0 0.001280082
## B 20 2630 8 0 0 0.010534236
## C 0 23 2365 8 0 0.012938230
## D 1 0 43 2205 3 0.020870337
## E 0 1 2 4 2518 0.002772277
# Make predictions on validation set
rf_pred <- predict(rf_model, validation_set)
# Calculate confusion matrix
conf_matrix <- confusionMatrix(rf_pred, validation_set$classe)
print(conf_matrix)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1672 5 0 0 0
## B 2 1134 3 0 0
## C 0 0 1023 21 0
## D 0 0 0 942 0
## E 0 0 0 1 1082
##
## Overall Statistics
##
## Accuracy : 0.9946
## 95% CI : (0.9923, 0.9963)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9931
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9988 0.9956 0.9971 0.9772 1.0000
## Specificity 0.9988 0.9989 0.9957 1.0000 0.9998
## Pos Pred Value 0.9970 0.9956 0.9799 1.0000 0.9991
## Neg Pred Value 0.9995 0.9989 0.9994 0.9955 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2841 0.1927 0.1738 0.1601 0.1839
## Detection Prevalence 0.2850 0.1935 0.1774 0.1601 0.1840
## Balanced Accuracy 0.9988 0.9973 0.9964 0.9886 0.9999
# Calculate accuracy and error rates
accuracy <- conf_matrix$overall['Accuracy']
out_of_sample_error <- 1 - accuracy
cat("\n=== MODEL PERFORMANCE SUMMARY ===\n")
##
## === MODEL PERFORMANCE SUMMARY ===
cat("Validation Set Accuracy:", round(accuracy * 100, 2), "%\n")
## Validation Set Accuracy: 99.46 %
cat("Estimated Out-of-Sample Error:", round(out_of_sample_error * 100, 2), "%\n")
## Estimated Out-of-Sample Error: 0.54 %
# Plot variable importance
varImp_plot <- plot(varImp(rf_model), top = 20, main = "Top 20 Most Important Variables")
print(varImp_plot)
# Make predictions on the test set
test_predictions <- predict(rf_model, testing_clean)
cat("Predictions for the 20 test cases:\n")
## Predictions for the 20 test cases:
for(i in 1:length(test_predictions)) {
cat("Test Case", i, ":", as.character(test_predictions[i]), "\n")
}
## Test Case 1 : B
## Test Case 2 : A
## Test Case 3 : B
## Test Case 4 : A
## Test Case 5 : A
## Test Case 6 : E
## Test Case 7 : D
## Test Case 8 : B
## Test Case 9 : A
## Test Case 10 : A
## Test Case 11 : B
## Test Case 12 : C
## Test Case 13 : B
## Test Case 14 : A
## Test Case 15 : E
## Test Case 16 : E
## Test Case 17 : A
## Test Case 18 : B
## Test Case 19 : B
## Test Case 20 : B
# Create a data frame with results
results_df <- data.frame(
  Problem_ID = 1:20,
  Predicted_Class = as.character(test_predictions)
)
print(results_df)
## Problem_ID Predicted_Class
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B
The Random Forest model demonstrates excellent performance:

- Validation-set accuracy: 99.46% (95% CI: 99.23%-99.63%)
- Kappa: 0.993
- Out-of-bag error estimate on the training folds: 0.86%

Random Forest was chosen as the primary algorithm for several reasons:

- It handles a large number of numeric, partly correlated predictors without pre-processing.
- It captures non-linear relationships and interactions between sensor readings automatically.
- It is robust to noise and outliers in the accelerometer data.
- It provides variable-importance measures that aid interpretation.
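To illustrate why a simpler single-tree model would not suffice here, a minimal comparison sketch is shown below. It is not part of the original analysis; it assumes the train_set, validation_set, ctrl, and rf_pred objects created above, and uses caret's "rpart" method as a hypothetical baseline.

# Hypothetical baseline: a single classification tree trained with the same CV control
set.seed(12345)
tree_model <- train(classe ~ ., data = train_set, method = "rpart", trControl = ctrl)
tree_pred <- predict(tree_model, validation_set)
# Compare validation-set accuracy of the single tree and the Random Forest
data.frame(model = c("rpart", "random forest"),
           accuracy = c(mean(tree_pred == validation_set$classe),
                        mean(rf_pred == validation_set$classe)))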
The model uses 5-fold cross-validation to ensure robust performance estimates. Additionally, a separate validation set (30% of training data) was used for final model evaluation to provide an unbiased estimate of out-of-sample performance.
Based on the validation set performance, the expected out-of-sample error rate is approximately 0.5%, indicating that the model should correctly classify about 99.5% of new, unseen cases.
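As a sanity check on that figure, the uncertainty around the error estimate can be quantified directly from the validation-set misclassifications. The sketch below is not part of the original analysis; it assumes the rf_pred and validation_set objects created above.

# Exact binomial 95% CI for the out-of-sample error, based on the number of
# misclassified validation cases (mirrors the accuracy CI from confusionMatrix)
misclassified <- sum(rf_pred != validation_set$classe)
binom.test(misclassified, nrow(validation_set))$conf.int
# Roughly 0.4%-0.8%, consistent with the ~0.5% point estimate above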