Project Report: Exercise Classification Prediction

Objective

The goal of this project is to build a predictive model that classifies the manner in which an exercise was performed based on various sensor measurements and timestamps. The target variable is classe, which represents the exercise class, and the features used for prediction include sensor measurements and time-related information.

Data Preparation

Loading Data

We start by loading the training and testing datasets:

# Load necessary libraries
library(lubridate)
library(fastDummies)
library(randomForest)
library(dplyr)

# Read the training and test data
train.data <- read.csv("pml-training.csv")
test.data <- read.csv("pml-testing.csv")

Handling Missing Values

We calculate the percentage of missing values (NA or empty string) for each column and keep only the columns that have none:

# Calculate the percentage of missing values for each column
missing_percentage <- sapply(train.data, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})

# Keep only the columns with no missing values (missing percentage equal to 0)
df <- train.data[, missing_percentage == 0]
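
To make the missing-value rule concrete, here is a small sketch on a made-up data frame. The column names `a`, `b`, and `c` are illustrative and do not come from the dataset. It applies the same calculation and shows which columns would be kept.

```r
# Toy data: 'a' is complete, 'b' has an NA, 'c' has an empty string
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(1, NA, 3, 4),
  c = c("x", "", "y", "z"),
  stringsAsFactors = FALSE
)

# Same rule as above: NA or empty string counts as missing
toy_missing <- sapply(toy, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})
print(toy_missing)

# Keep only the complete columns; drop = FALSE keeps a data frame
# even when a single column survives
toy_complete <- toy[, toy_missing == 0, drop = FALSE]
print(colnames(toy_complete))
```

Adding `drop = FALSE` is a defensive choice for the toy case, where only one column remains. With many complete columns, as in the training data, the result is a data frame either way.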

One-Hot Encoding

We apply one-hot encoding to categorical columns to convert them into numerical values:

# Positions of the categorical columns to one-hot encode
# (expected to be 'user_name' and 'new_window' after the filtering step)
columns_to_encode <- c(2, 6)
column_names <- colnames(df)[columns_to_encode]

# Apply one-hot encoding to the specified columns
df <- fastDummies::dummy_cols(df, select_columns = column_names, remove_first_dummy = TRUE)

# Remove the original columns after encoding
df <- df[, !colnames(df) %in% column_names]
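
Because the training and test sets are encoded separately, a category that is present in one set but absent in the other yields different dummy columns. One way to avoid this, sketched below in base R rather than with fastDummies, is to build the indicator columns from a fixed list of levels. The helper `encode_with_levels` and the sample values are hypothetical and serve only as an illustration.

```r
# Hypothetical helper: build 0/1 indicator columns from a fixed set of
# levels, dropping the first level as the reference category
encode_with_levels <- function(values, levels, prefix) {
  kept <- levels[-1]
  out <- lapply(kept, function(lv) as.integer(values == lv))
  names(out) <- paste(prefix, kept, sep = "_")
  data.frame(out, check.names = FALSE)
}

# Levels are fixed once, from the training data
train_levels <- c("no", "yes")

# Training values contain both levels; test values contain only "no"
train_dummies <- encode_with_levels(c("no", "yes", "no"), train_levels, "new_window")
test_dummies  <- encode_with_levels(c("no", "no"), train_levels, "new_window")

print(colnames(train_dummies))
print(colnames(test_dummies))
```

Both calls produce a `new_window_yes` column, so no renaming is needed afterwards even when the test data contains a single category.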

Feature Engineering

We convert the timestamp column into date-time components:

# Convert 'cvtd_timestamp' column to datetime object
df$cvtd_timestamp <- dmy_hm(df$cvtd_timestamp)

# Extract day, month, year, hour, and minute from the datetime object
df$day <- day(df$cvtd_timestamp)
df$month <- month(df$cvtd_timestamp)
df$year <- year(df$cvtd_timestamp)
df$hour <- hour(df$cvtd_timestamp)
df$minute <- minute(df$cvtd_timestamp)

# Remove the original 'cvtd_timestamp' column
df <- df[, !colnames(df) %in% "cvtd_timestamp"]
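
As a quick check of what these components look like, the sketch below parses one sample timestamp in the day/month/year hour:minute layout that `dmy_hm` expects. It uses base R instead of lubridate so it runs without extra packages. The sample value is illustrative, and UTC is assumed as the time zone.

```r
# Illustrative timestamp in "dd/mm/yyyy HH:MM" form
sample_ts <- "05/12/2011 11:23"

# Base R counterpart of the parsing step (UTC assumed)
parsed <- as.POSIXct(sample_ts, format = "%d/%m/%Y %H:%M", tz = "UTC")

# Extract the same five components as integers
parts <- c(
  day    = as.integer(format(parsed, "%d")),
  month  = as.integer(format(parsed, "%m")),
  year   = as.integer(format(parsed, "%Y")),
  hour   = as.integer(format(parsed, "%H")),
  minute = as.integer(format(parsed, "%M"))
)
print(parts)
```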

Data Cleaning

We remove redundant or unnecessary columns:

# Remove columns 'raw_timestamp_part_1', 'raw_timestamp_part_2', and 'X'
df <- df[, !(colnames(df) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2", "X"))]

Response Variable

We extract the response variable and convert it to a factor:

# Create a vector containing the 'classe' column
response <- train.data$classe
response <- as.factor(response)

# Remove the 'classe' column from df (since it's in the response vector)
df <- df[, !colnames(df) %in% "classe"]

Model Building

Random Forest Model

We use the Random Forest algorithm to build our classification model. Random Forest is chosen due to its robustness and ability to handle a large number of features and complex interactions:

# Fit a random forest model
rf_model <- randomForest(
  x = df,                # The data frame containing the predictors
  y = response,          # The response vector
  ntree = 300,           # Number of trees in the forest
  mtry = sqrt(ncol(df)), # Number of variables to try at each split
  nodesize = 5,          # Minimum number of samples in a terminal node
  importance = TRUE      # Compute variable importance
)
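
Since `importance = TRUE` is set, the fitted object can be inspected for its out-of-bag (OOB) error and variable importance. The sketch below shows the calls on the built-in `iris` data so that it runs on its own; the same accessors would apply to `rf_model`. The object name `demo_fit` and the settings used here are assumptions for illustration.

```r
library(randomForest)

set.seed(42)  # For reproducibility of the demo

# Small stand-in model fitted on the built-in iris data
demo_fit <- randomForest(
  x = iris[, 1:4],
  y = iris$Species,
  ntree = 100,
  importance = TRUE
)

# OOB error rate after the last tree (a built-in estimate of out-of-sample error)
oob_error <- demo_fit$err.rate[demo_fit$ntree, "OOB"]
print(oob_error)

# Mean decrease in accuracy per predictor (type = 1), largest first
var_imp <- importance(demo_fit, type = 1)
print(var_imp[order(var_imp[, 1], decreasing = TRUE), , drop = FALSE])
```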

Test Data Preparation

The test data undergoes similar preprocessing steps as the training data:

# Read the test data (already loaded above; repeated so this section runs on its own)
test.data <- read.csv("pml-testing.csv")

# Keep only the test columns with no missing values (missing percentage equal to 0)
test_missing_percentage <- sapply(test.data, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})
test.df <- test.data[, test_missing_percentage == 0]


# Apply one-hot encoding to the specified columns in the test data
test.df <- fastDummies::dummy_cols(test.df, select_columns = column_names, remove_first_dummy = TRUE)

# Remove the original columns after encoding
test.df <- test.df[, !colnames(test.df) %in% column_names]

# Convert 'cvtd_timestamp' column to datetime object in the test data
test.df$cvtd_timestamp <- dmy_hm(test.df$cvtd_timestamp)

# Extract day, month, year, hour, and minute from the datetime object in the test data
test.df$day <- day(test.df$cvtd_timestamp)
test.df$month <- month(test.df$cvtd_timestamp)
test.df$year <- year(test.df$cvtd_timestamp)
test.df$hour <- hour(test.df$cvtd_timestamp)
test.df$minute <- minute(test.df$cvtd_timestamp)

# Remove the original 'cvtd_timestamp' column in the test data
test.df <- test.df[, !colnames(test.df) %in% "cvtd_timestamp"]

# Remove the columns 'raw_timestamp_part_1' and 'raw_timestamp_part_2' from the test data
test.df <- test.df[, !(colnames(test.df) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2"))]

# Rename the column "new_window_" to "new_window_yes" so it matches the training data
colnames(test.df)[colnames(test.df) == "new_window_"] <- "new_window_yes"

# Remove the identifier columns 'problem_id' and 'X'
test.df <- test.df[, !colnames(test.df) %in% c("problem_id", "X")]
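
Because the test columns are selected by the test set's own missing-value pattern, it is worth confirming that they match the training predictors before predicting. The sketch below uses two toy data frames, `train_cols` and `test_cols`, as stand-ins for `df` and `test.df`; their contents are made up for illustration.

```r
# Toy stand-ins for the processed training and test predictors
train_cols <- data.frame(
  num_window = 1:3,
  roll_belt = c(1.4, 1.5, 1.6),
  new_window_yes = c(0L, 1L, 0L)
)
test_cols <- data.frame(
  roll_belt = c(1.2, 1.3),
  new_window_yes = c(0L, 0L),
  num_window = 7:8
)

# Columns the model expects but the test data lacks, and the reverse
missing_in_test <- setdiff(colnames(train_cols), colnames(test_cols))
extra_in_test   <- setdiff(colnames(test_cols), colnames(train_cols))

# Stop early with an error if the two sets of columns differ
stopifnot(length(missing_in_test) == 0, length(extra_in_test) == 0)

# Reorder the test columns to follow the training order
test_cols <- test_cols[, colnames(train_cols), drop = FALSE]
print(colnames(test_cols))
```

A check like this fails early with a clear message instead of surfacing later as an error inside `predict`.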

Predictions

We use the trained Random Forest model to make predictions on the test dataset:

# Make predictions
predictions <- predict(rf_model, test.df)
print(predictions)

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E

Model Evaluation

Cross-Validation

Cross-validation estimates how well the model performs on unseen data. The final model above was trained on the full training set without cross-validation, so its accuracy on held-out data has not yet been measured. The following section applies 5-fold cross-validation to estimate the out-of-sample error.

Expected Out-of-Sample Error

Setting Up Cross-Validation

To perform 5-fold cross-validation, we follow these steps:

1. Create 5 folds: split the data into 5 parts (folds) of roughly equal size.
2. Train and test the model: for each fold, fit the model on the other 4 folds and evaluate it on the held-out fold.
3. Calculate the error: compute the prediction accuracy for each fold and average the accuracies. The estimated out-of-sample error is one minus this mean accuracy.

set.seed(123) # For reproducibility

# Number of folds
k <- 5

# Create folds
folds <- cut(seq_len(nrow(df)), breaks = k, labels = FALSE)
folds <- sample(folds) # Shuffle so rows are assigned to folds at random

# Initialize a vector to store accuracies
accuracies <- numeric(k)

for (i in 1:k) {
  # Hold out fold i for testing and train on the remaining folds
  test_indices <- which(folds == i)
  test_data <- df[test_indices, ]
  test_response <- response[test_indices]

  train_data <- df[-test_indices, ]
  train_response <- response[-test_indices]

  # Fit the random forest model under a separate name so the final
  # 300-tree rf_model is not overwritten
  cv_model <- randomForest(
    x = train_data,
    y = train_response,
    ntree = 50, # Lower than before to reduce computation time
    mtry = sqrt(ncol(train_data)),
    nodesize = 5,
    importance = TRUE
  )

  # Predict on the held-out fold
  cv_predictions <- predict(cv_model, newdata = test_data)

  # Calculate accuracy for this fold
  accuracies[i] <- mean(cv_predictions == test_response)
}

# Average accuracy across all folds
mean_accuracy <- mean(accuracies)
mean_accuracy

## [1] 0.9976047

A mean accuracy of about 99.76% across the five folds corresponds to an expected out-of-sample error of about 1 - 0.9976 = 0.0024, or 0.24%.

Conclusion

The Random Forest model classifies the manner of exercise from sensor and time features with a 5-fold cross-validated accuracy of about 99.76%. Preprocessing removed incomplete and redundant columns, so the model was trained on clean, relevant features. Future work could include tuning hyperparameters such as ntree and mtry and running further cross-validation to refine the accuracy estimate.

PredictionAssignment

rematex3

2024-08-21