The goal of this project is to build a predictive model that
classifies the manner in which an exercise was performed based on
various sensor measurements and timestamps. The target variable is
classe, which represents the exercise class, and the
features used for prediction include sensor measurements and
time-related information.
We start by loading the required libraries and the training and test datasets:
# Load necessary libraries
library(lubridate)
library(fastDummies)
library(randomForest)
library(dplyr)
# Read the training and test data
train.data <- read.csv("pml-training.csv")
test.data <- read.csv("pml-testing.csv")
We calculate the percentage of missing values for each column and filter out columns with any missing values:
# Calculate the percentage of missing values for each column
missing_percentage <- sapply(train.data, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})
# Keep only the columns with no missing values (percentage == 0)
df <- train.data[, missing_percentage == 0]
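To make the filter concrete, here is a minimal sketch on a toy data frame. The data frame `toy` and its values are hypothetical and are not taken from the project files. It applies the same rule as above, treating both `NA` and empty strings as missing.

```r
# Toy data frame (hypothetical values, not from pml-training.csv)
toy <- data.frame(
  a = c(1, 2, 3),        # complete column
  b = c(NA, 5, 6),       # contains an NA
  c = c("x", "", "z"),   # contains an empty string
  stringsAsFactors = FALSE
)

# Percentage of missing values (NA or empty string) per column
toy_missing <- sapply(toy, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})

# Keep only complete columns; drop = FALSE keeps the result a data frame
# even when a single column survives
toy_clean <- toy[, toy_missing == 0, drop = FALSE]
colnames(toy_clean)
```

Only column `a` survives here. The `drop = FALSE` argument is an addition in this sketch; it matters only when a single column remains.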
We apply one-hot encoding to categorical columns to convert them into numerical values:
# Look up the names of the categorical columns to one-hot encode (positions 2 and 6 in df)
columns_to_encode <- c(2, 6)
column_names <- colnames(df)[columns_to_encode]
# Apply one-hot encoding to the specified columns
df <- fastDummies::dummy_cols(df, select_columns = column_names, remove_first_dummy = TRUE)
# Remove the original columns after encoding
df <- df[, !colnames(df) %in% column_names]
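The effect of `remove_first_dummy = TRUE` can be seen on a small example. The sketch below uses a hypothetical data frame `toy` with a made-up column `user`. It selects the column by name rather than by position, which is one way to avoid depending on column order.

```r
library(fastDummies)

# Toy data (hypothetical values, not taken from the project files)
toy <- data.frame(
  user = c("a", "b", "c", "a"),
  reading = c(0.1, 0.4, 0.3, 0.9),
  stringsAsFactors = FALSE
)

# One-hot encode 'user', dropping the first level to avoid a redundant column
encoded <- dummy_cols(toy, select_columns = "user", remove_first_dummy = TRUE)

# Drop the original categorical column, as in the step above
encoded <- encoded[, !colnames(encoded) %in% "user", drop = FALSE]

colnames(encoded)
```

With three levels, two indicator columns remain (`user_b` and `user_c`). A row with zeros in both corresponds to the dropped level `a`.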
We convert the timestamp column into date-time components:
# Convert 'cvtd_timestamp' column to datetime object
df$cvtd_timestamp <- dmy_hm(df$cvtd_timestamp)
# Extract day, month, year, hour, and minute from the datetime object
df$day <- day(df$cvtd_timestamp)
df$month <- month(df$cvtd_timestamp)
df$year <- year(df$cvtd_timestamp)
df$hour <- hour(df$cvtd_timestamp)
df$minute <- minute(df$cvtd_timestamp)
# Remove the original 'cvtd_timestamp' column
df <- df[, !colnames(df) %in% "cvtd_timestamp"]
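As a quick check of what the parsing step produces, the following sketch applies `dmy_hm()` to a single string. The string is a hypothetical example in day/month/year hour:minute layout, which is the layout that `dmy_hm()` expects.

```r
library(lubridate)

# Hypothetical timestamp string in day/month/year hour:minute layout
parsed <- dmy_hm("05/12/2011 11:23")

# Extract the same components as in the step above
components <- c(
  day = day(parsed),
  month = month(parsed),
  year = year(parsed),
  hour = hour(parsed),
  minute = minute(parsed)
)
components
```

Because the day comes first, this string is read as 5 December 2011, not 12 May 2011.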
We remove redundant or unnecessary columns:
# Remove columns 'raw_timestamp_part_1', 'raw_timestamp_part_2', and 'X'
df <- df[, !(colnames(df) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2", "X"))]
We extract the response variable and convert it to a factor:
# Take the 'classe' column from the training data as the response
response <- train.data$classe
response <- as.factor(response)
# Remove the 'classe' column from df (since it's in the response vector)
df <- df[, !colnames(df) %in% "classe"]
We use the Random Forest algorithm to build our classification model. Random Forest is chosen due to its robustness and ability to handle a large number of features and complex interactions:
# Fit a random forest model
rf_model <- randomForest(
  x = df,                 # The data frame containing the predictors
  y = response,           # The response vector
  ntree = 300,            # Number of trees in the forest
  mtry = sqrt(ncol(df)),  # Number of variables to try at each split
  nodesize = 5,           # Minimum number of samples in a terminal node
  importance = TRUE       # Compute variable importance
)
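A fitted forest also carries an out-of-bag (OOB) error estimate and variable importance scores. The sketch below shows how they can be read from a `randomForest` object. It is fitted on the built-in `iris` data so that it runs on its own; `toy_model` is a hypothetical name, and the same accessors should apply to `rf_model`.

```r
library(randomForest)

set.seed(42)  # For reproducibility of this sketch

# Small forest on a built-in dataset (stand-in for the project data)
toy_model <- randomForest(
  x = iris[, 1:4],
  y = iris$Species,
  ntree = 100,
  importance = TRUE
)

# OOB error rate after the last tree (row = number of trees, column "OOB")
oob_error <- toy_model$err.rate[toy_model$ntree, "OOB"]

# Mean decrease in accuracy per predictor (type = 1)
var_importance <- importance(toy_model, type = 1)

oob_error
var_importance
```

The OOB error is computed from trees that did not see a given observation during training, so it offers a second estimate of out-of-sample error alongside cross-validation.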
The test data undergoes the same preprocessing steps as the training data:
# Read the test data
test.data <- read.csv("pml-testing.csv")
# Keep only the columns with no missing values (percentage == 0) in the test data
test_missing_percentage <- sapply(test.data, function(x) {
  sum(is.na(x) | x == "") / length(x) * 100
})
test.df <- test.data[, test_missing_percentage == 0]
# Apply one-hot encoding to the specified columns in the test data
test.df <- fastDummies::dummy_cols(test.df, select_columns = column_names, remove_first_dummy = TRUE)
# Remove the original columns after encoding
test.df <- test.df[, !colnames(test.df) %in% column_names]
# Convert 'cvtd_timestamp' column to datetime object in the test data
test.df$cvtd_timestamp <- dmy_hm(test.df$cvtd_timestamp)
# Extract day, month, year, hour, and minute from the datetime object in the test data
test.df$day <- day(test.df$cvtd_timestamp)
test.df$month <- month(test.df$cvtd_timestamp)
test.df$year <- year(test.df$cvtd_timestamp)
test.df$hour <- hour(test.df$cvtd_timestamp)
test.df$minute <- minute(test.df$cvtd_timestamp)
# Remove the original 'cvtd_timestamp' column in the test data
test.df <- test.df[, !colnames(test.df) %in% "cvtd_timestamp"]
# Remove the columns 'raw_timestamp_part_1' and 'raw_timestamp_part_2' from the test data
test.df <- test.df[, !(colnames(test.df) %in% c("raw_timestamp_part_1", "raw_timestamp_part_2"))]
# The encoded test data contains a column named "new_window_"; rename it to
# "new_window_yes" so that it matches the training predictors
colnames(test.df)[colnames(test.df) == "new_window_"] <- "new_window_yes"
# Remove the identifier columns 'problem_id' and 'X'
test.df <- test.df[, !colnames(test.df) %in% c("problem_id", "X")]
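Because the training and test sets are preprocessed separately, it can help to verify that the test predictors match the training predictors before predicting. The helper below is a hypothetical sketch (`match_predictors` is not part of any package), shown on toy data frames with made-up values.

```r
# Hypothetical helper: check that every reference column is present in
# new_data, then return new_data with exactly those columns, in that order
match_predictors <- function(new_data, reference_names) {
  missing_cols <- setdiff(reference_names, colnames(new_data))
  if (length(missing_cols) > 0) {
    stop("Missing predictors: ", paste(missing_cols, collapse = ", "))
  }
  new_data[, reference_names, drop = FALSE]
}

# Toy data (hypothetical values)
toy_train <- data.frame(roll = c(1.2, 3.4), new_window_yes = c(0, 1))
toy_test <- data.frame(new_window_yes = 0, roll = 2.2, problem_id = 1)

# Extra columns are dropped and the order follows the training data
aligned <- match_predictors(toy_test, colnames(toy_train))
colnames(aligned)
```

In this report the equivalent call would be `match_predictors(test.df, colnames(df))`, which stops with an informative message if a predictor is missing.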
We use the trained Random Forest model to make predictions on the test dataset:
# Make predictions
predictions <- predict(rf_model, test.df)
print(predictions)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B
## Levels: A B C D E
Cross-validation estimates the model’s performance on unseen data. The model above was trained on the full training set without cross-validation, relying on feature engineering and careful preprocessing for model quality. We now apply cross-validation to estimate the out-of-sample error.
To perform 5-fold cross-validation, we follow these steps:
set.seed(123) # For reproducibility
# Number of folds
k <- 5
# Create folds
folds <- cut(seq(1, nrow(df)), breaks = k, labels = FALSE)
folds <- sample(folds)
# Initialize a vector to store accuracies
accuracies <- numeric(k)
for (i in 1:k) {
  # Split data into training and held-out sets
  test_indices <- which(folds == i)
  test_data <- df[test_indices, ]
  test_response <- response[test_indices]
  train_data <- df[-test_indices, ]
  train_response <- response[-test_indices]
  # Fit the random forest model for this fold; it is named cv_model so that
  # the final rf_model fitted on the full training set is not overwritten
  cv_model <- randomForest(
    x = train_data,
    y = train_response,
    ntree = 50, # Lower than before to reduce computation time
    mtry = sqrt(ncol(train_data)),
    nodesize = 5,
    importance = TRUE
  )
  # Predict on the held-out fold
  cv_predictions <- predict(cv_model, newdata = test_data)
  # Calculate accuracy for this fold
  accuracy <- sum(cv_predictions == test_response) / length(test_response)
  accuracies[i] <- accuracy
}
# Average accuracy across all folds
mean_accuracy <- mean(accuracies)
mean_accuracy
## [1] 0.9976047
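The estimated out-of-sample error follows directly from the mean accuracy as one minus that value. A minimal sketch using the figure reported above (the variable names are hypothetical):

```r
# Mean 5-fold accuracy as reported in the output above
reported_mean_accuracy <- 0.9976047

# Estimated out-of-sample error rate
oos_error <- 1 - reported_mean_accuracy

# Expressed as a percentage, rounded to two decimals
oos_error_percent <- round(oos_error * 100, 2)
oos_error_percent
```

This gives an estimated out-of-sample error of about 0.24%.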
The Random Forest model classifies the manner of exercise from sensor data and time features, with a mean 5-fold cross-validation accuracy of about 0.998. The preprocessing steps removed incomplete columns and encoded the categorical and timestamp information before model fitting. Future work could include hyperparameter tuning (for example of ntree and mtry) and repeated cross-validation to refine the estimate of the model’s accuracy.