Introduction

The objective is to develop a predictive model that forecasts the manner in which an exercise was performed, identified by the “classe” variable in the training dataset. The report details:

  1. The methodology used to build the model.

  2. The application of cross-validation techniques.

  3. The estimation of the expected out-of-sample error.

  4. A justification for all modeling decisions made.

  5. The use of the final model to predict 20 distinct test cases.

Data exploration and preparation

The following data preparation steps are performed.

  0. Load the data

library(caret)
## Loading required package: ggplot2
## Loading required package: lattice
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(lattice)

training_orig <- read.csv("pml-training.csv")
predicting_orig <- read.csv("pml-testing.csv")
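As a quick sanity check (a minimal exploration sketch; output omitted here), I inspect the dimensions of both datasets and the distribution of the target variable:

# Dataset dimensions and class balance (exploratory; output not shown)
dim(training_orig)
dim(predicting_orig)
table(training_orig$classe)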
  1. Remove features for which the proportion of “NA” values exceeds a threshold (here 70%).
# TRUE if the proportion of NA values in a column exceeds the threshold
is_majority_na <- function(col, threshold = 0.7) {
  na_prop <- sum(is.na(col)) / length(col)
  return(na_prop > threshold)
}

# Names of all columns whose NA proportion exceeds the threshold
get_cols_to_remove <- function(dataframe, threshold = 0.7) {
  result_bool <- sapply(dataframe, is_majority_na, threshold = threshold)
  return(names(result_bool)[result_bool])
}

cols_to_remove <- get_cols_to_remove(training_orig, threshold = 0.7)

training1 <- training_orig %>%
  select(-all_of(cols_to_remove))

predicting1 <- predicting_orig %>%
  select(-all_of(cols_to_remove))
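To verify the effect of this step (an optional check; output omitted), the number of dropped columns and the remaining dimensions can be inspected:

# How many columns were dropped, and what remains (output not shown)
length(cols_to_remove)
dim(training1)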
  2. Remove features for which the proportion of empty-string ("") values exceeds a threshold.
# Names of all columns whose proportion of empty strings exceeds the threshold
get_cols_with_high_empty_string_prop <- function(dataframe, threshold = 0.7) {
  empty_prop <- colMeans(dataframe == "")
  return(names(which(empty_prop > threshold)))
}

cols_to_remove <- get_cols_with_high_empty_string_prop(training1, threshold = 0.7)

training2 <- training1 %>%
  select(-all_of(cols_to_remove))

predicting2 <- predicting1 %>%
  select(-all_of(cols_to_remove))
  3. Manually remove identifier, timestamp, and window features, which carry no sensor information.
manual_to_remove <- c("user_name", "raw_timestamp_part_1", "raw_timestamp_part_2", "cvtd_timestamp", "X", "new_window", "num_window")
training3 <- training2 %>%
  select(-all_of(manual_to_remove))
predicting3 <- predicting2 %>%
  select(-all_of(manual_to_remove))
  4. Impute the remaining missing values in the numeric features with the mean of the corresponding feature.
# Replace NA values in a numeric column with the column mean
impute_with_mean <- function(col) {
  avg_value <- mean(col, na.rm = TRUE)
  col[is.na(col)] <- avg_value
  return(col)
}
training <- training3 %>%
  mutate_if(is.numeric, impute_with_mean)
predicting <- predicting3 %>%
  mutate_if(is.numeric, impute_with_mean)

training$classe <- as.factor(training$classe)
predicting$problem_id <- as.factor(predicting$problem_id)
  5. Split the cleaned training set into training (70%) and testing (30%) subsets.
inTrain <- createDataPartition(training$classe, p = 0.7, list = FALSE)
# Extract the testing subset before overwriting `training`; otherwise the
# test rows would be sampled from the already-subsetted training data
testing  <- training[-inTrain, ]
training <- training[inTrain, ]
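As a quick check on the split (a sketch; output omitted), the class proportions should be nearly identical in both subsets, since createDataPartition samples within each class:

# Class proportions in the two subsets (output not shown)
prop.table(table(training$classe))
prop.table(table(testing$classe))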

Model building

I choose a decision-tree-based approach for the following reasons:

  1. Classification task: the target variable (classe) is categorical (A, B, C, D, E). Decision trees are classification algorithms designed to predict such categories, whereas multivariate regression targets continuous, numerical outcomes.

  2. High dimensionality: the raw dataset has 160 columns, of which 52 predictors remain after cleaning. Tree-based methods such as Random Forest handle high-dimensional data robustly, performing implicit feature selection and resisting overfitting.

  3. Performance: in practice, Random Forest consistently achieves very high accuracy (often above 99%) on this specific dataset compared to regression-based approaches.

I prepare cross-validation next. I use standard k-fold cross-validation with k = 5, repeated only once to limit running time.

set.seed(1000)
control_params <- trainControl(
  method = "repeatedcv",
  number = 5,
  repeats = 1
)

I first use the “Recursive Partitioning and Regression Trees” (rpart) method to train a model:

modFit_rpart <- train(
  classe ~ ., 
  data = training, 
  method = "rpart",      
  trControl = control_params
)
print(modFit_rpart)
## CART 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 10988, 10989, 10990, 10991, 10990 
## Resampling results across tuning parameters:
## 
##   cp          Accuracy   Kappa     
##   0.03600854  0.5024348  0.34972150
##   0.05953955  0.4161917  0.20844949
##   0.11443393  0.3317331  0.07215803
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.03600854.
res_rpart <- predict(modFit_rpart, newdata=testing)
conf_matrix <- confusionMatrix(res_rpart, testing$classe)
res <- conf_matrix$overall['Accuracy']
cat("\nAccuracy:\n")
## 
## Accuracy:
cat(res, "\n")
## 0.4867429
cat("\nOut of sample error:\n")
## 
## Out of sample error:
cat(1 - res, "\n")
## 0.5132571
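To inspect the fitted tree visually (an optional sketch, assuming the rpart.plot package is available; plot not shown), the final CART model can be plotted:

# Plot the final CART tree (requires rpart.plot; plot not shown)
library(rpart.plot)
rpart.plot(modFit_rpart$finalModel)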

I next use the “Random Forest” (rf) method to train a model:

modFit_rf <- train(
  classe ~ ., 
  data = training, 
  method = "rf",      
  trControl = control_params
)
print(modFit_rf)
## Random Forest 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 10990, 10989, 10990, 10989, 10990 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##    2    0.9895175  0.9867384
##   27    0.9908277  0.9883963
##   52    0.9836209  0.9792791
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 27.
res_rf <- predict(modFit_rf, newdata=testing)
conf_matrix <- confusionMatrix(res_rf, testing$classe)
res <- conf_matrix$overall['Accuracy']
cat("\nAccuracy:\n")
## 
## Accuracy:
cat(res, "\n")
## 1
cat("\nOut of sample error:\n")
## 
## Out of sample error:
cat(1 - res, "\n")
## 0
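To see which features drive the forest's predictions (an optional sketch; output omitted), caret's varImp() reports variable importance for the fitted model:

# Variable importance for the random forest (output not shown)
importance_rf <- varImp(modFit_rf)
print(importance_rf)
plot(importance_rf, top = 20)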

Finally, I use the “Gradient Boosting Machine” (gbm) method to train a model:

modFit_gbm <- train(
  classe ~ ., 
  data = training,
  method = "gbm",     
  trControl = control_params,
  verbose = FALSE 
)
print(modFit_gbm)
## Stochastic Gradient Boosting 
## 
## 13737 samples
##    52 predictor
##     5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (5 fold, repeated 1 times) 
## Summary of sample sizes: 10990, 10990, 10990, 10989, 10989 
## Resampling results across tuning parameters:
## 
##   interaction.depth  n.trees  Accuracy   Kappa    
##   1                   50      0.7511837  0.6847087
##   1                  100      0.8153887  0.7663985
##   1                  150      0.8497486  0.8099028
##   2                   50      0.8493850  0.8091987
##   2                  100      0.9038360  0.8782875
##   2                  150      0.9295332  0.9108390
##   3                   50      0.8943725  0.8662621
##   3                  100      0.9406709  0.9249164
##   3                  150      0.9609079  0.9505387
## 
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
## 
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150, interaction.depth =
##  3, shrinkage = 0.1 and n.minobsinnode = 10.
res_gbm <- predict(modFit_gbm, newdata=testing)
conf_matrix <- confusionMatrix(res_gbm, testing$classe)
res <- conf_matrix$overall['Accuracy']
cat("\nAccuracy:\n")
## 
## Accuracy:
cat(res, "\n")
## 0.9751885
cat("\nOut of sample error:\n")
## 
## Out of sample error:
cat(1 - res, "\n")
## 0.02481148

Comparing the three models above, the Random Forest method delivers the best results (test accuracy ≈ 1.00), followed by GBM (≈ 0.975); the rpart method performs worst (≈ 0.487).
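The same comparison can be made more systematically with caret's resamples() helper, which collects the cross-validated accuracy of all three models side by side (a sketch; output omitted):

# Compare cross-validated resampling results of the three models (output not shown)
model_comparison <- resamples(list(
  rpart = modFit_rpart,
  rf    = modFit_rf,
  gbm   = modFit_gbm
))
summary(model_comparison)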

Prediction

I use the “Random Forest” model to predict the 20 new test cases in “pml-testing.csv”.

res <- predict(modFit_rf, newdata=predicting)
print("\nPredicted results:\n")
## [1] "\nPredicted results:\n"
print(res)
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
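If individual answer files are needed for submission (a hypothetical helper, assuming one text file per test case is the expected format), the predictions can be written out as follows:

# Hypothetical helper: write one file per test case (assumed submission format)
write_predictions <- function(preds) {
  for (i in seq_along(preds)) {
    writeLines(as.character(preds[i]), sprintf("problem_id_%d.txt", i))
  }
}
write_predictions(res)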