The aim of this project is to predict how well a person performed a
weight lifting exercise. The outcome variable in the training data is
classe, a factor with five levels (A to E), so this is
a multi-class classification problem.
I used a Random Forest model because it works well when there are many predictor variables and when the relationship between the predictors and the outcome may be complex. This dataset contains many sensor measurements, so Random Forest is a strong choice.
library(caret)
library(randomForest)
library(dplyr)
library(ggplot2)
training <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
Columns in which at least 95% of the values are missing were removed. I also removed the identification, timestamp, and window columns because they describe how a record was collected rather than the movement being performed. After that, I removed predictors with near-zero variance.
I also made sure that the outcome variable classe stayed
as a factor with fixed levels. This avoids matching errors later when
comparing predictions to the true answers.
# Drop columns where at least 95% of the values are missing
na_fraction <- colMeans(is.na(training))
training_clean <- training[, na_fraction < 0.95]
testing_clean <- testing[, names(training_clean)[names(training_clean) != "classe"]]
# Remove identifier, timestamp, and window fields that do not describe the movement
remove_cols <- c(
  "X", "user_name", "raw_timestamp_part_1", "raw_timestamp_part_2",
  "cvtd_timestamp", "new_window", "num_window"
)
remove_cols <- remove_cols[remove_cols %in% names(training_clean)]
training_clean <- training_clean %>% select(-all_of(remove_cols))
testing_clean <- testing_clean %>% select(-all_of(remove_cols))
# Remove near-zero-variance predictors
predictor_names <- setdiff(names(training_clean), "classe")
nzv <- nearZeroVar(training_clean[, predictor_names])
if (length(nzv) > 0) {
  keep_predictors <- predictor_names[-nzv]
  training_clean <- training_clean[, c(keep_predictors, "classe")]
  testing_clean <- testing_clean[, keep_predictors]
}
# Keep outcome levels consistent
training_clean$classe <- factor(training_clean$classe)
classe_levels <- levels(training_clean$classe)
# Give character or factor predictors the same factor levels in both datasets
common_predictors <- intersect(names(testing_clean), setdiff(names(training_clean), "classe"))
for (col in common_predictors) {
  if (is.character(training_clean[[col]]) || is.factor(training_clean[[col]]) ||
      is.character(testing_clean[[col]]) || is.factor(testing_clean[[col]])) {
    combined_levels <- unique(c(as.character(training_clean[[col]]), as.character(testing_clean[[col]])))
    training_clean[[col]] <- factor(as.character(training_clean[[col]]), levels = combined_levels)
    testing_clean[[col]] <- factor(as.character(testing_clean[[col]]), levels = combined_levels)
  }
}
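As a rough illustration of what the near-zero-variance filter above checks, the sketch below reimplements the idea in base R on a toy data frame. The helper name is_near_zero_var and the toy columns are hypothetical, and the cutoffs (a frequency ratio of 95/5 and 10% unique values) are what I understand to be the caret defaults, so treat this as an approximation of nearZeroVar rather than its actual implementation.

```r
# Hypothetical helper, not part of caret: flags a column whose most common
# value dominates and which has few distinct values.
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  percent_unique <- 100 * length(counts) / length(x)
  freq_ratio > freq_cut && percent_unique <= unique_cut
}

# Toy data (assumed for illustration only)
toy <- data.frame(
  mostly_constant = c(rep(0, 99), 1),  # 99:1 frequency ratio, 2% unique values
  varied = seq_len(100)                # every value distinct
)

nzv_flags <- vapply(toy, is_near_zero_var, logical(1))
nzv_flags
```

In this toy case only mostly_constant is flagged, which is the kind of predictor the cleaning step drops before modeling.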
The training data was split into 70% for model building and 30% for validation. The split is stratified by classe, so each class keeps roughly the same proportion in both parts. I used the larger part to fit the model and the remaining part to test how well it performs on unseen data.
set.seed(123)
in_train <- createDataPartition(training_clean$classe, p = 0.70, list = FALSE)
train_data <- training_clean[in_train, ]
valid_data <- training_clean[-in_train, ]
train_data$classe <- factor(train_data$classe, levels = classe_levels)
valid_data$classe <- factor(valid_data$classe, levels = classe_levels)
I trained a Random Forest model with 250 trees on the 70% training split.
set.seed(123)
rf_model <- randomForest(
  classe ~ .,
  data = train_data,
  ntree = 250,
  importance = TRUE
)
rf_model
##
## Call:
## randomForest(formula = classe ~ ., data = train_data, ntree = 250, importance = TRUE)
## Type of random forest: classification
## Number of trees: 250
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.52%
## Confusion matrix:
## A B C D E class.error
## A 3903 3 0 0 0 0.0007680492
## B 11 2641 6 0 0 0.0063957863
## C 0 14 2379 3 0 0.0070951586
## D 0 0 26 2225 1 0.0119893428
## E 0 0 2 5 2518 0.0027722772
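The out-of-bag figures printed above can be checked by hand. The sketch below retypes the out-of-bag confusion matrix from the output (rows are the true classes, columns the predicted ones) and recomputes the overall and per-class error rates. The object names are mine and are not part of the original analysis.

```r
# Out-of-bag confusion matrix copied from the printed model summary
oob_counts <- matrix(
  c(3903,    3,    0,    0,    0,
      11, 2641,    6,    0,    0,
       0,   14, 2379,    3,    0,
       0,    0,   26, 2225,    1,
       0,    0,    2,    5, 2518),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: share of training rows that fall off the diagonal
oob_error <- 1 - sum(diag(oob_counts)) / sum(oob_counts)

# Per-class error: misclassified rows divided by the row total
class_error <- 1 - diag(oob_counts) / rowSums(oob_counts)

round(100 * oob_error, 2)   # percentage, matches the 0.52% in the summary
class_error
```

Of the 13,737 training rows, 71 are misclassified out of bag, which is where the 0.52% comes from.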
To measure performance, I predicted the classes for the 30% validation set and compared them to the real classes.
valid_pred <- predict(rf_model, newdata = valid_data)
valid_pred <- factor(valid_pred, levels = classe_levels)
reference <- factor(valid_data$classe, levels = classe_levels)
cm <- confusionMatrix(data = valid_pred, reference = reference)
cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 4 0 0 0
## B 0 1132 4 0 0
## C 0 3 1022 9 4
## D 0 0 0 955 4
## E 0 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 0.9952
## 95% CI : (0.9931, 0.9968)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.994
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9961 0.9907 0.9926
## Specificity 0.9991 0.9992 0.9967 0.9992 1.0000
## Pos Pred Value 0.9976 0.9965 0.9846 0.9958 1.0000
## Neg Pred Value 1.0000 0.9985 0.9992 0.9982 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1737 0.1623 0.1825
## Detection Prevalence 0.2851 0.1930 0.1764 0.1630 0.1825
## Balanced Accuracy 0.9995 0.9965 0.9964 0.9949 0.9963
accuracy <- cm$overall["Accuracy"]
# 1 - accuracy keeps the element name "Accuracy", so the error prints under that label
out_of_sample_error <- 1 - accuracy
accuracy
## Accuracy
## 0.9952421
out_of_sample_error
## Accuracy
## 0.004757859
The validation accuracy estimates how well the model should perform
on new data: 0.9952, with a 95% confidence interval of 0.9931 to 0.9968.
The expected out of sample error is calculated as 1 - accuracy, which
is about 0.48%. This is close to the 0.52% out-of-bag error estimate
reported when the model was fit.
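The accuracy and error can also be recomputed directly from the validation confusion matrix shown earlier. The sketch below retypes that table (rows are predictions, columns are reference classes) and uses base R only. The object names are mine, and the use of binom.test for the interval is an assumption about how the reported 95% CI was obtained.

```r
# Validation confusion matrix copied from the confusionMatrix output
valid_counts <- matrix(
  c(1674,    4,    0,   0,    0,
       0, 1132,    4,   0,    0,
       0,    3, 1022,   9,    4,
       0,    0,    0, 955,    4,
       0,    0,    0,   0, 1074),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = LETTERS[1:5], reference = LETTERS[1:5])
)

n_valid   <- sum(valid_counts)         # validation rows
n_correct <- sum(diag(valid_counts))   # correctly classified rows

accuracy_check <- n_correct / n_valid
error_check    <- 1 - accuracy_check

# Exact binomial interval for the accuracy (assumed to match the reported CI)
ci <- binom.test(n_correct, n_valid)$conf.int

# Per-class sensitivity: correct predictions divided by the reference column total
sensitivity <- diag(valid_counts) / colSums(valid_counts)

accuracy_check
error_check
```

Here 5,857 of the 5,885 validation rows are classified correctly, so 28 errors give the estimated out of sample error.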
importance_values <- importance(rf_model)
varImpPlot(rf_model, n.var = 20)
After checking the model on the validation data, I used it to predict the 20 cases in the provided testing file.
final_predictions <- predict(rf_model, newdata = testing_clean)
final_predictions <- factor(final_predictions, levels = classe_levels)
final_predictions
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
I chose Random Forest because it is reliable for classification tasks with many variables. It can handle complex patterns and usually gives strong accuracy without needing a lot of manual tuning. It also works well when some variables are noisy or less useful.
I also chose to clean the data before modeling because columns with too many missing values and columns that only identify the record do not help prediction. Removing them helps the model focus on the important sensor measurements.
This project used a Random Forest model to predict the manner in which weight lifting exercises were performed. The data was cleaned by removing mostly missing and non-predictive variables, then split into 70% training data and 30% validation data. On the validation set the model reached 99.5% accuracy, which gives an estimated out of sample error of about 0.5%. Finally, the fitted model was used to predict the 20 unseen test cases.