This document demonstrates my understanding of the topics covered in the Coursera Practical Machine Learning course. The project was completed using R version 4.4.1 in RStudio version 2024.09.0.
The goal of this project is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform 10 repetitions of unilateral dumbbell biceps curls correctly and incorrectly in 5 different ways. More information is available from the website: http://groupware.les.inf.puc-rio.br/har. The manner in which the exercises were performed is coded as follows:
Class A - exactly according to the specification
Class B - throwing the elbows to the front
Class C - lifting the dumbbell only halfway
Class D - lowering the dumbbell only halfway
Class E - throwing the hips to the front
We were supplied with both training and testing data sets. The task is to analyze the training data and develop a prediction model, then apply it to the test data to predict the manner in which each of the 20 test cases performed the exercise. The manner is represented by the “classe” variable; any of the other variables in the data set may be used to build the model.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
Before any analysis can be done, the data must be imported and cleaned.
library(caret); library(rpart); library(corrplot); library(rattle); library(data.table)
## Loading required package: ggplot2
## Loading required package: lattice
## corrplot 0.94 loaded
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
# Load the training and testing data sets.
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv", method = "curl")
train.data <- read.csv("pml-training.csv")
download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv", method = "curl")
test.data <- read.csv("pml-testing.csv")
dim(train.data)
## [1] 19622 160
dim(test.data)
## [1] 20 160
Upon examination, both data sets have 160 variables. Many of these variables consist mostly of missing values (NAs) and will need to be removed. It is also likely that a number of the remaining variables have near zero variance and thus provide no useful information for the model we are building; these should be removed as well.
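To see why a 0.9 cutoff works, the fraction of missing values per column can be tabulated; in this data set, columns tend to be either nearly complete or almost entirely NA, so the threshold cleanly separates the two groups. A quick check (a sketch using the train.data loaded above):
# Tabulate the per-column fraction of NAs, rounded to two digits.
table(round(colMeans(is.na(train.data)), 2))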
# Remove variables containing mostly NAs.
train.data <- train.data[, colMeans(is.na(train.data)) < .9]
dim(train.data)
## [1] 19622 93
# Remove near zero variance variables
nzv <- nearZeroVar(train.data)
train.data <- train.data[, -nzv]
dim(train.data)
## [1] 19622 59
# The first five columns contain identifier and date/time stamp info that is not useful for prediction, so remove them.
train.data <- train.data[, -(1:5)]
# Ensure the outcome is a factor; read.csv in R >= 4.0 imports it as character.
train.data$classe <- factor(train.data$classe)
dim(train.data)
## [1] 19622 54
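The predict call at the end only references the predictor columns retained during training, so test.data can be left as-is. If a matching cleaned copy is wanted, the same columns can be selected into a new object; a sketch (the names keep and test.clean are mine, and test.data’s final column is problem_id rather than classe):
# Optionally mirror the cleaning on a copy of the test set;
# test.data itself stays untouched.
keep <- intersect(names(train.data), names(test.data))
test.clean <- test.data[, c(keep, "problem_id")]
dim(test.clean)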
Next, split the training data into a training set and a validation set. The test set, denoted test.data, will remain untouched; the final model will be applied to it to generate our predictions.
set.seed(1970) # Set seed to ensure reproducibility
inTrain <- createDataPartition(train.data$classe, p = 0.7, list = FALSE)
# Extract the validation rows before overwriting train.data; taking them
# afterwards would sample from the already-subsetted training set.
validation.data <- train.data[-inTrain, ]
train.data <- train.data[inTrain, ]
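A quick dimension check confirms the roughly 70/30 split (a sanity check I added; it is not required for the model fits):
# Roughly 70% of rows should be in train.data, 30% in validation.data.
dim(train.data)
dim(validation.data)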
The data cleaning procedure reduced the number of variables from 160 to 54.
Let’s take a look at the relationships between the variables. In the correlation plot below, a strong positive correlation is shown in blue and a strong negative correlation in red.
train.corr <- cor(train.data[, -54])
corrplot(train.corr, type = "upper", order = "FPC", tl.cex = .5, tl.col = "black")
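Beyond the visual check, caret’s findCorrelation can flag the indices of highly correlated predictors; a short sketch with a 0.8 cutoff (the cutoff choice is mine, not part of the original analysis):
# List predictors with an absolute pairwise correlation above 0.8.
high.corr <- findCorrelation(train.corr, cutoff = 0.8)
names(train.data)[high.corr]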
We will build prediction models on the training data using two methods: a Generalized Boosted Model (GBM) and a Random Forest. A confusion matrix is generated for each model on the validation set to compare accuracies. I applied the same trainControl settings (5-fold cross-validation) to both methods so the comparison is fair. The model with the higher accuracy will be applied to the test.data prediction set.
The Generalized Boosted Model yielded an accuracy of 99.34% and an out-of-sample error rate of 0.66%.
# Generalized Boosted Model and associated Confusion Matrix
modGBM <- train(classe ~ ., data = train.data, method = "gbm",
                trControl = trainControl(method = "cv", number = 5),
                verbose = FALSE)
predict.gbm <- predict(modGBM, newdata = validation.data)
confMatrix.gbm <- confusionMatrix(table(predict.gbm, validation.data$classe))
#Print Confusion Matrix for Generalized Boosted Model
confMatrix.gbm
## Confusion Matrix and Statistics
##
##
## predict.gbm A B C D E
## A 1204 4 0 0 0
## B 0 766 3 3 0
## C 0 3 702 5 1
## D 0 2 3 648 2
## E 0 0 1 0 758
##
## Overall Statistics
##
## Accuracy : 0.9934
## 95% CI : (0.9904, 0.9957)
## No Information Rate : 0.2933
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9917
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9884 0.9901 0.9878 0.9961
## Specificity 0.9986 0.9982 0.9973 0.9980 0.9997
## Pos Pred Value 0.9967 0.9922 0.9873 0.9893 0.9987
## Neg Pred Value 1.0000 0.9973 0.9979 0.9977 0.9991
## Prevalence 0.2933 0.1888 0.1727 0.1598 0.1854
## Detection Rate 0.2933 0.1866 0.1710 0.1579 0.1847
## Detection Prevalence 0.2943 0.1881 0.1732 0.1596 0.1849
## Balanced Accuracy 0.9993 0.9933 0.9937 0.9929 0.9979
The Random Forest model yielded an accuracy of 100% and an out-of-sample error rate of 0% on the validation set. Since the Random Forest model had the highest accuracy, I will apply it to the 20 test cases provided in the file pml-testing.csv, denoted as test.data in my code.
# Random Forest Model and associated Confusion Matrix
modRF <- train(classe ~ ., data = train.data, method = "rf",
               trControl = trainControl(method = "cv", number = 5),
               verbose = FALSE)
predict.rf <- predict(modRF, newdata = validation.data)
confMatrix.rf <- confusionMatrix(table(predict.rf, validation.data$classe))
#Print Confusion Matrix for Random Forest model
confMatrix.rf
## Confusion Matrix and Statistics
##
##
## predict.rf A B C D E
## A 1204 0 0 0 0
## B 0 775 0 0 0
## C 0 0 709 0 0
## D 0 0 0 656 0
## E 0 0 0 0 761
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9991, 1)
## No Information Rate : 0.2933
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2933 0.1888 0.1727 0.1598 0.1854
## Detection Rate 0.2933 0.1888 0.1727 0.1598 0.1854
## Detection Prevalence 0.2933 0.1888 0.1727 0.1598 0.1854
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
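Since both models were trained with identical trainControl settings (5-fold cross-validation), caret’s resamples can also compare their cross-validated accuracy directly; a supplementary sketch (the validation-set confusion matrices above already establish the ranking):
# Compare the two models' cross-validated accuracy fold by fold.
model.comp <- resamples(list(GBM = modGBM, RF = modRF))
summary(model.comp)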
Applying the Random Forest model to test.data yielded the following predictions for the 20 test cases.
pred <- predict(modRF, test.data)
pred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
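For the course quiz, each of the 20 predictions can be written to its own text file; a minimal sketch (the helper function name is mine):
# Write one file per test case: problem_1.txt, ..., problem_20.txt.
write.predictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write.predictions(pred)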