This project uses a data set from the Human Activity Recognition study, which evaluates the performance of the Unilateral Dumbbell Biceps Curl by 6 subjects. The data set is divided into a training and a test set. After feature evaluation and preprocessing, unnecessary features were eliminated and the training data set was split into a training subset and a validation subset. Prediction models were built with a Gradient Boosting Machine and a Random Forest, of which the Random Forest model performed better, predicting the outcomes in the validation data set with 100% accuracy.
People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants from the Human Activity Recognition study.
In this study, participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
- Exactly according to the specification (Class A)
- Throwing the elbows to the front (Class B)
- Lifting the dumbbell only halfway (Class C)
- Lowering the dumbbell only halfway (Class D)
- Throwing the hips to the front (Class E)
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied with the manner they were supposed to simulate. The exercises were performed by six male participants aged 20 to 28 years, with little weight lifting experience.
The training data for this project are available at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv, and the test data at https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv. The data come from the Human Activity Recognition study's Weight Lifting Exercises data set.
The goal of this project is to predict the manner in which the subjects did the exercise. This is the “classe” variable in the training set. You may use any of the other variables to predict with. You should create a report describing how you built your model, how you used cross validation, what you think the expected out of sample error is, and why you made the choices you did. You will also use your prediction model to predict 20 different test cases.
Libraries are loaded, and the data set files are downloaded and read into two data frames: training for the training data set and testing for the testing data set.
library(tidyverse); library(caret)
## -- Attaching packages --------------------------------------- tidyverse 1.3.1 --
## v ggplot2 3.3.5 v purrr 0.3.4
## v tibble 3.1.5 v dplyr 1.0.7
## v tidyr 1.1.4 v stringr 1.4.0
## v readr 2.0.2 v forcats 0.5.1
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
Train.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
Test.url <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if(!file.exists("DataSets/pml-training.csv")){
download.file(Train.url, destfile = "DataSets/pml-training.csv")
}
if(!file.exists("DataSets/pml-testing.csv")){
download.file(Test.url, "DataSets/pml-testing.csv")
}
training <- read.csv("./DataSets/pml-training.csv", header = TRUE)
testing <- read.csv("./DataSets/pml-testing.csv", header = TRUE)
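The raw CSV files in this data set are known to contain non-numeric placeholders such as empty strings and "#DIV/0!". Optionally, these can be mapped to NA already at read time, which avoids coercion warnings during the later numeric conversion; a sketch, assuming those placeholder tokens:
# Treat common placeholder strings as NA while reading
na_tokens <- c("", "NA", "#DIV/0!")
training <- read.csv("./DataSets/pml-training.csv", header = TRUE,
                     na.strings = na_tokens)
testing <- read.csv("./DataSets/pml-testing.csv", header = TRUE,
                    na.strings = na_tokens)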
The data sets consist of measurements from accelerometers on the belt, forearm, arm, and dumbbell of the 6 participants, as explained in the introduction. The training data set consists of 19622 observations of 160 variables. The first 7 columns of the data frame are ID variables, so they were removed. All other variables, except the classification variable, were then converted to numeric values.
dim(training)
## [1] 19622 160
# Remove the first 7 ID columns
train_data <- training[-c(1:7)]
# Convert every remaining column except "classe" (column 153) to numeric;
# non-numeric entries are coerced to NA
train_data <- sapply(train_data[, -c(153)], as.numeric)
train_data <- as.data.frame(train_data)
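A quick sanity check, not part of the original analysis, can confirm that the coercion left every column numeric:
# All remaining columns should now be numeric
stopifnot(all(sapply(train_data, is.numeric)))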
Variables that were highly correlated (absolute pairwise correlation above 0.8), had near zero variance, or had a proportion of NA values of 90% or more were removed from the training data frame.
# Remove features with near zero variance
train_nzv <- nearZeroVar(train_data)
train_data <- train_data[, -train_nzv]
# Remove highly correlated features: zero out the upper triangle of the
# absolute correlation matrix, then drop any column correlated above 0.8
# with an earlier column
M <- abs(cor(train_data))
M[!lower.tri(M)] <- 0
high_cor_features <- which(M > 0.8, arr.ind = TRUE)  # inspect flagged pairs
train_data <- train_data[, !apply(M, 2,
                                  function(x) any(x > 0.8, na.rm = TRUE))]
# Remove features whose proportion of NA values is 90% or more
train_data <- train_data[, sapply(train_data,
                                  function(x) mean(is.na(x))) < 0.9]
# Adding classification var
train_data$classe <- training$classe
train_data$classe <- as.factor(train_data$classe)
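The effect of the filtering can be confirmed by inspecting the reduced data frame (the exact column count depends on the thresholds above):
# Rows and remaining columns after preprocessing
dim(train_data)
# The largest remaining NA proportion should now be below 0.9
max(sapply(train_data[, -ncol(train_data)], function(x) mean(is.na(x))))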
The training data set was divided into two data sets: the models will be built on the train_data subset, and then evaluated on the val_data validation subset.
set.seed(125)
inTrain <- createDataPartition(y = train_data$classe, p = 0.8, list = FALSE)
# Take the validation rows first, before train_data is overwritten
val_data <- train_data[-inTrain, ]
train_data <- train_data[inTrain, ]
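A quick, optional check that the split preserved the class distribution across both subsets:
# Class proportions should be nearly identical in both subsets
round(prop.table(table(train_data$classe)), 3)
round(prop.table(table(val_data$classe)), 3)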
3-fold cross validation will be used for model fitting.
fitControl <- trainControl(method = "cv", number = 3)
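With this control object, caret splits the training data into 3 folds, fits on 2 of them, and evaluates on the held-out fold, rotating through all folds; the same object is reused for both models below. If more stable estimates were needed (at a higher computational cost), the fold count could be raised, for example:
# A hypothetical alternative: 5-fold cross validation, repeated twice
fitControl5 <- trainControl(method = "repeatedcv", number = 5, repeats = 2)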
A Gradient Boosting Machine model and a Random Forest model were created.
gbmFit1 <- train(classe ~ ., data = train_data,
method = "gbm",
trControl = fitControl,
verbose = FALSE)
rfFit <- train(classe ~ ., data = train_data,
trControl = fitControl,
method = "rf")
Below are the plots of both models, showing cross-validated accuracy across their tuning parameters.
library(gridExtra)
grid.arrange(plot(gbmFit1), plot(rfFit), nrow = 1)
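Because both models were trained with the same trainControl settings, caret's resamples() can also summarize their cross-validated accuracy side by side; a brief sketch (a strictly paired comparison would require fixing the fold indices via the index argument of trainControl):
# Compare the cross-validation results of the two models
resamps <- resamples(list(GBM = gbmFit1, RF = rfFit))
summary(resamps)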
## Model Validation

Both models were tested on the validation data set, and confusion matrices were created to compare their accuracy.
gbm_pred <- predict(gbmFit1, val_data)
rf_pred <- predict(rfFit, val_data)
gbm_cm <- confusionMatrix(gbm_pred, val_data$classe)
gbm_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 879 17 0 1 0
## B 5 600 11 3 2
## C 1 9 508 17 4
## D 0 0 8 497 4
## E 0 1 2 3 556
##
## Overall Statistics
##
## Accuracy : 0.9719
## 95% CI : (0.9655, 0.9774)
## No Information Rate : 0.2829
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9644
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9932 0.9569 0.9603 0.9539 0.9823
## Specificity 0.9920 0.9916 0.9881 0.9954 0.9977
## Pos Pred Value 0.9799 0.9662 0.9425 0.9764 0.9893
## Neg Pred Value 0.9973 0.9892 0.9919 0.9908 0.9961
## Prevalence 0.2829 0.2004 0.1691 0.1666 0.1809
## Detection Rate 0.2810 0.1918 0.1624 0.1589 0.1777
## Detection Prevalence 0.2868 0.1985 0.1723 0.1627 0.1797
## Balanced Accuracy 0.9926 0.9743 0.9742 0.9747 0.9900
rf_cm <- confusionMatrix(rf_pred, val_data$classe)
rf_cm
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 885 0 0 0 0
## B 0 627 0 0 0
## C 0 0 529 0 0
## D 0 0 0 521 0
## E 0 0 0 0 566
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9988, 1)
## No Information Rate : 0.2829
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2829 0.2004 0.1691 0.1666 0.1809
## Detection Rate 0.2829 0.2004 0.1691 0.1666 0.1809
## Detection Prevalence 0.2829 0.2004 0.1691 0.1666 0.1809
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
The out of sample error for the Gradient Boosting Machine model was 0.0281, with an accuracy of 0.9719, while the Random Forest model had an accuracy of 1 and an out of sample error of 0 on the validation data set.
## Model Accuracy OoSError
## 1 GBM 0.9719 0.0281
## 2 RF 1.0000 0.0000
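The code that produced this table is not shown above; a sketch of how it can be assembled from the confusion matrix objects:
# Build the accuracy / out-of-sample-error comparison table
acc <- c(gbm_cm$overall["Accuracy"], rf_cm$overall["Accuracy"])
data.frame(Model = c("GBM", "RF"),
           Accuracy = round(unname(acc), 4),
           OoSError = round(1 - unname(acc), 4))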
The Random Forest model had the best accuracy, correctly predicting 100% of the validation data set classes, so this model is applied to the testing data set. The predictions appear below.
rf_test <- predict(rfFit, testing)
rf_test
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
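If the 20 predictions need to be written out for submission, a minimal sketch, assuming one plain-text file per test case (write_predictions is a hypothetical helper, not part of the original analysis):
# Hypothetical helper: write each prediction to its own text file
write_predictions <- function(preds) {
    for (i in seq_along(preds)) {
        write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                    quote = FALSE, row.names = FALSE, col.names = FALSE)
    }
}
write_predictions(rf_test)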
In summary, the Random Forest model was the better of the two at predicting the outcome in the validation data set.