The goal is to predict the manner in which an exercise was performed, that is, whether it was done correctly or incorrectly.
The data were read with the read.csv() function from the following sources.
### Training Source : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
### Test Source : https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
A basic inspection with the summary() function showed that many variables consist mostly of NAs or blanks (""). A sample summary for four such variables is shown below:
summary(training[, 12:15])
## kurtosis_roll_belt  kurtosis_picth_belt  kurtosis_yaw_belt  skewness_roll_belt
##          :19216              :19216             :19216               :19216
## #DIV/0!  :   10     #DIV/0!  :   32      #DIV/0!:  406      #DIV/0!  :    9
## -1.908453:    2     47.000000:    4                         0.000000 :    4
## -0.016850:    1     -0.150950:    3                         0.422463 :    2
## -0.021024:    1     -0.684748:    3                         -0.003095:    1
## -0.025513:    1     -1.750749:    3                         -0.010002:    1
## (Other)  :  391     (Other)  :  361                         (Other)  :  389
As can be seen above, 19216 of the 19622 values in each of these variables are blank, which makes the variables useless for prediction.
Columns in which more than half of the values are NA or "" were therefore removed. The first seven variables are not relevant to prediction (time, name, etc.), so they were removed too.
# Fraction of blank ("") entries per column (NA cells are not counted as blank)
y1 <- colSums(training == "", na.rm = TRUE) / nrow(training)
colindex1 <- which(y1 > 0.5)
# Fraction of NA entries per column
y <- colSums(is.na(training)) / nrow(training)
colindex <- which(y > 0.5)
# Drop every column that is mostly blank or mostly NA
col1 <- union(colindex, colindex1)
training <- training[, -col1]
testing <- testing[, -col1]
# Drop the first seven columns (time, name, etc.)
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
dim(training)
## [1] 19622 53
dim(testing)
## [1] 20 53
This reduced both datasets to 53 columns.
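The filtering rule can be checked on a small made-up data frame. The following is a minimal sketch in base R, not the code used in the analysis: the helper name drop_sparse_columns and the toy data are illustrative assumptions, and unlike the code above it counts NA and "" together against a single threshold.

```r
# Hypothetical helper: drop columns in which more than `threshold`
# of the values are NA or the empty string "".
drop_sparse_columns <- function(df, threshold = 0.5) {
  # Fraction of NA or "" entries in each column
  sparse_frac <- vapply(
    df,
    function(col) mean(is.na(col) | as.character(col) == ""),
    numeric(1)
  )
  df[, sparse_frac <= threshold, drop = FALSE]
}

# Toy data, made up for illustration
toy <- data.frame(
  a = c(1, 2, 3, 4),        # complete
  b = c(NA, NA, NA, 4),     # 75% NA
  c = c("", "", "", "x"),   # 75% blank
  d = c("p", "", "q", "r"), # 25% blank
  stringsAsFactors = FALSE
)

kept <- drop_sparse_columns(toy)
names(kept)
```

Here columns b and c exceed the 50% threshold and are dropped, while a and d are kept.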
The training data were split into a training set (80%) and a testing set (20%) using the following code.
set.seed(1313)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
trainIndex <- createDataPartition(training$classe, p = 0.8, list = FALSE)
train1 <- training[trainIndex, ]
test1 <- training[-trainIndex, ]
This gave two datasets: one to train the models and a second to estimate their accuracy.
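For a factor outcome, createDataPartition samples within each class, so the class proportions are roughly preserved in both parts. The idea can be sketched in base R as follows; the helper stratified_index and the toy outcome vector are illustrative assumptions, not caret's implementation.

```r
set.seed(1313)

# Made-up outcome with unbalanced classes (illustrative only)
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Hypothetical helper: pick a fraction p of the row indices within each class
stratified_index <- function(y, p = 0.8) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

train_idx <- stratified_index(classe, p = 0.8)
table(classe[train_idx])   # class counts in the training part
table(classe[-train_idx])  # class counts in the held-out part
```

With these toy counts the training part holds 40, 24 and 16 rows of classes A, B and C, and the held-out part holds the remaining 10, 6 and 4.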
Three models were trained: a classification tree (rpart), a random forest (rf) and gradient boosting (gbm). Each was cross-validated with k-fold cross-validation using 5 folds. Using more than 5 folds increased the computation time without any gain in accuracy.
controltrain <- trainControl(method = "cv", number = 5)
model1 <- train(classe ~ ., data = train1, method = "rpart", trControl = controltrain)
predict1 <- predict(model1, test1[,-53])
accuracy1 <- confusionMatrix(predict1, test1$classe)
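To make the mechanics of 5-fold cross-validation concrete, the sketch below assigns made-up rows to folds and shows which rows are used for fitting and which for scoring in each round. It is a base R illustration under simple assumptions (plain random fold assignment, 20 toy rows) and does not reproduce how caret builds its folds internally.

```r
set.seed(1313)

n <- 20   # made-up number of rows
k <- 5    # number of folds, as in trainControl(number = 5)

# Randomly assign each row to one of k folds of equal size
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)           # rows used to score the model
  fit_rows <- setdiff(seq_len(n), held_out)    # rows used to fit the model
  cat(sprintf("fold %d: fit on %d rows, score on %d rows\n",
              fold, length(fit_rows), length(held_out)))
}
```

Every row is held out exactly once, and each round fits on 16 rows and scores on the remaining 4. The cross-validated accuracy is the average of the k held-out scores.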
For a quick look at the results, the tree was plotted using rpart.plot. The result is as follows:
[Figure: plot of the classification tree fitted in model1]
The plot alone raises serious questions about the accuracy of the model: class D is never predicted by the tree, so every observation whose true class is D is misclassified, which means 0% accuracy for that class. More detailed accuracy figures are computed later.
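The effect of a class that is never predicted can be seen from a confusion matrix. The sketch below uses made-up labels (not the project data) in which D never appears among the predictions, and computes the per-class sensitivity in base R.

```r
# Made-up true and predicted labels; the predictions never include "D"
lv <- c("A", "B", "C", "D", "E")
truth <- factor(c("A", "A", "B", "C", "D", "D", "E", "E"), levels = lv)
pred  <- factor(c("A", "A", "B", "C", "C", "A", "E", "B"), levels = lv)

# Confusion matrix: rows are predictions, columns are true classes
cm <- table(Prediction = pred, Reference = truth)

# Sensitivity per class: correct predictions / true cases of that class
sensitivity <- diag(cm) / colSums(cm)
sensitivity
```

The row for D in the confusion matrix is all zeros, so the sensitivity for D is 0 no matter how well the other classes are predicted.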
For the random forest, the number of trees (ntree = 16) was chosen to keep the computation time reasonable without compromising the accuracy of the model. Increasing it to 32 did not bring any significant benefit.
controltrain <- trainControl(method = "cv", number = 5)
model2 <- train(classe ~ ., data = train1, method = "rf", trControl = controltrain, ntree = 16)
predict2 <- predict(model2, test1[,-53])
accuracy2 <- confusionMatrix(predict2, test1$classe)
controltrain <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
model3 <- train(classe ~ ., data = train1, method = "gbm", trControl = controltrain, verbose = TRUE)
predict3 <- predict(model3, test1[,-53])
accuracy3 <- confusionMatrix(predict3, test1$classe)
The accuracies on the held-out testing set were as follows, in the order the models were fitted (classification tree, random forest, gbm):
accuracy1[[3]][1]
## Accuracy
## 0.4901861
accuracy2[[3]][1]
## Accuracy
## 0.9920979
accuracy3[[3]][1]
## Accuracy
## 0.9648228
The random forest is clearly the most accurate of the three models (0.992, against 0.965 for gbm and 0.490 for the classification tree), so it was chosen to make the required predictions.
The final predictions for the 20 test cases are as follows:
final_predict <- predict(model2, testing[, -53])
final_predict
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
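As a rough sanity check on these 20 predictions, the held-out accuracy of the random forest can be turned into an estimated out-of-sample error rate and a probability that all 20 cases are classified correctly. The sketch below assumes that errors on the 20 cases are independent and occur at the held-out rate, which is a simplification rather than something established in the analysis.

```r
acc <- 0.9920979   # random forest accuracy on the held-out set (reported above)
n_cases <- 20      # number of rows in the final test set

# Estimated out-of-sample error rate
err <- 1 - acc

# Probability that all n_cases predictions are correct,
# ASSUMING independent errors at the held-out rate
p_all_correct <- acc^n_cases

round(c(error = err, p_all_correct = p_all_correct), 4)
```

Under that assumption the estimated error rate is below 1%, and the chance of getting all 20 cases right is roughly 0.85.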