People regularly quantify how much of an exercise they do, but rarely measure how well they do it. In this investigation, data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants are used to predict the manner in which each participant performed the exercise (the classe variable). Three methods are compared to find the best predictor: random forest, decision tree, and generalized boosted model (GBM). R is the main tool used in the project.
The data for this project is taken from the Human Activity Recognition project by Groupware@LES. For more information, please visit their website.
The first step in this investigation is data preparation. The following code loads the required libraries.
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(e1071)
The next step is to load the datasets from the URLs provided and store them in the training and testing variables.
trainURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testURL <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
trainFile <- "pml-training.csv"
testFile <- "pml-testing.csv"
# Download each file only if it is not already present locally
if (!file.exists(trainFile)) {
  download.file(trainURL, destfile = trainFile)
}
if (!file.exists(testFile)) {
  download.file(testURL, destfile = testFile)
}
training <- read.csv(trainFile)
testing <- read.csv(testFile)
To estimate how well each model generalizes, the training dataset is partitioned into two subsets:
* trainSet: 70% of the dataset; used to fit the models
* validationSet: 30% of the dataset; used as a hold-out set to estimate out-of-sample accuracy
trainingPartition <- createDataPartition(training$classe, p = 0.7, list = FALSE)
trainSet <- training[trainingPartition, ]
validationSet <- training[-trainingPartition, ]
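When the outcome is a factor, createDataPartition samples within each class, so both subsets should keep roughly the class proportions of the full data. The following is a minimal base-R sketch of that idea on synthetic labels. The labels vector and the stratifiedIndex helper are hypothetical and are not part of the analysis; caret's own implementation may differ in details such as rounding.

```r
# Hypothetical toy labels standing in for training$classe
set.seed(1234)
labels <- factor(sample(c("A", "B", "C", "D", "E"), size = 1000, replace = TRUE,
                        prob = c(0.28, 0.19, 0.17, 0.17, 0.19)))

# Illustrative helper: draw a fraction p of the row indices within each class
stratifiedIndex <- function(y, p) {
  idx <- lapply(split(seq_along(y), y), function(rows) {
    rows[sample.int(length(rows), size = round(p * length(rows)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

inTrain <- stratifiedIndex(labels, p = 0.7)

# Class proportions should be nearly identical in both subsets
proportions <- rbind(train      = prop.table(table(labels[inTrain])),
                     validation = prop.table(table(labels[-inTrain])))
round(proportions, 3)
```

Comparing prop.table(table(trainSet$classe)) with prop.table(table(validationSet$classe)) would give the same kind of check on the real split.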
To ensure the classification models can be fitted to the dataset, the data must first be cleaned. The following steps are applied:
1. Remove constant and near-constant variables across the sample
2. Remove variables composed of more than 95% missing values or empty strings
3. Remove identification variables, such as user and timestamp information
# Remove constant and near-constant variables across the sample
NZV <- nearZeroVar(trainSet)
trainSet <- trainSet[, -NZV]
validationSet <- validationSet[, -NZV]
# Remove variables with mostly missing values
mostlyNA <- sapply(trainSet, function(x) mean(is.na(x))) > 0.95
trainSet <- trainSet[, !mostlyNA]
validationSet <- validationSet[, !mostlyNA]
# Remove identification variables
trainSet <- trainSet[, -(1:5)]
validationSet <- validationSet[, -(1:5)]
After this cleaning process, 53 predictor variables remain for analysis, plus the outcome variable classe.
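The cleaning steps derive each column filter from trainSet and then reuse it on validationSet, which keeps the two subsets aligned. A small sketch of that pattern on toy data follows; the trainToy and validToy data frames are hypothetical. One caveat with negative indexing such as trainSet[, -NZV]: if the index vector is empty, R selects zero columns rather than all of them, so a logical mask can be a safer pattern.

```r
# Hypothetical toy data: column `sparse` is almost entirely NA in the training rows
trainToy <- data.frame(x = 1:40, sparse = c(1, rep(NA, 39)), y = rnorm(40))
validToy <- data.frame(x = 41:45, sparse = c(NA, 2, 3, NA, 4), y = rnorm(5))

# Fraction of missing values per column, computed on the training data only
naFraction <- colMeans(is.na(trainToy))
keep <- !(naFraction > 0.95)

# Apply the same logical mask to both subsets so their columns stay aligned
trainToy <- trainToy[, keep, drop = FALSE]
validToy <- validToy[, keep, drop = FALSE]

names(trainToy)
names(validToy)
```

Note that the decision to drop `sparse` is made from the training rows alone, even though the validation rows happen to contain more observed values.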
To gain better insight into the relationships between the variables, a correlation analysis is performed.
plotCorrelation <- cor(trainSet[, -54])
corrplot(plotCorrelation, method = "color", order = "AOE", type = "lower", tl.cex = 0.5, tl.col = rgb(0, 0, 0), title = "Figure 1: Correlation Plot", mar=c(0,0,1,0))
In Figure 1, strongly positively correlated pairs are shown in dark blue, while strongly negatively correlated pairs are shown in dark red.
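A correlation plot shows the overall pattern, but it can also help to list the strongly correlated pairs explicitly. The sketch below does this in base R on toy data. The toy data frame and the 0.8 cutoff are assumptions chosen for illustration, not values taken from the analysis.

```r
# Hypothetical toy predictors; `b` is built to track `a` closely
set.seed(1234)
a <- rnorm(200)
toy <- data.frame(a = a, b = a + rnorm(200, sd = 0.1), c = rnorm(200))

corMatrix <- cor(toy)

# Look only at the upper triangle so each pair is reported once
cutoff <- 0.8  # illustrative threshold
high <- which(abs(corMatrix) > cutoff & upper.tri(corMatrix), arr.ind = TRUE)

highPairs <- data.frame(var1 = rownames(corMatrix)[high[, "row"]],
                        var2 = colnames(corMatrix)[high[, "col"]],
                        correlation = round(corMatrix[high], 3))
highPairs
```

The same lines could be run on the plotCorrelation matrix computed earlier to see which sensor variables carry overlapping information.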
Now, three popular methods are applied to model the classification problem on the training dataset. A confusion matrix is plotted at the end of each analysis to better visualize the accuracy of each model.
# Set seed for reproducibility
set.seed(1234)
# Create random forest model
controlRF <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
modelRF <- train(classe ~ ., data = trainSet, method = "rf", trControl = controlRF)
modelRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.21%
## Confusion matrix:
## A B C D E class.error
## A 3905 1 0 0 0 0.0002560164
## B 7 2647 4 0 0 0.0041384500
## C 0 4 2392 0 0 0.0016694491
## D 0 0 7 2244 1 0.0035523979
## E 0 1 0 4 2520 0.0019801980
# Predict on the validation set
predictRF <- predict(modelRF, newdata = validationSet)
confusionMatrixRF <- confusionMatrix(predictRF, validationSet$classe)
confusionMatrixRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1673 6 0 0 0
## B 0 1132 2 0 0
## C 0 0 1024 4 0
## D 0 1 0 960 1
## E 1 0 0 0 1081
##
## Overall Statistics
##
## Accuracy : 0.9975
## 95% CI : (0.9958, 0.9986)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9968
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9994 0.9939 0.9981 0.9959 0.9991
## Specificity 0.9986 0.9996 0.9992 0.9996 0.9998
## Pos Pred Value 0.9964 0.9982 0.9961 0.9979 0.9991
## Neg Pred Value 0.9998 0.9985 0.9996 0.9992 0.9998
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2843 0.1924 0.1740 0.1631 0.1837
## Detection Prevalence 0.2853 0.1927 0.1747 0.1635 0.1839
## Balanced Accuracy 0.9990 0.9967 0.9986 0.9977 0.9994
# Plot results
plot(confusionMatrixRF$table, col = confusionMatrixRF$byClass,
     main = paste("Figure 2: Random Forest Plot - Accuracy =",
                  round(confusionMatrixRF$overall["Accuracy"], 3)))
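To make the reported statistics concrete, the following sketch recomputes the accuracy, the estimated out-of-sample error, and the per-class sensitivity by hand from the random forest validation confusion matrix printed above. The counts are copied from that output; the variable names are illustrative.

```r
# Validation confusion matrix copied from the random forest output above
# (rows = predicted class, columns = reference class)
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(c(1673,    6,    0,   0,    0,
                  0, 1132,    2,   0,    0,
                  0,    0, 1024,   4,    0,
                  0,    1,    0, 960,    1,
                  1,    0,    0,   0, 1081),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = classes, Reference = classes))

accuracy <- sum(diag(cm)) / sum(cm)     # share of correct predictions
outOfSampleError <- 1 - accuracy        # estimated error on unseen data
sensitivity <- diag(cm) / colSums(cm)   # per-class recall

round(accuracy, 4)          # 0.9975
round(outOfSampleError, 4)  # 0.0025
round(sensitivity, 4)
```

Accuracy is 5870 correct predictions out of 5885 validation rows, which agrees with the 0.9975 reported by confusionMatrix.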
# Set seed for reproducibility
set.seed(1234)
# Create decision tree model
modelDT <- rpart(classe ~ ., data = trainSet, method = "class")
# Predict on the validation set
predictDT <- predict(modelDT, newdata = validationSet, type = "class")
confusionMatrixDT <- confusionMatrix(predictDT, validationSet$classe)
confusionMatrixDT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1481 170 18 60 79
## B 134 707 24 117 138
## C 9 51 901 130 68
## D 21 118 69 601 129
## E 29 93 14 56 668
##
## Overall Statistics
##
## Accuracy : 0.7405
## 95% CI : (0.7291, 0.7517)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6709
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8847 0.6207 0.8782 0.6234 0.6174
## Specificity 0.9223 0.9130 0.9469 0.9315 0.9600
## Pos Pred Value 0.8191 0.6313 0.7774 0.6407 0.7767
## Neg Pred Value 0.9527 0.9093 0.9736 0.9266 0.9176
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2517 0.1201 0.1531 0.1021 0.1135
## Detection Prevalence 0.3072 0.1903 0.1969 0.1594 0.1461
## Balanced Accuracy 0.9035 0.7668 0.9125 0.7775 0.7887
# Plot results
plot(confusionMatrixDT$table, col = confusionMatrixDT$byClass,
     main = paste("Figure 3: Decision Tree Plot - Accuracy =",
                  round(confusionMatrixDT$overall["Accuracy"], 3)))
# Set seed for reproducibility
set.seed(1234)
# Create generalized boosted model
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelGBM <- train(classe ~ ., data = trainSet, method = "gbm", trControl = controlGBM, verbose = FALSE)
modelGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.
# Predict on the validation set
predictGBM <- predict(modelGBM, newdata = validationSet)
confusionMatrixGBM <- confusionMatrix(predictGBM, validationSet$classe)
confusionMatrixGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1670 12 0 1 0
## B 3 1111 10 6 4
## C 0 12 1012 11 1
## D 1 4 4 944 5
## E 0 0 0 2 1072
##
## Overall Statistics
##
## Accuracy : 0.9871
## 95% CI : (0.9839, 0.9898)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9837
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9976 0.9754 0.9864 0.9793 0.9908
## Specificity 0.9969 0.9952 0.9951 0.9972 0.9996
## Pos Pred Value 0.9923 0.9797 0.9768 0.9854 0.9981
## Neg Pred Value 0.9990 0.9941 0.9971 0.9959 0.9979
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2838 0.1888 0.1720 0.1604 0.1822
## Detection Prevalence 0.2860 0.1927 0.1760 0.1628 0.1825
## Balanced Accuracy 0.9973 0.9853 0.9907 0.9882 0.9952
# Plot results
plot(confusionMatrixGBM$table, col = confusionMatrixGBM$byClass,
     main = paste("Figure 4: Generalized Boosted Model Plot - Accuracy =",
                  round(confusionMatrixGBM$overall["Accuracy"], 3)))
In this investigation, the validation accuracy of the three models, as reported in the confusion matrix outputs, is as follows:
* Random forest: 0.9975
* Decision tree: 0.7405
* GBM: 0.9871
Therefore, the random forest model is used to predict the 20 cases in the testing dataset.
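The model comparison can also be expressed in code. The sketch below ranks the three models by the validation accuracy printed in their confusion matrix outputs; the accuracies vector and the comparison data frame are illustrative names, and the values are typed in by hand.

```r
# Validation accuracies as printed in the confusion matrix outputs above
accuracies <- c(randomForest = 0.9975, decisionTree = 0.7405, gbm = 0.9871)

# Rank the models and report the estimated out-of-sample error of each
comparison <- data.frame(accuracy = accuracies,
                         outOfSampleError = 1 - accuracies)
ranked <- comparison[order(-comparison$accuracy), ]
ranked

# Name of the best-performing model
bestModel <- names(which.max(accuracies))
bestModel
```

In a live session, the same vector could be built from confusionMatrixRF$overall["Accuracy"] and its counterparts for the other two models rather than from typed values.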
# Store the result under its own name to avoid masking the predict() function
predictTest <- predict(modelRF, newdata = testing)
predictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E