People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how well each exercise was performed, as recorded in the classe variable (classes A through E).
The following code sets the seed, loads the necessary packages, imports the data, and partitions it into a new training set (75%) and a new testing set (25%). Timestamps, user names, and variables with missing data are discarded because they are not expected to contribute to accurate predictions. For reproducibility, the pseudo-random seed is set to 1991.
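set.seed(1991)  # fix the pseudo-random seed for reproducibility
library(readr)
library(AppliedPredictiveModeling)
library(caret)
# Import the training data, skipping the row index and the converted timestamp
training <- read_csv("C:/Users/Kevin/Desktop/pml-training.csv",
    col_types = cols(X1 = col_skip(), classe = col_factor(levels = c("A",
        "B", "C", "D", "E")), cvtd_timestamp = col_skip(),
        new_window = col_factor(levels = c("yes",
            "no"))), trim_ws = FALSE)
# Drop the first five remaining columns, which include the user name and timestamps
training <- training[, -c(1:5)]
# Keep only the variables that have no missing values
new <- training[, colSums(is.na(training)) == 0]
# Stratified 75/25 split on classe
inTrain <- createDataPartition(new$classe, p = .75)[[1]]
Newtraining <- new[inTrain, ]
Newtesting <- new[-inTrain, ]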
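To make the two cleaning steps concrete, here is a minimal base-R sketch on a toy data frame. The column values are made up, the names toy, toy_clean, toy_train, and toy_test are hypothetical, and the hand-rolled stratified split only approximates what createDataPartition does; it is not caret's implementation.

```r
# Toy data frame (hypothetical values, not the project data)
toy <- data.frame(
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.7, 1.1, 1.8),
  max_roll  = c(NA, 2.1, NA, NA, NA, 2.4, NA, NA),  # mostly missing
  classe    = factor(c("A", "A", "A", "A", "B", "B", "B", "B"))
)

# Keep only columns with zero missing values, as in the report
complete_cols <- colSums(is.na(toy)) == 0
toy_clean <- toy[, complete_cols]

# Stratified 75% split: sample 75% of the row indices within each class
set.seed(1991)
idx_by_class <- split(seq_len(nrow(toy_clean)), toy_clean$classe)
sampled <- lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
})
in_train <- sort(unlist(sampled, use.names = FALSE))

toy_train <- toy_clean[in_train, ]
toy_test  <- toy_clean[-in_train, ]
```

Sampling within each class keeps the class proportions of the training and testing sets close to those of the full data, which is the reason for splitting on classe rather than on row number.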
I chose a Random Forest model for this project because I have been studying decision trees on my own, and because it achieved high accuracy and specificity on this data.
The following code builds a Random Forest model on the Newtraining set.
library(randomForest)
rffit <- randomForest(classe ~ ., data = Newtraining)
rffit
##
## Call:
## randomForest(formula = classe ~ ., data = Newtraining)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.54%
## Confusion matrix:
## A B C D E class.error
## A 4178 4 2 0 1 0.001672640
## B 14 2829 5 0 0 0.006671348
## C 0 13 2550 4 0 0.006622517
## D 0 0 24 2386 2 0.010779436
## E 0 0 3 8 2695 0.004065041
As shown above, the out-of-bag (OOB) estimate of the out-of-sample error rate is 0.54%.
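The OOB figure can be checked by hand from the confusion matrix printed above: it is the number of off-diagonal (misclassified) cases divided by the total. The sketch below re-enters the printed counts; the names oob_cm, misclassified, oob_error, and class_error are hypothetical.

```r
# OOB confusion matrix copied from the randomForest printout
# (rows are the actual classes, columns the OOB predictions)
oob_cm <- matrix(
  c(4178,    4,    2,    0,    1,
      14, 2829,    5,    0,    0,
       0,   13, 2550,    4,    0,
       0,    0,   24, 2386,    2,
       0,    0,    3,    8, 2695),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

# Overall OOB error: misclassified cases over all cases
misclassified <- sum(oob_cm) - sum(diag(oob_cm))
oob_error <- misclassified / sum(oob_cm)

# Per-class error, matching the class.error column of the printout
class_error <- 1 - diag(oob_cm) / rowSums(oob_cm)

round(100 * oob_error, 2)  # 0.54 (80 of 14718 cases)
```

With the fitted object at hand, the same value should also be available as rffit$err.rate[rffit$ntree, "OOB"].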
The following uses the Newtesting data to estimate the out-of-sample error rate, accuracy, specificity, sensitivity, and related statistics.
rfpred <- predict(rffit, newdata = Newtesting)
confusionMatrix(Newtesting$classe, rfpred)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1394 0 0 0 1
## B 8 939 2 0 0
## C 0 5 848 2 0
## D 0 0 7 797 0
## E 0 0 0 5 896
##
## Overall Statistics
##
## Accuracy : 0.9939
## 95% CI : (0.9913, 0.9959)
## No Information Rate : 0.2859
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9923
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9943 0.9947 0.9895 0.9913 0.9989
## Specificity 0.9997 0.9975 0.9983 0.9983 0.9988
## Pos Pred Value 0.9993 0.9895 0.9918 0.9913 0.9945
## Neg Pred Value 0.9977 0.9987 0.9978 0.9983 0.9998
## Prevalence 0.2859 0.1925 0.1748 0.1639 0.1829
## Detection Rate 0.2843 0.1915 0.1729 0.1625 0.1827
## Detection Prevalence 0.2845 0.1935 0.1743 0.1639 0.1837
## Balanced Accuracy 0.9970 0.9961 0.9939 0.9948 0.9988
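From the output above, the accuracy on the held-out Newtesting set is 0.9939, with a 95% confidence interval of (0.9913, 0.9959), which corresponds to an estimated out-of-sample error rate of about 0.61%. Note that confusionMatrix expects the predictions as its first argument and the true classes as its second; because the call above passes them in the opposite order, the rows labeled Prediction are the actual classes and the columns labeled Reference are the predicted ones. Accuracy and Kappa are unaffected, but the per-class Sensitivity and Pos Pred Value (and likewise Specificity and Neg Pred Value) are interchanged. Calling confusionMatrix(rfpred, Newtesting$classe) would give the conventional orientation.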
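As a rough cross-check, the headline numbers can be recomputed in base R from the table printed above. Because Newtesting$classe was passed as the first argument, the rows of that table are the actual classes and the columns are the predictions. The names test_cm, recall, and precision are hypothetical, and the interval here comes from the exact binomial test in base R, which reproduces the reported interval to the precision shown.

```r
# Held-out confusion matrix copied from the confusionMatrix printout
# (rows are the actual classes, columns the predicted classes)
test_cm <- matrix(
  c(1394,   0,   0,   0,   1,
       8, 939,   2,   0,   0,
       0,   5, 848,   2,   0,
       0,   0,   7, 797,   0,
       0,   0,   0,   5, 896),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = LETTERS[1:5], predicted = LETTERS[1:5])
)

correct  <- sum(diag(test_cm))  # correctly classified cases
n        <- sum(test_cm)        # all held-out cases
accuracy <- correct / n

# Exact (Clopper-Pearson) 95% confidence interval for the accuracy
ci <- binom.test(correct, n)$conf.int

# Per-class recall (sensitivity) and precision (positive predictive value)
recall    <- diag(test_cm) / rowSums(test_cm)
precision <- diag(test_cm) / colSums(test_cm)

round(accuracy, 4)      # 0.9939
round(1 - accuracy, 4)  # 0.0061, the estimated out-of-sample error
```

Here recall for class A is 1394/1395, which appears under Pos Pred Value in the printout, while precision for class A is 1394/1402, which appears under Sensitivity.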
The following provides the predictions for the 20 unused test cases supplied with the original project instructions.
testing <- read_csv("C:/Users/Kevin/Desktop/pml-testing.csv")
predict(rffit, newdata = testing)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E