Project Description and Goal

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict the manner in which they performed the exercise, recorded in the classe variable (levels A through E).

Read in Data

The following code loads the necessary packages, imports the data, and partitions it into a new training set (75%) and a new testing set (25%). Variables that contain missing data, along with timestamps and user names, are discarded because they do not provide useful information for prediction. For reproducibility, the pseudo-random seed is set to 1991.

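# Fix the pseudo-random seed so the partition and the forest are reproducible
set.seed(1991)
library(readr)
library(AppliedPredictiveModeling)
library(caret)

# Read the training data, skipping the row-index and formatted-timestamp columns
training <- read_csv("C:/Users/Kevin/Desktop/pml-training.csv",
                     col_types = cols(X1 = col_skip(),
                                      classe = col_factor(levels = c("A", "B", "C", "D", "E")),
                                      cvtd_timestamp = col_skip(),
                                      new_window = col_factor(levels = c("yes", "no"))),
                     trim_ws = FALSE)

# Drop the first five remaining columns (identifier-like fields such as user name and timestamps)
training <- training[, -c(1:5)]
# Keep only the columns that contain no missing values
new <- training[, colSums(is.na(training)) == 0]
# Split 75% / 25% on the outcome variable
inTrain <- createDataPartition(new$classe, p = .75)[[1]]
Newtraining <- new[inTrain, ]
Newtesting <- new[-inTrain, ]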
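
As a quick illustration of the missing-value filter used above, the following minimal sketch applies the same colSums(is.na(...)) == 0 idiom to a small made-up data frame. The data frame toy and its values are hypothetical and are not part of the project data.

```r
# Hypothetical toy data: one complete numeric column, one column with NAs,
# and a complete outcome column
toy <- data.frame(
  roll_belt = c(1.41, 1.42, 1.48),
  max_roll  = c(NA, 2.0, NA),
  classe    = factor(c("A", "B", "A"))
)

# Count the missing values in each column
na_counts <- colSums(is.na(toy))

# Keep only the columns with zero missing values
complete <- toy[, na_counts == 0]
names(complete)
```

In this toy case only roll_belt and classe survive the filter, because max_roll contains missing values.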

Building a Model

I chose a Random Forest model for this project because I have been doing personal research on decision trees, and because it performed well on this data, with high accuracy and specificity.

The following code builds a Random Forest model on the Newtraining set.

library(randomForest)
rffit <- randomForest(classe ~ ., data = Newtraining)
rffit
## 
## Call:
##  randomForest(formula = classe ~ ., data = Newtraining) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.54%
## Confusion matrix:
##      A    B    C    D    E class.error
## A 4178    4    2    0    1 0.001672640
## B   14 2829    5    0    0 0.006671348
## C    0   13 2550    4    0 0.006622517
## D    0    0   24 2386    2 0.010779436
## E    0    0    3    8 2695 0.004065041

As shown above, the out-of-bag (OOB) estimate of the error rate is 0.54%, which serves as an initial estimate of the expected out-of-sample error rate.
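
The OOB figure can be checked by hand from the confusion matrix printed in the model output. The sketch below re-enters that matrix (the variable names oob, oob_error, and class_error are my own) and computes the overall and per-class error rates in base R.

```r
# OOB confusion matrix copied from the randomForest output above
# (rows: actual class, columns: OOB-predicted class)
oob <- matrix(c(4178,    4,    2,    0,    1,
                  14, 2829,    5,    0,    0,
                   0,   13, 2550,    4,    0,
                   0,    0,   24, 2386,    2,
                   0,    0,    3,    8, 2695),
              nrow = 5, byrow = TRUE,
              dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Overall OOB error: share of off-diagonal (misclassified) observations
oob_error <- 1 - sum(diag(oob)) / sum(oob)
round(100 * oob_error, 2)

# Per-class error: misclassified share within each row
class_error <- 1 - diag(oob) / rowSums(oob)
round(class_error, 4)
```

With 80 of the 14718 training observations misclassified, the overall rate rounds to 0.54%, in agreement with the printed OOB estimate.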

Cross Validation

The following uses the held-out Newtesting data to estimate the out-of-sample error rate, along with accuracy, specificity, sensitivity, and related statistics.

rfpred <- predict(rffit, newdata = Newtesting)
confusionMatrix(Newtesting$classe, rfpred)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1394    0    0    0    1
##          B    8  939    2    0    0
##          C    0    5  848    2    0
##          D    0    0    7  797    0
##          E    0    0    0    5  896
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9939          
##                  95% CI : (0.9913, 0.9959)
##     No Information Rate : 0.2859          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9923          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9943   0.9947   0.9895   0.9913   0.9989
## Specificity            0.9997   0.9975   0.9983   0.9983   0.9988
## Pos Pred Value         0.9993   0.9895   0.9918   0.9913   0.9945
## Neg Pred Value         0.9977   0.9987   0.9978   0.9983   0.9998
## Prevalence             0.2859   0.1925   0.1748   0.1639   0.1829
## Detection Rate         0.2843   0.1915   0.1729   0.1625   0.1827
## Detection Prevalence   0.2845   0.1935   0.1743   0.1639   0.1837
## Balanced Accuracy      0.9970   0.9961   0.9939   0.9948   0.9988

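From the output above, the accuracy on the held-out data is 0.9939, which corresponds to an estimated out-of-sample error rate of about 0.61%; the 95% confidence interval for the accuracy is (0.9913, 0.9959). Note that confusionMatrix() expects the predictions as its first argument and the reference classes as its second. Because the call above passes them in the opposite order, the rows labeled Prediction are the actual classes and the Reference columns are the predicted ones. Accuracy and Kappa are unaffected by the order, but the per-class Sensitivity and Pos Pred Value (and likewise Specificity and Neg Pred Value) are swapped.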
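
The headline numbers can be reproduced from the printed table with base R. In the sketch below the matrix is entered exactly as printed, so its rows hold the first argument of the confusionMatrix() call (here the actual classes) and its columns hold the second (here the predictions). The interval uses binom.test(), an exact binomial interval that I assume is comparable to the one caret reports. The variable names are my own.

```r
# Hold-out confusion matrix copied from the confusionMatrix() output above
# (rows: actual classes, columns: predicted classes, given the call as written)
cm <- matrix(c(1394,   0,   0,   0,   1,
                  8, 939,   2,   0,   0,
                  0,   5, 848,   2,   0,
                  0,   0,   7, 797,   0,
                  0,   0,   0,   5, 896),
             nrow = 5, byrow = TRUE,
             dimnames = list(LETTERS[1:5], LETTERS[1:5]))

# Overall accuracy and the matching out-of-sample error estimate
correct  <- sum(diag(cm))
total    <- sum(cm)
accuracy <- correct / total
round(accuracy, 4)
round(100 * (1 - accuracy), 2)

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(correct, total)$conf.int
round(ci, 4)

# Share of each actual class that was predicted correctly (row-wise)
recall_by_actual <- diag(cm) / rowSums(cm)

# Column-wise ratio, which is what the output above labels "Sensitivity"
reported_sensitivity <- diag(cm) / colSums(cm)
round(rbind(recall_by_actual, reported_sensitivity), 4)
```

For class A, the row-wise value is 1394/1395 (about 0.9993) while the column-wise value is 1394/1402 (about 0.9943), which shows how the argument order changes the per-class statistics while leaving the overall accuracy unchanged.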

Test Data Predictions

The following provides predictions for the 20 cases in the unused test data supplied with the original project instructions.

testing <- read_csv("C:/Users/Kevin/Desktop/pml-testing.csv")
predict(rffit, newdata = testing)
##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E