Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or simply because they are tech geeks.
Participants in this project were asked to perform barbell lifts correctly and incorrectly in five different ways. The goal of this project is to predict the manner in which they did the exercise, which is recorded in the “classe” variable of the training set.
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
# Load the training and test datasets
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
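Note that the raw CSVs encode missing values not only as NA but also (an assumption based on common versions of these files, worth verifying against your copy) as empty strings and "#DIV/0!" entries; a variant of the load step that normalizes all of these to NA would be:
# Treat empty strings and spreadsheet division errors as NA as well,
# so the missing-value filter below catches them (assumed encodings)
training <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
testing <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))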
First, we partition the provided training data into a training set (70%) and a validation set (30%). The provided test dataset will remain untouched and will be used only for the final predictions.
set.seed(11111)  # for a reproducible partition
train_partition <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train_set <- training[train_partition, ]
test_set <- training[-train_partition, ]
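As a quick sanity check (not part of the original output), the split sizes can be confirmed:
# Expect roughly 70% and 30% of the rows, with all five classe levels in each
dim(train_set)
dim(test_set)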
Next, we remove variables that are mostly NA (more than 95% missing values):
NAs <- sapply(train_set, function(x) mean(is.na(x))) > 0.95
train_set <- train_set[, NAs == FALSE]
test_set <- test_set[, NAs == FALSE]
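To see how many columns this filter drops:
sum(NAs)  # number of mostly-NA columns removed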
We then remove the variables that have near-zero variance:
nearZeroVariance <- nearZeroVar(train_set)
train_set <- train_set[, -nearZeroVariance]
test_set <- test_set[, -nearZeroVariance]
Finally, we remove the identification-only variables (columns 1 to 5):
train_set <- train_set[, -(1:5)]
test_set <- test_set[, -(1:5)]
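After these three cleaning steps the data should contain 53 predictors plus classe (consistent with the GBM model summary later in this report), which can be verified with:
dim(train_set)  # expect 54 columns: 53 predictors + classe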
Two models will be used to predict the ‘classe’ variable: a Random Forest and a Generalized Boosted Model (GBM). Whichever achieves the higher accuracy on the validation set will be used for the quiz portion of the assignment.
set.seed(11111)
controlRandForest <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
modelFitRandForest <- train(classe ~ ., data = train_set, method = "rf", trControl = controlRandForest)
modelFitRandForest$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.25%
## Confusion matrix:
## A B C D E class.error
## A 3903 2 0 0 1 0.0007680492
## B 7 2647 3 1 0 0.0041384500
## C 0 5 2390 1 0 0.0025041736
## D 0 0 6 2245 1 0.0031083481
## E 0 1 0 6 2518 0.0027722772
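To see how the cross-validated accuracy varied with the number of randomly selected predictors (the final model used mtry = 27), the resampling results can be inspected; this is standard caret functionality, not part of the original output:
# Cross-validated accuracy for each value of mtry that caret tried
modelFitRandForest$results
plot(modelFitRandForest)  # accuracy vs. number of randomly selected predictors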
predictRandForest <- predict(modelFitRandForest, newdata = test_set)
confusionMatrixRandForest <- confusionMatrix(predictRandForest, test_set$classe)
confusionMatrixRandForest
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 7 0 0 0
## B 0 1132 2 0 0
## C 0 0 1024 6 0
## D 0 0 0 958 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 0.9975
## 95% CI : (0.9958, 0.9986)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9968
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9939 0.9981 0.9938 1.0000
## Specificity 0.9983 0.9996 0.9988 1.0000 1.0000
## Pos Pred Value 0.9958 0.9982 0.9942 1.0000 1.0000
## Neg Pred Value 1.0000 0.9985 0.9996 0.9988 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1924 0.1740 0.1628 0.1839
## Detection Prevalence 0.2856 0.1927 0.1750 0.1628 0.1839
## Balanced Accuracy 0.9992 0.9967 0.9984 0.9969 1.0000
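The expected out-of-sample error implied by this validation accuracy is about 0.25% (1 - 0.9975), which agrees with the OOB error estimate reported by randomForest above:
# Estimated out-of-sample error = 1 - validation-set accuracy
1 - confusionMatrixRandForest$overall["Accuracy"]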
set.seed(11111)
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelFitGBM <- train(classe ~ ., data = train_set, method = "gbm", trControl = controlGBM, verbose = FALSE)
modelFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.
predictGBM <- predict(modelFitGBM, newdata = test_set)
confusionMatrixGBM <- confusionMatrix(predictGBM, test_set$classe)
confusionMatrixGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1669 20 0 0 0
## B 4 1095 4 1 2
## C 0 24 1021 15 2
## D 1 0 0 947 10
## E 0 0 1 1 1068
##
## Overall Statistics
##
## Accuracy : 0.9856
## 95% CI : (0.9822, 0.9884)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9817
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9614 0.9951 0.9824 0.9871
## Specificity 0.9953 0.9977 0.9916 0.9978 0.9996
## Pos Pred Value 0.9882 0.9901 0.9614 0.9885 0.9981
## Neg Pred Value 0.9988 0.9908 0.9990 0.9965 0.9971
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1861 0.1735 0.1609 0.1815
## Detection Prevalence 0.2870 0.1879 0.1805 0.1628 0.1818
## Balanced Accuracy 0.9961 0.9795 0.9933 0.9901 0.9933
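For a direct side-by-side comparison of the two models on the validation set:
# Validation accuracy of each model, pulled from the confusion matrices above
data.frame(Model = c("Random Forest", "GBM"),
           Accuracy = c(confusionMatrixRandForest$overall["Accuracy"],
                        confusionMatrixGBM$overall["Accuracy"]))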
From the results above, the Random Forest model (validation accuracy 0.9975) is more accurate than the Generalized Boosted Model (0.9856). We therefore use the Random Forest model to predict the 20 test cases:
predictTest <- predict(modelFitRandForest, newdata = testing)
predictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
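If the quiz requires one text file per prediction, a small helper in the spirit of the course's suggested submission script (a sketch; the exact file-naming requirements should be checked) can write them out:
# Write one prediction per file: problem_1.txt, problem_2.txt, ...
pml_write_files <- function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(predictTest)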