Here we are tasked with creating a model that predicts the manner in which test subjects completed a particular exercise. Each subject performed the same exercise using several different techniques, while wearing various fitness measuring devices that collected data during the exercise. Our model should use this sensor data to predict the manner/technique in which they exercised.
For this assignment we will need the help of several libraries. Most importantly, we will need access to the caret package, which will allow us to train our models.
library("ggplot2")
library("kernlab")
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library("caret")
## Loading required package: lattice
library("randomForest")
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
Now we need to read in our data and partition it to create a training set and a test set. I chose to use 60 percent of the data for the training set.
Lift<-read.csv("weightlift.csv")
set.seed(77)
inTraining <- createDataPartition(y=Lift$classe, p=0.6, list=FALSE)
trainingData <- Lift[inTraining,]
testData <- Lift[-inTraining,]
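As a quick sanity check (not part of the original output), we can verify that createDataPartition stratified the split, since it samples within each level of classe:
# Check that the class proportions match across the two partitions;
# createDataPartition samples within each level of classe.
round(prop.table(table(trainingData$classe)), 3)
round(prop.table(table(testData$classe)), 3)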
Here we set up our cross-validation process to be four-fold. This will be used when we create our first model.
trainOpts <- trainControl(method="cv", number=4)  # four-fold cross-validation
The following code trains a random forest with four-fold cross-validation. While the random forest method takes some time to run, it produces very accurate results and lets us rank the variables by importance.
rfModel <- train(classe~., data=trainingData, method="rf", trControl=trainOpts)
rfModel
## Random Forest
##
## 11776 samples
## 55 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8832, 8832, 8832, 8832
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9980469 0.9975296
## 28 0.9998302 0.9997852
## 55 0.9996603 0.9995704
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
Here is the confusion matrix for our first model, built from the out-of-bag predictions stored in the final model. Looking at the output, we can see that the first model achieves nearly perfect accuracy (about 99.99 percent) on the data it was tested against.
cm <- confusionMatrix(rfModel$finalModel$predicted,trainingData$classe)
cm$table
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2278 0 0 0
## C 0 1 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
cm$overall[1]
## Accuracy
## 0.9999151
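Before trimming variables, it is worth sketching how the first model could be scored on the held-out testData; this step is not in the original analysis, but it would give an out-of-sample estimate for the full model:
# Hedged sketch: score the full model on the untouched hold-out set.
# This must run before testData is subset to fewer columns below.
holdoutPredict <- predict(rfModel, newdata=testData)
confusionMatrix(holdoutPredict, testData$classe)$overall[1]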
This shows us the most important variables. Currently we have 55 predictors in the model; we will trim that down to the 11 most important, as the next step shows.
priority <- varImp(rfModel)
print(priority)
## rf variable importance
##
## only 20 most important variables shown (out of 55)
##
## Overall
## X 100.0000
## roll_belt 7.0328
## pitch_forearm 2.3663
## num_window 1.8351
## accel_belt_z 1.6315
## roll_dumbbell 1.0912
## accel_forearm_x 0.8214
## magnet_dumbbell_y 0.7687
## magnet_belt_y 0.7184
## total_accel_belt 0.5719
## magnet_dumbbell_z 0.4941
## yaw_belt 0.4713
## magnet_dumbbell_x 0.4635
## pitch_belt 0.4432
## accel_dumbbell_y 0.4111
## magnet_belt_z 0.3839
## roll_forearm 0.2690
## total_accel_dumbbell 0.2430
## magnet_arm_x 0.2280
## accel_dumbbell_z 0.2112
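caret can also plot these rankings directly; a minimal sketch, using the lattice-based plot method for varImp objects:
# Visualize the importance ranking; top= limits the plot to the
# leading variables, mirroring the printed table above.
plot(priority, top=20)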
We sort the variables by importance, keep the top 11, and subset testData down to those columns (plus classe).
importanceOrder <- order(priority$importance$Overall, decreasing=TRUE)  # rank variables by importance
remain <- row.names(priority$importance)[importanceOrder[1:11]]         # names of the top 11
testData <- testData[,c(remain,"classe")]
colnames(testData)
## [1] "X" "roll_belt" "pitch_forearm"
## [4] "num_window" "accel_belt_z" "roll_dumbbell"
## [7] "accel_forearm_x" "magnet_dumbbell_y" "magnet_belt_y"
## [10] "total_accel_belt" "magnet_dumbbell_z" "classe"
Now we partition the reduced data set, using 40 percent of it to train a second model and holding out the remainder for testing.
inTraining <- createDataPartition(y=testData$classe, p=0.4, list=FALSE)
trainingAgain <- testData[inTraining,]
testing <- testData[-inTraining,]
rfModel2 <- train(classe ~ ., data = trainingAgain, method="rf", trControl=trainOpts)
rfModel2
## Random Forest
##
## 3141 samples
## 11 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 2355, 2356, 2356, 2356
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9980896 0.9975838
## 6 0.9984080 0.9979863
## 11 0.9974534 0.9967787
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
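If we want to pull the selected tuning parameter out programmatically rather than reading it from the printout, caret stores it in the fitted object; a small sketch:
rfModel2$bestTune   # tuning parameter chosen by cross-validation (mtry = 6 above)
rfModel2$results    # accuracy/kappa for each mtry value tried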
testPredict <- predict(rfModel2, trainingAgain)
confusionMatrix(testPredict, trainingAgain$classe)$table
## Reference
## Prediction A B C D E
## A 893 0 0 0 0
## B 0 608 0 0 0
## C 0 0 548 0 0
## D 0 0 0 515 0
## E 0 0 0 0 577
The in-sample accuracy of the final model is 100 percent.
insample <- confusionMatrix(testPredict, trainingAgain$classe)$overall[1]
insample
## Accuracy
## 1
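In-sample accuracy is an optimistic estimate. As a less biased check, caret keeps the per-fold cross-validation results in the fitted object; a short sketch:
# Per-fold accuracy from the four cross-validation resamples; the
# mean is a less optimistic estimate than the in-sample figure.
rfModel2$resample
colMeans(rfModel2$resample[, c("Accuracy", "Kappa")])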
testPredict <- predict(rfModel2,testing)
confusionMatrix(testPredict,testing$classe)$table
## Reference
## Prediction A B C D E
## A 1337 0 0 0 0
## B 2 910 0 0 0
## C 0 0 820 0 0
## D 0 0 0 771 0
## E 0 0 0 0 865
The out-of-sample accuracy for our second model is about 99.96 percent.
outsample<-confusionMatrix(testPredict,testing$classe)$overall[1]
outsample
## Accuracy
## 0.9995749
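Finally, a minimal sketch of how the second model could be applied to new, unlabeled sensor data. The file name newlifts.csv is hypothetical; the file would need to contain the same 11 predictor columns kept above:
# Hypothetical scoring example: "newlifts.csv" is a placeholder name.
# The new data must include the 11 columns whose names are in `remain`.
newData <- read.csv("newlifts.csv")
predict(rfModel2, newdata=newData[, remain])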