Here we are tasked with creating a model that predicts the manner in which test subjects completed a particular exercise. Each subject performed the same exercise using several different techniques, while wearing various fitness measuring devices that collected data during the exercise. Our model should use this sensor data to predict the manner/technique in which they exercised.
For this assignment we will need the help of several libraries. Most importantly, we will need access to the caret package, which will allow us to train our models.
library("ggplot2")
library("kernlab")
##
## Attaching package: 'kernlab'
## The following object is masked from 'package:ggplot2':
##
## alpha
library("caret")
## Loading required package: lattice
library("randomForest")
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
Now we need to read in our data and partition it to create a training set and a test set. I chose to use 60 percent of the data for the training set.
Lift<-read.csv("weightlift.csv")
set.seed(77)
inTraining <- createDataPartition(y=Lift$classe, p=0.6, list=FALSE)
trainingData <- Lift[inTraining,]
testData <- Lift[-inTraining,]
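As a quick sanity check (not part of the original output), we can verify that createDataPartition stratified the split, since it samples within each level of classe:
# Check that the class proportions match across the two partitions;
# createDataPartition samples within each level of classe.
round(prop.table(table(trainingData$classe)), 3)
round(prop.table(table(testData$classe)), 3)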
Here we set up our cross-validation process to be four-fold. This will be used when we create our first model.
trainOpts <- trainControl(method="cv", number=4)  # four-fold cross-validation
The following code trains a random forest with four-fold cross-validation. While the random forest method takes some time to run, it produces very accurate results and lets us rank the variables by importance.
rfModel <- train(classe~., data=trainingData, method="rf", trControl=trainOpts)
rfModel
## Random Forest
##
## 11776 samples
## 55 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 8832, 8832, 8832, 8832
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9980469 0.9975296
## 28 0.9998302 0.9997852
## 55 0.9996603 0.9995704
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 28.
Here is the confusion matrix for our first model, built from the out-of-bag predictions stored in the final model. Looking at the output, we can see that the first model achieves nearly perfect accuracy (about 99.99 percent) on the data it was tested against.
cm <- confusionMatrix(rfModel$finalModel$predicted,trainingData$classe)
cm$table
## Reference
## Prediction A B C D E
## A 3348 0 0 0 0
## B 0 2278 0 0 0
## C 0 1 2054 0 0
## D 0 0 0 1930 0
## E 0 0 0 0 2165
cm$overall[1]
## Accuracy
## 0.9999151
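Before trimming variables, it is worth sketching how the first model could be scored on the held-out testData; this step is not in the original analysis, but it would give an out-of-sample estimate for the full model:
# Hedged sketch: score the full model on the untouched hold-out set.
# This must run before testData is subset to fewer columns below.
holdoutPredict <- predict(rfModel, newdata=testData)
confusionMatrix(holdoutPredict, testData$classe)$overall[1]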
This shows us the most important variables. Currently we have 55 predictors in the model; we will trim that down to the 11 most important, as the next step shows.
priority <- varImp(rfModel)
print(priority)
## rf variable importance
##
## only 20 most important variables shown (out of 55)
##
## Overall
## X 100.0000
## roll_belt 7.0328
## pitch_forearm 2.3663
## num_window 1.8351
## accel_belt_z 1.6315
## roll_dumbbell 1.0912
## accel_forearm_x 0.8214
## magnet_dumbbell_y 0.7687
## magnet_belt_y 0.7184
## total_accel_belt 0.5719
## magnet_dumbbell_z 0.4941
## yaw_belt 0.4713
## magnet_dumbbell_x 0.4635
## pitch_belt 0.4432
## accel_dumbbell_y 0.4111
## magnet_belt_z 0.3839
## roll_forearm 0.2690
## total_accel_dumbbell 0.2430
## magnet_arm_x 0.2280
## accel_dumbbell_z 0.2112
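caret can also plot these rankings directly; a minimal sketch, using the lattice-based plot method for varImp objects:
# Visualize the importance ranking; top= limits the plot to the
# leading variables, mirroring the printed table above.
plot(priority, top=20)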
We sort the variables by importance, keep the top 11, and subset testData down to those columns (plus classe).
importanceOrder <- order(priority$importance$Overall, decreasing=TRUE)  # rank variables by importance
remain <- row.names(priority$importance)[importanceOrder[1:11]]         # names of the top 11
testData <- testData[,c(remain,"classe")]
colnames(testData)
## [1] "X" "roll_belt" "pitch_forearm"
## [4] "num_window" "accel_belt_z" "roll_dumbbell"
## [7] "accel_forearm_x" "magnet_dumbbell_y" "magnet_belt_y"
## [10] "total_accel_belt" "magnet_dumbbell_z" "classe"
Now we partition the reduced data set, using 40 percent of it to train a second model and holding out the remainder for testing.
inTraining <- createDataPartition(y=testData$classe, p=0.4, list=FALSE)
trainingAgain <- testData[inTraining,]
testing <- testData[-inTraining,]
rfModel2 <- train(classe ~ ., data = trainingAgain, method="rf", trControl=trainOpts)
rfModel2
## Random Forest
##
## 3141 samples
## 11 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (4 fold)
## Summary of sample sizes: 2355, 2356, 2356, 2356
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9980896 0.9975838
## 6 0.9984080 0.9979863
## 11 0.9974534 0.9967787
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
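If we want to pull the selected tuning parameter out programmatically rather than reading it from the printout, caret stores it in the fitted object; a small sketch:
rfModel2$bestTune   # tuning parameter chosen by cross-validation (mtry = 6 above)
rfModel2$results    # accuracy/kappa for each mtry value tried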
testPredict <- predict(rfModel2, trainingAgain)
confusionMatrix(testPredict, trainingAgain$classe)$table
## Reference
## Prediction A B C D E
## A 893 0 0 0 0
## B 0 608 0 0 0
## C 0 0 548 0 0
## D 0 0 0 515 0
## E 0 0 0 0 577
The in-sample accuracy of the final model is 100 percent.
insample <- confusionMatrix(testPredict, trainingAgain$classe)$overall[1]
insample
## Accuracy
## 1
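In-sample accuracy is an optimistic estimate. As a less biased check, caret keeps the per-fold cross-validation results in the fitted object; a short sketch:
# Per-fold accuracy from the four cross-validation resamples; the
# mean is a less optimistic estimate than the in-sample figure.
rfModel2$resample
colMeans(rfModel2$resample[, c("Accuracy", "Kappa")])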
testPredict <- predict(rfModel2,testing)
confusionMatrix(testPredict,testing$classe)$table
## Reference
## Prediction A B C D E
## A 1337 0 0 0 0
## B 2 910 0 0 0
## C 0 0 820 0 0
## D 0 0 0 771 0
## E 0 0 0 0 865
The out-of-sample accuracy for our second model is about 99.96 percent.
outsample<-confusionMatrix(testPredict,testing$classe)$overall[1]
outsample
## Accuracy
## 0.9995749
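Finally, a minimal sketch of how the second model could be applied to new, unlabeled sensor data. The file name newlifts.csv is hypothetical; the file would need to contain the same 11 predictor columns kept above:
# Hypothetical scoring example: "newlifts.csv" is a placeholder name.
# The new data must include the 11 columns whose names are in `remain`.
newData <- read.csv("newlifts.csv")
predict(rfModel2, newdata=newData[, remain])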