Project IV

Introduction

The data for this experiment is the dataset “weightlift.csv”, which consists of data from accelerometers on the belt, forearm, arm, and dumbbell of six participants. These young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D), and throwing the hips to the front (Class E).

Weightlifting Prediction Model: Variable Selection

To start, I read about the different variables to understand which ones were likely to matter most for a prediction model. First, I removed the variables used only for indexing and time-window bookkeeping (“X”, “new_window”, and “num_window”). Then I removed other variables that, to my knowledge, were unlikely to affect the prediction model: the gyroscope, accelerometer, and magnetometer readings on the arm, the x-axis readings for the belt, and the z-axis readings for the forearm.

Narrow Down Variables Before Training

colnames(Data)

##  [1] "X"                    "new_window"           "num_window"          
##  [4] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [7] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
## [10] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [13] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [16] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [19] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [22] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [25] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [28] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [31] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [34] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [37] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [40] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [43] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [46] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [49] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [52] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [55] "magnet_forearm_z"     "classe"

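# Drop the index/window columns (1:3), the belt x-axis readings (8, 11, 14),
# the arm gyros/accel/magnet readings (21:29), and the forearm z-axis readings (49, 52, 55)
Data <- Data[, -c(1:3, 8, 11, 14, 21:29, 49, 52, 55)]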
colnames(Data)

##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_y"         "gyros_belt_z"        
##  [7] "accel_belt_y"         "accel_belt_z"         "magnet_belt_y"       
## [10] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [13] "yaw_arm"              "total_accel_arm"      "roll_dumbbell"       
## [16] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [19] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [22] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [25] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [28] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [31] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [34] "accel_forearm_x"      "accel_forearm_y"      "magnet_forearm_x"    
## [37] "magnet_forearm_y"     "classe"
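
Removing columns by position works, but the positions shift whenever the column order changes. A minimal sketch of the same idea using column names, on a small made-up data frame (toyData, dropCols, reducedData, and all values are hypothetical; only the column names come from the listing above):

```r
# Hypothetical toy data frame; only the column names are taken from the report
toyData <- data.frame(
  X = 1:3,
  new_window = c("no", "no", "yes"),
  num_window = c(11, 11, 12),
  roll_belt = c(1.41, 1.42, 1.48),
  gyros_belt_x = c(0.00, 0.02, 0.00),
  classe = factor(c("A", "A", "B"))
)

# Columns to drop, listed by name instead of by position
dropCols <- c("X", "new_window", "num_window", "gyros_belt_x")

# Keep every column whose name is not in dropCols
reducedData <- toyData[, !(colnames(toyData) %in% dropCols), drop = FALSE]
colnames(reducedData)
```

Listing names also makes it easier to check at a glance which sensors were left out.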

Training Initiation

Set Up Training Options

I set up 3-fold cross-validation.

library(caret)

# 3-fold cross-validation for every call to train()
trainOptions <- trainControl(method = "cv", number = 3)
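
For intuition, 3-fold cross-validation splits the rows into three groups and holds out each group once while fitting on the other two. A rough base-R sketch of the fold assignment (this is not caret's internal code, and nRows, foldId, fitRows, and assessRows are illustrative names):

```r
set.seed(444)
nRows <- 12                                      # made-up number of rows
foldId <- sample(rep(1:3, length.out = nRows))   # random fold label per row

for (k in 1:3) {
  assessRows <- which(foldId == k)   # rows held out in round k
  fitRows <- which(foldId != k)      # rows used to fit in round k
  cat("Fold", k, "- fit on", length(fitRows),
      "rows, assess on", length(assessRows), "rows\n")
}
```

Each row is held out exactly once, so the three accuracy estimates together cover the whole training set.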

Clean Data Set and Creation of Partitions for Training

Next, I partition the data, using only 30% of it for training and holding out the rest for validation.

set.seed(444)
inTraining <- createDataPartition(y=Data$classe, p=0.3, list=FALSE)
trainingData <- Data[inTraining,]
validationData <- Data[-inTraining,]
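
createDataPartition samples within each level of classe, so the class balance of the training rows roughly matches that of the full data. A base-R sketch of that idea on made-up labels (an approximation for illustration, not caret's implementation; toyClasse, rowsByClass, and inTrainingSketch are hypothetical names):

```r
set.seed(444)

# Made-up class labels standing in for Data$classe
toyClasse <- factor(rep(c("A", "B", "C", "D", "E"),
                        times = c(50, 40, 30, 20, 10)))

# Sample 30% of the row indices within each class
p <- 0.3
rowsByClass <- split(seq_along(toyClasse), toyClasse)
inTrainingSketch <- sort(unlist(lapply(rowsByClass, function(idx) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}), use.names = FALSE))

# Class counts in the sampled rows and in the held-out rows
table(toyClasse[inTrainingSketch])
table(toyClasse[-inTrainingSketch])
```

Sampling per class keeps a rare class from being left out of the training rows by chance.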

Training Model

I used the randomForest library (through caret's train function) to build a more accurate model with the variables I decided to keep.

# Fit on the 30% training partition only, so the held-out rows stay unseen
rfModel <- train(classe ~ ., data = trainingData, method = "rf", trControl = trainOptions)

After creating the first model with the training data, I looked at which variables had the most influence on the model, so that I could get a more accurate prediction and drop the variables that do not affect the model as much. I decided to keep the 8 most important variables (those with an importance score above 50).

Getting Top Predictors

Vpriority <- varImp(rfModel)
print(Vpriority)

## rf variable importance
## 
##   only 20 most important variables shown (out of 37)
## 
##                   Overall
## roll_belt          100.00
## yaw_belt            87.30
## magnet_dumbbell_z   74.93
## pitch_belt          65.53
## pitch_forearm       61.71
## magnet_dumbbell_y   61.15
## roll_forearm        53.67
## magnet_dumbbell_x   52.35
## accel_dumbbell_y    42.38
## magnet_belt_z       40.62
## roll_dumbbell       38.63
## accel_belt_z        37.95
## magnet_belt_y       37.11
## accel_dumbbell_z    34.41
## accel_forearm_x     31.88
## roll_arm            30.60
## yaw_arm             27.66
## gyros_belt_z        27.11
## accel_dumbbell_x    24.52
## yaw_dumbbell        24.27

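# Rank the predictors by importance, highest first
# (named ranking rather than sort to avoid masking base::sort)
ranking <- order(Vpriority$importance$Overall, decreasing = TRUE)
keep <- row.names(Vpriority$importance)[ranking[1:8]]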

validationData <- validationData[,c(keep,"classe")]

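The variables kept were roll_belt, yaw_belt, magnet_dumbbell_z, pitch_belt, pitch_forearm, magnet_dumbbell_y, roll_forearm, and magnet_dumbbell_x.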
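
The selection step can be tried out without refitting the forest. The sketch below rebuilds a small importance table from ten of the scores printed above (importanceToy, topEight, and aboveCutoff are hypothetical names, and the rows are deliberately shuffled) and compares taking the top 8 with keeping everything above a cutoff of 50:

```r
# Importance scores copied from the printed table, deliberately out of order
importanceToy <- data.frame(
  Overall = c(61.71, 100.00, 42.38, 74.93, 52.35, 87.30,
              40.62, 65.53, 61.15, 53.67),
  row.names = c("pitch_forearm", "roll_belt", "accel_dumbbell_y",
                "magnet_dumbbell_z", "magnet_dumbbell_x", "yaw_belt",
                "magnet_belt_z", "pitch_belt", "magnet_dumbbell_y",
                "roll_forearm")
)

# Option 1: take the 8 highest-scoring predictors
ranking <- order(importanceToy$Overall, decreasing = TRUE)
topEight <- row.names(importanceToy)[ranking[1:8]]

# Option 2: keep every predictor scoring above a cutoff
aboveCutoff <- row.names(importanceToy)[importanceToy$Overall > 50]

topEight
setequal(topEight, aboveCutoff)
```

For these scores the two rules pick the same predictors, because the eighth score is 52.35 and the ninth is 42.38.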

Training Continuation

Partition Validation Data Set

I then partition the validation data again, this time putting 60% of it into a new training set and the remaining 40% into a test set.

inTraining <- createDataPartition(y=validationData$classe, p=0.6, list=FALSE)
trainingAgain <- validationData[inTraining,]
testing <- validationData[-inTraining,]

Retrain with Only the Chosen Predictors

I then retrain the random forest model using only those predictors.

rfModel2 <- train(classe ~ ., data = trainingAgain, method="rf", trControl=trainOptions)

Evaluating Model

Next, I analyze the accuracy of the model.

In-Sample Model

rfModel2

## Random Forest 
## 
## 8831 samples
##    8 predictor
##    5 classes: 'A', 'B', 'C', 'D', 'E' 
## 
## No pre-processing
## Resampling: Cross-Validated (3 fold) 
## Summary of sample sizes: 5887, 5887, 5888 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9696523  0.9616232
##   5     0.9669346  0.9581828
##   8     0.9624048  0.9524430
## 
## Accuracy was used to select the optimal model using  the largest value.
## The final value used for the model was mtry = 2.

trainPredict <- predict(rfModel2,trainingData)
confusionMatrix(trainPredict,trainingData$classe)$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1656    5    1    0    1
##          B   11 1118    7    3    6
##          C    3   10 1013   10    2
##          D    4    4    6  951    3
##          E    0    3    0    1 1071

confusionMatrix(trainPredict,trainingData$classe)$overall[1]

##  Accuracy 
## 0.9864154

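This gives an accuracy of 98.64%, which is very good for the in-sample model. Note that these predictions are made on trainingData (the initial 30% partition) rather than on trainingAgain, the set rfModel2 was fitted on.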
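
The accuracy above can be checked by hand from the confusion matrix: it is the number of rows on the diagonal divided by the total number of rows. A short sketch with the counts copied from the table above (inSampleTable and accuracy are illustrative names):

```r
# In-sample confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
classes <- c("A", "B", "C", "D", "E")
inSampleTable <- matrix(
  c(1656,    5,    1,   0,    1,
      11, 1118,    7,   3,    6,
       3,   10, 1013,  10,    2,
       4,    4,    6, 951,    3,
       0,    3,    0,   1, 1071),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy = correctly classified rows (the diagonal) / all rows
accuracy <- sum(diag(inSampleTable)) / sum(inSampleTable)
round(accuracy, 7)
```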

Out-of-Sample Model

testPredict <- predict(rfModel2,testing)
confusionMatrix(testPredict,testing$classe)$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1654   11    2    0    1
##          B   12 1104    7    8   14
##          C    6   17 1006   20    7
##          D    2    6   11  936    5
##          E    0    0    0    0 1055

confusionMatrix(testPredict,testing$classe)$overall[1]

##  Accuracy 
## 0.9780761

On the test set, the out-of-sample accuracy of 97.81% is really good for identifying the classe from how the exercise is performed. To be sure this is correct, a separate confusion matrix was computed for the test set to confirm the overall accuracy.
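
The same out-of-sample table also shows how well each individual class is recognized. A sketch, using the counts copied from the table above, that computes the share of each reference class predicted correctly (outSampleTable, perClassRecall, and overallAccuracy are illustrative names, and this per-class view is an addition to the overall accuracy reported here):

```r
# Out-of-sample confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
classes <- c("A", "B", "C", "D", "E")
outSampleTable <- matrix(
  c(1654,   11,    2,   0,    1,
      12, 1104,    7,   8,   14,
       6,   17, 1006,  20,    7,
       2,    6,   11, 936,    5,
       0,    0,    0,   0, 1055),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Share of each reference class that was predicted correctly
perClassRecall <- diag(outSampleTable) / colSums(outSampleTable)
round(perClassRecall, 4)

# Overall accuracy, for comparison with the value printed above
overallAccuracy <- sum(diag(outSampleTable)) / sum(outSampleTable)
round(overallAccuracy, 7)
```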

Finally, I decided to keep this prediction model, which remains consistent with an out-of-sample accuracy of 97.81%.