The data for this experiment is the dataset “weightlift.csv”, which consists of readings from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. Six young, healthy participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Read more: http://groupware.les.inf.puc-rio.br/har#dataset
To start, I read more about the different variables to get a better sense of which ones could matter most when building a prediction model. First, I removed the bookkeeping columns (X, new_window, num_window), which record row and time-window information rather than sensor readings. I then removed other variables that, to my knowledge, seemed unlikely to affect the prediction model: the x-axis belt readings, the gyroscope, accelerometer and magnetometer readings from the arm, and the z-axis gyroscope, accelerometer and magnetometer readings from the forearm.
colnames(Data)
## [1] "X" "new_window" "num_window"
## [4] "roll_belt" "pitch_belt" "yaw_belt"
## [7] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [10] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [13] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [16] "magnet_belt_z" "roll_arm" "pitch_arm"
## [19] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [22] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [25] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [28] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [31] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [34] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [37] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [40] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [43] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [46] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [49] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [52] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [55] "magnet_forearm_z" "classe"
Data <- Data[,-c(1:3,8,11,14,21:29,55,52,49)]
colnames(Data)
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_y" "gyros_belt_z"
## [7] "accel_belt_y" "accel_belt_z" "magnet_belt_y"
## [10] "magnet_belt_z" "roll_arm" "pitch_arm"
## [13] "yaw_arm" "total_accel_arm" "roll_dumbbell"
## [16] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [19] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [22] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [25] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [28] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [31] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [34] "accel_forearm_x" "accel_forearm_y" "magnet_forearm_x"
## [37] "magnet_forearm_y" "classe"
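Dropping columns by position, as above, breaks silently if the column order ever changes. A minimal sketch of the same idea using column names instead; the `toy` data frame and its values are made up for illustration, and only the column names come from the listing above.

```r
# Toy data frame standing in for the real data (values are invented).
toy <- data.frame(
  X = 1:3,
  new_window = c("no", "no", "yes"),
  num_window = c(11, 11, 12),
  roll_belt = c(1.41, 1.41, 1.42),
  gyros_belt_x = c(0.00, 0.02, 0.00),
  classe = c("A", "A", "A")
)

# Columns to drop, written out by name instead of by position.
dropCols <- c("X", "new_window", "num_window", "gyros_belt_x")

# setdiff() keeps the remaining columns in their original order
# and simply ignores any name that is not present.
toy <- toy[, setdiff(colnames(toy), dropCols), drop = FALSE]
colnames(toy)   # "roll_belt" "classe"
```

The same pattern would apply to the full list of removed variables, at the cost of a longer `dropCols` vector.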
I set up 3-fold cross-validation.
library(caret)
trainOptions <- trainControl(method = "cv", number = 3)
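To make the resampling scheme concrete, here is a simplified base R sketch of what k-fold cross-validation does with the rows. It is not caret's implementation (caret builds its folds internally and balances them by class); the row count `n` and the seed are arbitrary assumptions.

```r
set.seed(1)   # arbitrary seed, only to make this sketch reproducible
n <- 12       # hypothetical number of rows
k <- 3        # number of folds, matching number = 3 above

# Assign every row to one of k folds, in random order and as evenly as possible.
foldId <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  heldOut <- which(foldId == fold)   # rows used to score the model
  fitRows <- which(foldId != fold)   # rows used to fit the model
  cat("fold", fold, ": fit on", length(fitRows),
      "rows, score on", length(heldOut), "rows\n")
}
```

With 12 rows and 3 folds, each pass fits on 8 rows and scores on the remaining 4, and every row is held out exactly once.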
Next, I partition the data, using only 30% of it for training and holding out the rest.
set.seed(444)
inTraining <- createDataPartition(y=Data$classe, p=0.3, list=FALSE)
trainingData <- Data[inTraining,]
validationData <- Data[-inTraining,]
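For a factor outcome, createDataPartition samples within each class so that the class proportions are preserved. A rough base R sketch of that idea follows; it is a simplified stand-in, not caret's code, and the `labels` vector is hypothetical.

```r
set.seed(444)
# Hypothetical class labels: 10 of A, 20 of B, 30 of C.
labels <- factor(rep(c("A", "B", "C"), times = c(10, 20, 30)))
p <- 0.3   # fraction of each class to put in the training set

# Within each class, sample a fraction p of that class's row indices.
inTrain <- unname(unlist(lapply(
  split(seq_along(labels), labels),
  function(idx) idx[sample.int(length(idx), size = round(p * length(idx)))]
)))

table(labels[inTrain])    # A: 3, B: 6, C: 9
table(labels[-inTrain])   # A: 7, B: 14, C: 21
```

Both parts keep the original 1:2:3 class ratio, which is the property that matters when some classes are rarer than others.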
I used a random forest (caret's "rf" method, which relies on the randomForest package) to fit a first model on the variables I decided to keep.
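# Fit on the 30% training partition only, so that held-out rows do not leak into the model.
rfModel <- train(classe ~ ., data = trainingData, method = "rf", trControl = trainOptions)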
After fitting the first model on the training data, I looked at which variables had the most influence on it, so that I could drop the ones that contribute little and refit a simpler model. I decided to keep the 8 most important variables (those with a scaled importance above 50).
Vpriority <- varImp(rfModel)
print(Vpriority)
## rf variable importance
##
## only 20 most important variables shown (out of 37)
##
## Overall
## roll_belt 100.00
## yaw_belt 87.30
## magnet_dumbbell_z 74.93
## pitch_belt 65.53
## pitch_forearm 61.71
## magnet_dumbbell_y 61.15
## roll_forearm 53.67
## magnet_dumbbell_x 52.35
## accel_dumbbell_y 42.38
## magnet_belt_z 40.62
## roll_dumbbell 38.63
## accel_belt_z 37.95
## magnet_belt_y 37.11
## accel_dumbbell_z 34.41
## accel_forearm_x 31.88
## roll_arm 30.60
## yaw_arm 27.66
## gyros_belt_z 27.11
## accel_dumbbell_x 24.52
## yaw_dumbbell 24.27
# Rank the variables by importance (named so as not to mask base::sort).
importanceOrder <- order(Vpriority$importance$Overall, decreasing = TRUE)
keep <- row.names(Vpriority$importance)[importanceOrder[1:8]]
validationData <- validationData[,c(keep,"classe")]
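The variables kept were roll_belt, yaw_belt, magnet_dumbbell_z, pitch_belt, pitch_forearm, magnet_dumbbell_y, roll_forearm and magnet_dumbbell_x.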
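The selection step can be tried out on a small importance table. The sketch below uses four of the scores printed above, deliberately out of order, and compares ranking by position with filtering by a threshold; the names `imp`, `topTwo` and `aboveFifty` are only illustrative.

```r
# A few importance scores copied from the printed table, in shuffled order.
imp <- data.frame(
  Overall = c(42.38, 100.00, 52.35, 87.30),
  row.names = c("accel_dumbbell_y", "roll_belt", "magnet_dumbbell_x", "yaw_belt")
)

# Option 1: keep a fixed number of top-ranked variables.
ranking <- order(imp$Overall, decreasing = TRUE)
topTwo <- row.names(imp)[ranking[1:2]]
topTwo       # "roll_belt" "yaw_belt"

# Option 2: keep everything above a score threshold.
aboveFifty <- row.names(imp)[imp$Overall > 50]
aboveFifty   # "roll_belt" "magnet_dumbbell_x" "yaw_belt"
```

On the full table the two options agree: the eighth variable scores 52.35 and the ninth 42.38, so "top 8" and "above 50" select the same set.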
I then partition the remaining data (validationData) again, this time putting 60% of it in the training set.
inTraining <- createDataPartition(y=validationData$classe, p=0.6, list=FALSE)
trainingAgain <- validationData[inTraining,]
testing <- validationData[-inTraining,]
I then retrain the random forest model on this reduced set of variables.
rfModel2 <- train(classe ~ ., data = trainingAgain, method="rf", trControl=trainOptions)
Next, I analyze the accuracy of the model.
rfModel2
## Random Forest
##
## 8831 samples
## 8 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (3 fold)
## Summary of sample sizes: 5887, 5887, 5888
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9696523 0.9616232
## 5 0.9669346 0.9581828
## 8 0.9624048 0.9524430
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
trainPredict <- predict(rfModel2,trainingData)
confusionMatrix(trainPredict,trainingData$classe)$table
## Reference
## Prediction A B C D E
## A 1656 5 1 0 1
## B 11 1118 7 3 6
## C 3 10 1013 10 2
## D 4 4 6 951 3
## E 0 3 0 1 1071
confusionMatrix(trainPredict,trainingData$classe)$overall[1]
## Accuracy
## 0.9864154
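This gives an accuracy of about 98.64% on trainingData, which is very good. Note that rfModel2 was fit on trainingAgain, so the rows in trainingData were not used to fit it and this is not strictly an in-sample figure; those rows did, however, inform the variable ranking.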
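The accuracy reported above can be recomputed from the confusion table directly: it is the share of observations on the diagonal. A short base R check, with the counts copied from the table above:

```r
# Confusion table from above: rows are predictions, columns are reference classes.
cm <- matrix(
  c(1656,    5,    1,   0,    1,
      11, 1118,    7,   3,    6,
       3,   10, 1013,  10,    2,
       4,    4,    6, 951,    3,
       0,    3,    0,   1, 1071),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Correct predictions sit on the diagonal.
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 4)   # 0.9864
```

Here 5809 of the 5889 observations are classified correctly.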
testPredict <- predict(rfModel2,testing)
confusionMatrix(testPredict,testing$classe)$table
## Reference
## Prediction A B C D E
## A 1654 11 2 0 1
## B 12 1104 7 8 14
## C 6 17 1006 20 7
## D 2 6 11 936 5
## E 0 0 0 0 1055
confusionMatrix(testPredict,testing$classe)$overall[1]
## Accuracy
## 0.9780761
On the held-out testing set, the model reaches an out-of-sample accuracy of about 97.81%, which is really good for identifying the classe from how the exercise is performed. The confusion matrix above was computed as a second check on that overall accuracy.
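Overall accuracy hides how the errors are spread across classes. As a rough sketch, per-class sensitivity (recall) and precision can be derived from the test confusion table in base R; the counts are copied from the table above, and the variable names are only illustrative. caret reports comparable per-class statistics in the `byClass` element of the `confusionMatrix` result.

```r
# Test-set confusion table: rows are predictions, columns are reference classes.
cm <- matrix(
  c(1654,   11,    2,   0,    1,
      12, 1104,    7,   8,   14,
       6,   17, 1006,  20,    7,
       2,    6,   11, 936,    5,
       0,    0,    0,   0, 1055),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

# Sensitivity: of the observations truly in a class, the share predicted as that class.
sensitivity <- diag(cm) / colSums(cm)

# Precision: of the observations predicted as a class, the share truly in that class.
precision <- diag(cm) / rowSums(cm)

round(rbind(sensitivity, precision), 4)
```

For example, no observation from another class was predicted as E, so the precision for class E is exactly 1, while 27 true E observations were assigned to other classes.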
Finally, I decided to keep this prediction model, which remains consistent at about 97.81% accuracy on data it was not trained on.