“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).” The goal of this project is to predict which class (the classe variable) each observation belongs to, using the accompanying sensor measurements.
Data are available online at http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har
suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(rattle))
# Read the training data, treating empty strings and Excel division errors as NA
rawData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
head(colnames(rawData), n=20)
## [1] "X" "user_name" "raw_timestamp_part_1"
## [4] "raw_timestamp_part_2" "cvtd_timestamp" "new_window"
## [7] "num_window" "roll_belt" "pitch_belt"
## [10] "yaw_belt" "total_accel_belt" "kurtosis_roll_belt"
## [13] "kurtosis_picth_belt" "kurtosis_yaw_belt" "skewness_roll_belt"
## [16] "skewness_roll_belt.1" "skewness_yaw_belt" "max_roll_belt"
## [19] "max_picth_belt" "max_yaw_belt"
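Before cleaning, the shape of the raw data can be checked (an added sanity check, not in the original output; the raw file carries many sparse summary-statistic columns such as the kurtosis and skewness variables above):

dim(rawData)   # 19622 observations of 160 variables in the raw file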
Keeping only the relevant variables: removing the first seven columns (row index, user name, timestamps and window markers, which are unrelated to the sensor measurements) and removing every column that contains missing values (NAs).
data <- rawData[, -c(1:7)]                 # drop identifier and timestamp columns
data <- data[, colSums(is.na(data)) == 0]  # keep only complete columns
colnames(data)
## [1] "roll_belt" "pitch_belt" "yaw_belt"
## [4] "total_accel_belt" "gyros_belt_x" "gyros_belt_y"
## [7] "gyros_belt_z" "accel_belt_x" "accel_belt_y"
## [10] "accel_belt_z" "magnet_belt_x" "magnet_belt_y"
## [13] "magnet_belt_z" "roll_arm" "pitch_arm"
## [16] "yaw_arm" "total_accel_arm" "gyros_arm_x"
## [19] "gyros_arm_y" "gyros_arm_z" "accel_arm_x"
## [22] "accel_arm_y" "accel_arm_z" "magnet_arm_x"
## [25] "magnet_arm_y" "magnet_arm_z" "roll_dumbbell"
## [28] "pitch_dumbbell" "yaw_dumbbell" "total_accel_dumbbell"
## [31] "gyros_dumbbell_x" "gyros_dumbbell_y" "gyros_dumbbell_z"
## [34] "accel_dumbbell_x" "accel_dumbbell_y" "accel_dumbbell_z"
## [37] "magnet_dumbbell_x" "magnet_dumbbell_y" "magnet_dumbbell_z"
## [40] "roll_forearm" "pitch_forearm" "yaw_forearm"
## [43] "total_accel_forearm" "gyros_forearm_x" "gyros_forearm_y"
## [46] "gyros_forearm_z" "accel_forearm_x" "accel_forearm_y"
## [49] "accel_forearm_z" "magnet_forearm_x" "magnet_forearm_y"
## [52] "magnet_forearm_z" "classe"
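A quick check (added here, not in the original run) confirms that the cleaning left no missing values and that 52 sensor predictors remain alongside classe:

sum(is.na(data))   # should be 0 after dropping incomplete columns
dim(data)          # 53 columns: 52 predictors plus classe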
summary(data[,1:6])
## roll_belt pitch_belt yaw_belt total_accel_belt
## Min. :-28.90 Min. :-55.8000 Min. :-180.00 Min. : 0.00
## 1st Qu.: 1.10 1st Qu.: 1.7600 1st Qu.: -88.30 1st Qu.: 3.00
## Median :113.00 Median : 5.2800 Median : -13.00 Median :17.00
## Mean : 64.41 Mean : 0.3053 Mean : -11.21 Mean :11.31
## 3rd Qu.:123.00 3rd Qu.: 14.9000 3rd Qu.: 12.90 3rd Qu.:18.00
## Max. :162.00 Max. : 60.3000 Max. : 179.00 Max. :29.00
## gyros_belt_x gyros_belt_y
## Min. :-1.040000 Min. :-0.64000
## 1st Qu.:-0.030000 1st Qu.: 0.00000
## Median : 0.030000 Median : 0.02000
## Mean :-0.005592 Mean : 0.03959
## 3rd Qu.: 0.110000 3rd Qu.: 0.11000
## Max. : 2.220000 Max. : 0.64000
Partitioning the data into a training set (80%) and a testing set (20%). Note that no seed is set before createDataPartition, so the exact split is not reproducible across runs.
inTrain <- createDataPartition(y = data$classe, p = 0.8, list = FALSE)
train <- data[inTrain, ]
test <- data[-inTrain, ]
dim(train); dim(test)
## [1] 15699 53
## [1] 3923 53
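createDataPartition samples within each level of classe, so the split is stratified. The class distribution in the training set can be verified as follows (an added check, not in the original output):

round(prop.table(table(train$classe)), 3)   # class A is the most frequent (~28%), matching the No Information Rate reported below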
The strategy is to try multiple classification algorithms, namely random forest, generalized boosted models (boosting) and a decision tree, and to compare their accuracies on the test set. Each caret model is tuned with 3-fold cross-validation on the training set, and the accuracy on the held-out test set then serves as an estimate of the out-of-sample error.
set.seed(239)
trControl <- trainControl(method = "cv", number = 3)   # 3-fold cross-validation
model1RF <- train(classe ~ ., data = train, method = "rf", trControl = trControl)
model1RFPred <- predict(model1RF, test)
model1RFConf <- confusionMatrix(model1RFPred, test$classe)
model1RFConf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1116 7 0 0 0
## B 0 752 4 0 0
## C 0 0 679 2 0
## D 0 0 1 641 1
## E 0 0 0 0 720
##
## Overall Statistics
##
## Accuracy : 0.9962
## 95% CI : (0.9937, 0.9979)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9952
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9908 0.9927 0.9969 0.9986
## Specificity 0.9975 0.9987 0.9994 0.9994 1.0000
## Pos Pred Value 0.9938 0.9947 0.9971 0.9969 1.0000
## Neg Pred Value 1.0000 0.9978 0.9985 0.9994 0.9997
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2845 0.1917 0.1731 0.1634 0.1835
## Detection Prevalence 0.2863 0.1927 0.1736 0.1639 0.1835
## Balanced Accuracy 0.9988 0.9948 0.9960 0.9981 0.9993
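To see which predictors drive the random forest, caret's variable importance measure can be inspected (an optional addition, not run in the original analysis):

varImp(model1RF)   # ranks the 52 predictors by their importance in the forest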
set.seed(089)
trControl2 <- trainControl(method = "cv", number = 3)   # same 3-fold cross-validation
model2B <- train(classe ~ ., data = train, method = "gbm", trControl = trControl2, verbose = FALSE)   # verbose = FALSE silences gbm's per-iteration log
model2BPred <- predict(model2B, test)
model2BConf <- confusionMatrix(model2BPred, test$classe)
model2BConf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1100 29 0 1 2
## B 12 713 28 2 10
## C 3 14 648 13 8
## D 0 2 6 620 11
## E 1 1 2 7 690
##
## Overall Statistics
##
## Accuracy : 0.9613
## 95% CI : (0.9547, 0.9671)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.951
## Mcnemar's Test P-Value : 0.0007146
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9857 0.9394 0.9474 0.9642 0.9570
## Specificity 0.9886 0.9836 0.9883 0.9942 0.9966
## Pos Pred Value 0.9717 0.9320 0.9446 0.9703 0.9843
## Neg Pred Value 0.9943 0.9854 0.9889 0.9930 0.9904
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2804 0.1817 0.1652 0.1580 0.1759
## Detection Prevalence 0.2886 0.1950 0.1749 0.1629 0.1787
## Balanced Accuracy 0.9871 0.9615 0.9678 0.9792 0.9768
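caret tunes the boosted model over a small default grid; the tuning parameters it selected can be listed as follows (an added inspection, not in the original run):

model2B$bestTune   # chosen number of trees, interaction depth, shrinkage and minimum node size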
model3CT <- rpart(classe ~ ., data = train, method = "class")   # single classification tree
model3CTPred <- predict(model3CT, test, type = "class")
model3CTConf <- confusionMatrix(model3CTPred, test$classe)
model3CTConf
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 998 142 18 36 9
## B 41 421 66 66 64
## C 18 103 533 47 61
## D 52 49 43 435 47
## E 7 44 24 59 540
##
## Overall Statistics
##
## Accuracy : 0.7461
## 95% CI : (0.7322, 0.7597)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6781
## Mcnemar's Test P-Value : 3.613e-15
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8943 0.5547 0.7792 0.6765 0.7490
## Specificity 0.9270 0.9251 0.9293 0.9418 0.9582
## Pos Pred Value 0.8296 0.6398 0.6995 0.6949 0.8012
## Neg Pred Value 0.9566 0.8965 0.9522 0.9369 0.9443
## Prevalence 0.2845 0.1935 0.1744 0.1639 0.1838
## Detection Rate 0.2544 0.1073 0.1359 0.1109 0.1376
## Detection Prevalence 0.3067 0.1677 0.1942 0.1596 0.1718
## Balanced Accuracy 0.9106 0.7399 0.8543 0.8091 0.8536
Showing the decision tree of model 3. The warning below simply indicates that the full tree has too many nodes for every label to fit legibly on the plot.
fancyRpartPlot(model3CT)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
modelSummary <- data.frame(
  modelName = c("Random forest", "Boosting", "Decision tree"),
  accuracy  = c(model1RFConf$overall[1],
                model2BConf$overall[1],
                model3CTConf$overall[1])
)
modelSummary
## modelName accuracy
## 1 Random forest 0.9961764
## 2 Boosting 0.9612541
## 3 Decision tree 0.7461127
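The expected out-of-sample error of the best model follows directly from its test-set accuracy (an added line; given the accuracy above it comes to about 0.4%):

1 - unname(model1RFConf$overall["Accuracy"])   # estimated out-of-sample error rate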
Both the random forest (99.6% accuracy) and the boosting model (96.1%) predict the testing set very well, with the random forest slightly ahead; the single decision tree trails at about 74.6%. The random forest is therefore the preferred model, with an estimated out-of-sample error of roughly 0.4%.
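As a final step, the selected model can be applied to the 20 unlabeled cases supplied with the assignment (a sketch, assuming the companion file pml-testing.csv is published alongside the training file):

quizData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
predict(model1RF, quizData)   # one predicted classe per test case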
Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.