Introduction

“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).” The goal of this project is to predict which class (the classe variable) each observation belongs to.

Data are available online at http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har

Analysis

suppressPackageStartupMessages(library(caret))
suppressPackageStartupMessages(library(rpart))
suppressPackageStartupMessages(library(rattle))

Downloading the data

rawData <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
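
As a side note, a cached variant avoids re-downloading the large CSV on every run (a minimal sketch; the local file name pml-training.csv is my own choice, not part of the original analysis):

trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
if (!file.exists("pml-training.csv")) {
    download.file(trainUrl, destfile = "pml-training.csv")
}
rawData <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))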

Pre-processing

head(colnames(rawData), n=20)
##  [1] "X"                    "user_name"            "raw_timestamp_part_1"
##  [4] "raw_timestamp_part_2" "cvtd_timestamp"       "new_window"          
##  [7] "num_window"           "roll_belt"            "pitch_belt"          
## [10] "yaw_belt"             "total_accel_belt"     "kurtosis_roll_belt"  
## [13] "kurtosis_picth_belt"  "kurtosis_yaw_belt"    "skewness_roll_belt"  
## [16] "skewness_roll_belt.1" "skewness_yaw_belt"    "max_roll_belt"       
## [19] "max_picth_belt"       "max_yaw_belt"

Keeping only the relevant variables: the first seven columns (row index, user name, timestamps and window markers) are unrelated to the sensor measurements, and any column containing missing values (NAs) is dropped.

data <- rawData[, -c(1:7)]
data <- data[, colSums(is.na(data)) == 0]
colnames(data)
##  [1] "roll_belt"            "pitch_belt"           "yaw_belt"            
##  [4] "total_accel_belt"     "gyros_belt_x"         "gyros_belt_y"        
##  [7] "gyros_belt_z"         "accel_belt_x"         "accel_belt_y"        
## [10] "accel_belt_z"         "magnet_belt_x"        "magnet_belt_y"       
## [13] "magnet_belt_z"        "roll_arm"             "pitch_arm"           
## [16] "yaw_arm"              "total_accel_arm"      "gyros_arm_x"         
## [19] "gyros_arm_y"          "gyros_arm_z"          "accel_arm_x"         
## [22] "accel_arm_y"          "accel_arm_z"          "magnet_arm_x"        
## [25] "magnet_arm_y"         "magnet_arm_z"         "roll_dumbbell"       
## [28] "pitch_dumbbell"       "yaw_dumbbell"         "total_accel_dumbbell"
## [31] "gyros_dumbbell_x"     "gyros_dumbbell_y"     "gyros_dumbbell_z"    
## [34] "accel_dumbbell_x"     "accel_dumbbell_y"     "accel_dumbbell_z"    
## [37] "magnet_dumbbell_x"    "magnet_dumbbell_y"    "magnet_dumbbell_z"   
## [40] "roll_forearm"         "pitch_forearm"        "yaw_forearm"         
## [43] "total_accel_forearm"  "gyros_forearm_x"      "gyros_forearm_y"     
## [46] "gyros_forearm_z"      "accel_forearm_x"      "accel_forearm_y"     
## [49] "accel_forearm_z"      "magnet_forearm_x"     "magnet_forearm_y"    
## [52] "magnet_forearm_z"     "classe"
summary(data[,1:6])
##    roll_belt        pitch_belt          yaw_belt       total_accel_belt
##  Min.   :-28.90   Min.   :-55.8000   Min.   :-180.00   Min.   : 0.00   
##  1st Qu.:  1.10   1st Qu.:  1.7600   1st Qu.: -88.30   1st Qu.: 3.00   
##  Median :113.00   Median :  5.2800   Median : -13.00   Median :17.00   
##  Mean   : 64.41   Mean   :  0.3053   Mean   : -11.21   Mean   :11.31   
##  3rd Qu.:123.00   3rd Qu.: 14.9000   3rd Qu.:  12.90   3rd Qu.:18.00   
##  Max.   :162.00   Max.   : 60.3000   Max.   : 179.00   Max.   :29.00   
##   gyros_belt_x        gyros_belt_y     
##  Min.   :-1.040000   Min.   :-0.64000  
##  1st Qu.:-0.030000   1st Qu.: 0.00000  
##  Median : 0.030000   Median : 0.02000  
##  Mean   :-0.005592   Mean   : 0.03959  
##  3rd Qu.: 0.110000   3rd Qu.: 0.11000  
##  Max.   : 2.220000   Max.   : 0.64000
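
Before modelling, a quick sanity check (illustrative, not part of the original output) confirms the cleaning worked: no NAs remain and no surviving predictor is expected to be degenerate.

sum(is.na(data))                       # should be 0 after dropping incomplete columns
nearZeroVar(data, saveMetrics = FALSE) # expected integer(0): no near-zero-variance predictors
dim(data)                              # 19622 rows, 53 columns (52 predictors + classe)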

Data partition

Partitioning the cleaned data into a training set (80%) and a testing set (20%)

inTrain <- createDataPartition(y = data$classe, p = 0.8, list = FALSE)
train <- data[inTrain, ]
test <- data[-inTrain, ]
dim(train); dim(test)
## [1] 15699    53
## [1] 3923   53
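
createDataPartition samples within each level of classe, so the class proportions are preserved in both subsets. This can be verified (illustrative):

round(prop.table(table(train$classe)), 3) # proportions in the training set
round(prop.table(table(test$classe)), 3)  # should match closely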

Predictive models

The strategy is to train several classification algorithms, namely a random forest, a generalized boosted model (boosting) and a decision tree, and to compare their accuracy on the test set.
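
Since each model follows the same fit/predict/evaluate pattern, the pipeline could be expressed once as a helper (a hypothetical sketch; fitAndScore is my own name, and the three models below are fit explicitly instead):

fitAndScore <- function(method, trainData, testData, ctrl) {
    fit <- train(classe ~ ., data = trainData, method = method, trControl = ctrl)
    pred <- predict(fit, testData)
    confusionMatrix(pred, testData$classe)$overall["Accuracy"]
}
# e.g. fitAndScore("rf", train, test, trainControl(method = "cv", number = 3))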

Model 1: Random forest

set.seed(239)
trControl <- trainControl(method = "cv", number = 3) # 3-fold cross-validation
model1RF <- train(classe ~ ., data = train, method = "rf", trControl = trControl)
model1RFPred <- predict(model1RF, test)
model1RFConf <- confusionMatrix(model1RFPred, test$classe)
model1RFConf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1116    7    0    0    0
##          B    0  752    4    0    0
##          C    0    0  679    2    0
##          D    0    0    1  641    1
##          E    0    0    0    0  720
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9962          
##                  95% CI : (0.9937, 0.9979)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9952          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9908   0.9927   0.9969   0.9986
## Specificity            0.9975   0.9987   0.9994   0.9994   1.0000
## Pos Pred Value         0.9938   0.9947   0.9971   0.9969   1.0000
## Neg Pred Value         1.0000   0.9978   0.9985   0.9994   0.9997
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2845   0.1917   0.1731   0.1634   0.1835
## Detection Prevalence   0.2863   0.1927   0.1736   0.1639   0.1835
## Balanced Accuracy      0.9988   0.9948   0.9960   0.9981   0.9993
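
To see which sensor signals drive the random forest, caret's variable importance can be inspected (illustrative; not part of the original output):

varImp(model1RF)                 # scaled importance of the 52 predictors
plot(varImp(model1RF), top = 10) # ten most influential signals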

Model 2: Boosting

set.seed(089)
trControl2 <- trainControl(method = "cv", number = 3)
model2B <- train(classe ~ ., data = train, method = "gbm", trControl = trControl2,
                 verbose = FALSE) # suppress gbm's per-iteration log
model2BPred <- predict(model2B, test)
model2BConf <- confusionMatrix(model2BPred, test$classe)
model2BConf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1100   29    0    1    2
##          B   12  713   28    2   10
##          C    3   14  648   13    8
##          D    0    2    6  620   11
##          E    1    1    2    7  690
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9613          
##                  95% CI : (0.9547, 0.9671)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.951           
##  Mcnemar's Test P-Value : 0.0007146       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9857   0.9394   0.9474   0.9642   0.9570
## Specificity            0.9886   0.9836   0.9883   0.9942   0.9966
## Pos Pred Value         0.9717   0.9320   0.9446   0.9703   0.9843
## Neg Pred Value         0.9943   0.9854   0.9889   0.9930   0.9904
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2804   0.1817   0.1652   0.1580   0.1759
## Detection Prevalence   0.2886   0.1950   0.1749   0.1629   0.1787
## Balanced Accuracy      0.9871   0.9615   0.9678   0.9792   0.9768
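
caret tunes gbm over a small grid of boosting iterations and tree depths; the selected combination can be inspected (illustrative):

model2B$bestTune # winning values of n.trees, interaction.depth, shrinkage, n.minobsinnode
plot(model2B)    # CV accuracy across the tuning grid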

Model 3: Classification tree

model3CT <- rpart(classe ~ ., data = train, method = "class")
model3CTPred <- predict(model3CT, test, type = "class")
model3CTConf <- confusionMatrix(model3CTPred, test$classe)
model3CTConf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 998 142  18  36   9
##          B  41 421  66  66  64
##          C  18 103 533  47  61
##          D  52  49  43 435  47
##          E   7  44  24  59 540
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7461          
##                  95% CI : (0.7322, 0.7597)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6781          
##  Mcnemar's Test P-Value : 3.613e-15       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8943   0.5547   0.7792   0.6765   0.7490
## Specificity            0.9270   0.9251   0.9293   0.9418   0.9582
## Pos Pred Value         0.8296   0.6398   0.6995   0.6949   0.8012
## Neg Pred Value         0.9566   0.8965   0.9522   0.9369   0.9443
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2544   0.1073   0.1359   0.1109   0.1376
## Detection Prevalence   0.3067   0.1677   0.1942   0.1596   0.1718
## Balanced Accuracy      0.9106   0.7399   0.8543   0.8091   0.8536

Plotting the decision tree of model 3.

fancyRpartPlot(model3CT)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
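
If the labels are unreadable at this depth, one option (a sketch, assuming the rpart.plot package is available; it is not used in the original analysis) is to prune weak splits before plotting:

library(rpart.plot)
prunedCT <- prune(model3CT, cp = 0.02)   # keep only the stronger splits
rpart.plot(prunedCT, type = 2, extra = 104)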

Model prediction summary

modelSummary <- data.frame(
    modelName = c("Random forest", "Boosting", "Decision tree"),
    accuracy = c(model1RFConf$overall[1], model2BConf$overall[1], model3CTConf$overall[1])
)
modelSummary
##       modelName  accuracy
## 1 Random forest 0.9961764
## 2      Boosting 0.9612541
## 3 Decision tree 0.7461127

Conclusion

Both the random forest and the boosting model show near-perfect prediction on the testing dataset, with the random forest slightly ahead (99.6% vs. 96.1% accuracy); the plain decision tree lags far behind (74.6%). The random forest is therefore selected, with an expected out-of-sample error of about 1 - 0.996 = 0.4%.
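
As a natural next step, the selected random forest can be applied to the 20 ungraded quiz cases distributed alongside the training file (a sketch; pml-testing.csv is the companion file from the same source and lacks the classe column):

quizRaw <- read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                    na.strings = c("NA", "", "#DIV/0!"))
quiz <- quizRaw[, intersect(colnames(data), colnames(quizRaw))] # keep the 52 predictors
predict(model1RF, quiz)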

References

Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013.