Executive Summary

In this project we use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. A description of the experiment can be found at http://groupware.les.inf.puc-rio.br/har under the section ‘Weight Lifting Exercise Dataset’. The data set was provided by Velloso et al. (1).

The training data set was split into a training, a test and a validation subset. We trained four different models on the training subset and tested each of them on the test subset. Based on the results of the confusionMatrix() function we chose the random forest model as our final model, with an accuracy of 0.9879 and a 95% confidence interval of (0.984, 0.991). This final model was validated on our validation subset, showing a drop in accuracy of only 0.0017, or 0.17%. Its accuracy thus remained very high at 0.9862, with a 95% confidence interval of (0.9829, 0.9891).

Getting The Data

Six participants were asked to perform barbell lifts under 5 different conditions, classified as groups A to E, all under the supervision of an experienced observer.

A : exactly according to the specification (correctly)
B : throwing the elbows to the front (incorrectly)
C : lifting the dumbbell only halfway (incorrectly)
D : lowering the dumbbell only halfway (incorrectly)
E : throwing the hips to the front (incorrectly)

Two data sets are downloaded from the course source: a training set and a quiz set for the Practical Machine Learning prediction quiz on Coursera.

urlTrainingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
urlTestingData <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
date_downloaded <- date() #record when the files were retrieved
download.file(urlTrainingData, destfile = "training.csv")
download.file(urlTestingData, destfile = "testing.csv")
training <- read.csv("training.csv", strip.white = TRUE, na.strings = c("#DIV/0!", "NA")) #treat spreadsheet division errors as missing values
dim(training) #[1] 19622 160
quizSet <- read.csv("testing.csv", strip.white=TRUE, na.strings = c("#DIV/0!", "NA")) #Prediction Quiz

Cleaning The Data

The number of variables in the training set is reduced so that only variables from the accelerometers on the belt, forearm, arm, and dumbbell are retained and used to build our prediction models. Moreover, further variables that could be significant in predicting the outcome are retained; the outcome, named ‘classe’ in the data set, represents one of the five groups (A - E). Finally, of the original 160 variables, 29 predictors and 1 outcome are used to build our models.

summary(training)

#After inspecting the summary, write down all the variables needed for this assignment.
asked <- c("classe", "roll_belt","pitch_belt", "yaw_belt" ,"accel_belt_x","accel_belt_y", "accel_belt_z", "magnet_belt_x", "magnet_belt_y", "magnet_belt_z", "accel_arm_x", "accel_arm_y", "accel_arm_z", "roll_forearm", "pitch_forearm", "yaw_forearm" ,"accel_forearm_x", "accel_forearm_y", "accel_forearm_z", "magnet_forearm_x", "magnet_forearm_y","magnet_forearm_z", "roll_dumbbell", "pitch_dumbbell" ,"accel_dumbbell_x", "accel_dumbbell_y", "accel_dumbbell_z", "magnet_dumbbell_x", "magnet_dumbbell_y", "magnet_dumbbell_z")

training <- training[,asked]

sum(!complete.cases(training)) #0, thus only complete cases
dim(training) #[1] 19622    30

With the cleaned data set we will build four different models.

  1. Random Forest model
  2. Linear Discriminant model
  3. Boosting model
  4. Predictive Tree model

These models are built with the caret package and the appropriate methods. After building the four models, their accuracy and results are measured with the confusionMatrix() function, and those results determine our final model. The final model is then tested on a validation subset.

Prediction Models

The number of observations in the original training data set allows us to construct a training subset, a test subset and a validation subset. The training subset is used to train the four models, which are then tested on the test subset. Only the final model is validated. Since the in-sample error will always be lower than the out-of-sample error, we expect the models to perform worse on new sample sets. The expected drop in accuracy (where accuracy = 1 - out-of-sample error) is given for each model as a rough personal estimate. With five classes, an accuracy of 20% would mean the model is no better than uniform random guessing.
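
As a quick check of that chance baseline, the class distribution can be inspected directly; the largest class share corresponds approximately to the ‘No Information Rate’ reported by confusionMatrix() below.

prop.table(table(training$classe)) #class shares; the maximum (~0.284 for A) is the no information rate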

set.seed(12121) #in order to be reproducible for others

library(caret)
library(randomForest)

inBuild <- createDataPartition(y = training$classe, p = 0.7, list = FALSE) #70% for model building, 30% for validation

validation <- training[-inBuild,]
buildData <- training[inBuild,]

inTrain <- createDataPartition(y = buildData$classe, p = 0.7, list = FALSE) #split the build data into training and test subsets

trainSet <- buildData[inTrain,]
testSet <- buildData[-inTrain,]

1. Random Forest Model

This model achieves a very good accuracy on the test subset, but it takes a long time to train. Due to possible overfitting, the out-of-sample error could be considerably higher, so the accuracy might drop to around 90% or lower on new data. The varImp() function shows us the 20 most important variables for its predictions.

setting <- trainControl(allowParallel = TRUE, method = "cv", number = 4) #to work faster, but still very slow :-(
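#Note (a sketch, assuming the doParallel package is installed): allowParallel
#only takes effect if a parallel backend has been registered first.
library(doParallel)
cl <- makeCluster(detectCores() - 1) #leave one core free for the system
registerDoParallel(cl) #call stopCluster(cl) once training has finished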

modRF <- train(classe ~ ., data = trainSet, method = "rf", trControl = setting)

predictionsRF <- predict(modRF, newdata = testSet)

confusionMatrix(predictionsRF, testSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1164   16    0    0    0
##          B    5  770    1    0    0
##          C    1   10  714    4    1
##          D    0    1    3  670    6
##          E    1    0    0    1  750
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9879        
##                  95% CI : (0.984, 0.991)
##     No Information Rate : 0.2844        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9846        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9940   0.9661   0.9944   0.9926   0.9908
## Specificity            0.9946   0.9982   0.9953   0.9971   0.9994
## Pos Pred Value         0.9864   0.9923   0.9781   0.9853   0.9973
## Neg Pred Value         0.9976   0.9919   0.9988   0.9985   0.9979
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2827   0.1870   0.1734   0.1627   0.1821
## Detection Prevalence   0.2865   0.1884   0.1773   0.1651   0.1826
## Balanced Accuracy      0.9943   0.9822   0.9949   0.9948   0.9951
varImp(modRF)
## rf variable importance
## 
##   only 20 most important variables shown (out of 29)
## 
##                   Overall
## roll_belt         100.000
## pitch_forearm      59.798
## yaw_belt           55.207
## pitch_belt         49.762
## roll_forearm       44.370
## magnet_dumbbell_y  43.342
## magnet_dumbbell_z  42.815
## accel_dumbbell_y   26.249
## magnet_dumbbell_x  18.320
## roll_dumbbell      17.455
## accel_dumbbell_z   16.929
## accel_forearm_x    16.811
## magnet_belt_z      16.294
## accel_belt_z       15.300
## magnet_forearm_z   14.392
## magnet_belt_y      12.717
## magnet_belt_x      11.324
## accel_arm_x        10.933
## accel_forearm_z     7.406
## magnet_forearm_x    5.074

2. Linear Discriminant Model

This model achieves a much lower accuracy than the random forest model and could only be helpful in combination with our first model; a sketch of such a combination is given at the end of the boosting section below. The expected out-of-sample error on a new sample could be around 40%, and if unlucky, above 50%.

library(MASS)
modLDA <- train(classe ~ ., data = trainSet, method = "lda")

predictionsLDA <- predict(modLDA, newdata = testSet)

confusionMatrix(predictionsLDA, testSet$classe) #Accuracy 0.6455 [0.6306, 0.6601]
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 873 158 116  47  32
##          B  79 427  46  78 124
##          C 118 123 468  59  63
##          D  89  44  79 447  95
##          E  12  45   9  44 443
## 
## Overall Statistics
##                                           
##                Accuracy : 0.6455          
##                  95% CI : (0.6306, 0.6601)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.5512          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.7455   0.5358   0.6518   0.6622   0.5852
## Specificity            0.8802   0.9015   0.8932   0.9108   0.9673
## Pos Pred Value         0.7121   0.5663   0.5632   0.5928   0.8011
## Neg Pred Value         0.8970   0.8900   0.9239   0.9322   0.9119
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2120   0.1037   0.1136   0.1085   0.1076
## Detection Prevalence   0.2977   0.1831   0.2018   0.1831   0.1343
## Balanced Accuracy      0.8129   0.7186   0.7725   0.7865   0.7762

3. Boosting Model

This model provides very good results and a very high accuracy. By combining it with our first model we might improve the accuracy on new samples; see the stacking sketch after the confusion matrix below. The expected accuracy on new samples will be lower, giving a higher out-of-sample error, but we cannot estimate by how much.

library(plyr)
library(survival)
library(splines)
library(parallel)
library(ggplot2)
library(gbm)

#takes a lot of time
modB <- train(classe ~ ., data = trainSet, method = "gbm", verbose = FALSE)

predictionsB <- predict(modB, newdata = testSet)

confusionMatrix(predictionsB, testSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1143   32    0    2    1
##          B   17  734   16    6    6
##          C    5   27  693   19    5
##          D    1    3    6  646   12
##          E    5    1    3    2  733
## 
## Overall Statistics
##                                           
##                Accuracy : 0.959           
##                  95% CI : (0.9524, 0.9648)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9481          
##  Mcnemar's Test P-Value : 0.0001592       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9761   0.9210   0.9652   0.9570   0.9683
## Specificity            0.9881   0.9864   0.9835   0.9936   0.9967
## Pos Pred Value         0.9703   0.9422   0.9252   0.9671   0.9852
## Neg Pred Value         0.9905   0.9811   0.9926   0.9916   0.9929
## Prevalence             0.2844   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2776   0.1782   0.1683   0.1569   0.1780
## Detection Prevalence   0.2861   0.1892   0.1819   0.1622   0.1807
## Balanced Accuracy      0.9821   0.9537   0.9744   0.9753   0.9825
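
The combination mentioned above was not pursued for this report, but a minimal sketch of how it could look is given here: stack the test-set predictions of the three models into a new data frame, train a combiner on top of them, and evaluate the stacked model on the validation subset. All object names below (stackedDF, modStack, validationStack) are hypothetical additions.

stackedDF <- data.frame(rf = predictionsRF, lda = predictionsLDA,
                        gbm = predictionsB, classe = testSet$classe)
modStack <- train(classe ~ ., data = stackedDF, method = "rf") #combiner model

validationStack <- data.frame(rf  = predict(modRF,  validation),
                              lda = predict(modLDA, validation),
                              gbm = predict(modB,   validation))
confusionMatrix(predict(modStack, validationStack), validation$classe)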

4. Predictive Tree Model

This model has the worst performance so far, with an accuracy of around 50%. The out-of-sample error will be higher still, but we cannot estimate by how much.

library(rattle)
modT <- train(classe ~ ., data = trainSet, method = "rpart")

predictionsT <- predict(modT, testSet)

confusionMatrix(predictionsT, testSet$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1054  319  338  291  111
##          B   18  272   30  106   86
##          C   95  206  350  278  208
##          D    0    0    0    0    0
##          E    4    0    0    0  352
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4925          
##                  95% CI : (0.4771, 0.5079)
##     No Information Rate : 0.2844          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3374          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9001  0.34128  0.48747   0.0000  0.46499
## Specificity            0.6407  0.92773  0.76853   1.0000  0.99881
## Pos Pred Value         0.4988  0.53125  0.30783      NaN  0.98876
## Neg Pred Value         0.9416  0.85441  0.87655   0.8361  0.89234
## Prevalence             0.2844  0.19354  0.17436   0.1639  0.18383
## Detection Rate         0.2559  0.06605  0.08499   0.0000  0.08548
## Detection Prevalence   0.5131  0.12433  0.27610   0.0000  0.08645
## Balanced Accuracy      0.7704  0.63451  0.62800   0.5000  0.73190
fancyRpartPlot(modT$finalModel) #plot the final classification tree

5. Final Model

Our first model showed such a high accuracy that it is used as a standalone model on the validation set, where it achieves an accuracy of 0.9862 with a 95% confidence interval of (0.9829, 0.9891). The drop in accuracy due to the out-of-sample error was much smaller than expected: only 0.0017, or 0.17%.

predictions <- predict(modRF, validation)

confusionMatrix(predictions, validation$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1665   14    0    0    0
##          B    6 1111    9    0    4
##          C    2   12 1012   16    1
##          D    0    1    5  946    7
##          E    1    1    0    2 1070
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9862          
##                  95% CI : (0.9829, 0.9891)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9826          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9946   0.9754   0.9864   0.9813   0.9889
## Specificity            0.9967   0.9960   0.9936   0.9974   0.9992
## Pos Pred Value         0.9917   0.9832   0.9703   0.9864   0.9963
## Neg Pred Value         0.9979   0.9941   0.9971   0.9963   0.9975
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2829   0.1888   0.1720   0.1607   0.1818
## Detection Prevalence   0.2853   0.1920   0.1772   0.1630   0.1825
## Balanced Accuracy      0.9956   0.9857   0.9900   0.9893   0.9940

Course Project Prediction Quiz

predictionsQuiz <- predict(modRF, quizSet)
quizDF <- data.frame(predictions = predictionsQuiz, quizSet)
#quizDF[, 1:2] shows the predictions; these passed 20/20 on the quiz

References

  1. Velloso, E.; Bulling, A.; Gellersen, H.; Ugulino, W.; Fuks, H. Qualitative Activity Recognition of Weight Lifting Exercises. Proceedings of the 4th International Conference in Cooperation with SIGCHI (Augmented Human ’13). Stuttgart, Germany: ACM SIGCHI, 2013. http://groupware.les.inf.puc-rio.br/har