Background

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here. See the section on the Weight Lifting Exercise Dataset.

Data

The training data for this project is available here: pml-training.csv

The test data is available here: pml-testing.csv

For reproducibility, the following libraries should be installed and loaded to run all processing steps in this report:

library(caret)  # model training, resampling, and evaluation
library(rattle) # fancyRpartPlot() for plotting the decision tree

Data Processing

When it comes to machine learning, data quality should be considered the most crucial element. Incomplete, irrelevant, and inaccurate data sets are all sources of errors that will inevitably be carried into any machine learning analysis, giving the adage “garbage in, garbage out” its full meaning. For such a quality-demanding analysis, one should therefore spend enough time wrangling the data carefully.
Fortunately, the tidying of the data was already done [1] (a data summary is given here).
Nonetheless, a first look at its content shows a substantial number of missing and/or irrelevant entries (“NA”, “”, “#DIV/0!”) whose notation needs to be normalized.

#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv")
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv")
preTest <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
preTrain <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!")) 
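
As a quick sanity check after loading (the dimensions below correspond to the published data set):

dim(preTrain) # 19622 observations of 160 variables
dim(preTest)  # 20 quiz cases with the same columns, except that problem_id replaces classe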

The data is then processed as follows:

  1. the 1st and 6th columns (respectively the observation number and “new_window”) are irrelevant in this context and are thus excluded,

  2. the columns in which 90% or more of the values are “NA” are also excluded,

  3. when two predictors show high pairwise correlation (in terms of Pearson’s correlation coefficient [2]), only one of them is kept (the one with the smaller mean absolute correlation). The cut-off value for the correlation coefficient is usually chosen above 0.75. Since we use variables recorded by body sensors that could mutually influence one another, I choose the slightly higher value of 0.8,

  4. the predictors with near-zero variance should be excluded. However, all of them were already excluded in the previous steps,

  5. the training data is split into training and testing subsets, and the training run is set up with the (x, y) syntax [3].

# 1. Exclude the 1st and 6th columns
preTest <- preTest[, -c(1, 6)] 
preTrain <- preTrain[, -c(1, 6)]

# 2. Exclude the columns with too many NAs (cut off at 90%)
training <- preTrain[, colSums(is.na(preTrain)) / nrow(preTrain) < 0.9]
cIndex <- which(!(colnames(preTrain) %in% colnames(training)))
testing <- preTest[, -cIndex]

# 3. Exclude the predictors that are highly correlated
analyseDataTrain <- sapply(training[,-58], as.numeric) # column 58 is the target variable
corMat <- cor(analyseDataTrain)
highCor <- findCorrelation(corMat, cutoff=0.8)
trainingCor <- training[, -highCor]
testingCor <- testing[, -highCor]

# 4. Exclude the predictors with very small variance
#nsv <- nearZeroVar(trainingCor) # in this case nsv is empty

# 5. Subset the data
set.seed(123)
subsets <- createDataPartition(y=trainingCor$classe, p=0.75, list=FALSE)
subTrainingCor <- trainingCor[subsets, ] 
subTestingCor <- trainingCor[-subsets, ]
x <- subTrainingCor[, -44] # column 44 is the target variable
y <- subTrainingCor[, 44]
rm(preTest, preTrain,testing, training, cIndex, analyseDataTrain, corMat, highCor, subsets)

Computer configuration used in this project

Samsung Series 5 Ultra
* Operating system: Windows 10 (64-bit)
* Processor: Intel Core i5-3337U @ 1.80 GHz (up to 2.70 GHz, 2 cores, 4 threads)
* RAM: 8 GB
* Disk: 512 GB SSD

Model fitting

The trainControl object was set to use k-fold cross-validation, as it is a robust method for estimating a model’s accuracy. The choice of k = 5 has been shown empirically to avoid both high bias and high variance when estimating the test error rate [4].

In this experiment, 6 participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions [1]:

* Class A: exactly according to the specification,
* Class B: throwing the elbows to the front,
* Class C: lifting the dumbbell only halfway,
* Class D: lowering the dumbbell only halfway,
* Class E: throwing the hips to the front.

While class A corresponds to the specified execution of the exercise, the other 4 classes correspond to common mistakes.

As we try to predict these behaviors from the predictor variables, the target variable, “classe”, is a factor with 5 levels corresponding to the five preceding cases.
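
As a quick check of the target (a minimal sketch using the y vector defined above):

str(y)   # Factor w/ 5 levels "A","B","C","D","E"
table(y) # class counts in the training subset; class A is the most frequent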

To illustrate how accuracy differs with the method used, three machine learning methods are compared:

1. Decision tree

set.seed(123)
control <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelRPART <- train(x, y, method = "rpart", trControl = control) 
modelRPART$results
##           cp  Accuracy     Kappa  AccuracySD    KappaSD
## 1 0.03398842 0.5565960 0.4378718 0.009052306 0.01130354
## 2 0.03792842 0.5119644 0.3733120 0.036654830 0.05876857
## 3 0.06728852 0.3606495 0.1255237 0.104492903 0.17188225
predRPART <- predict(modelRPART, subTestingCor)
cfMatRPART <- confusionMatrix(subTestingCor$classe, predRPART)
cfMatRPART$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##   5.309951e-01   4.053248e-01   5.169111e-01   5.450422e-01   4.783850e-01 
## AccuracyPValue  McnemarPValue 
##   9.423710e-14   0.000000e+00
cfMatRPART$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1180   31  184    0    0
##          B  158  317  472    0    2
##          C   16   32  806    0    1
##          D   71  152  521    0   60
##          E   69  168  363    0  301
fancyRpartPlot(modelRPART$finalModel)

The model obtained has a low accuracy of 55.66% and cannot be used to predict the quiz results.

2. Stochastic gradient boosting

set.seed(123) 
modelGBM <- train(x, y, method = "gbm", trControl = control) 
modelGBM$results
##   shrinkage interaction.depth n.minobsinnode n.trees  Accuracy     Kappa
## 1       0.1                 1             10      50 0.7842776 0.7262820
## 4       0.1                 2             10      50 0.9287265 0.9097723
## 7       0.1                 3             10      50 0.9660959 0.9570896
## 2       0.1                 1             10     100 0.8660808 0.8303878
## 5       0.1                 2             10     100 0.9807720 0.9756724
## 8       0.1                 3             10     100 0.9938168 0.9921793
## 3       0.1                 1             10     150 0.9033830 0.8776612
## 6       0.1                 2             10     150 0.9924579 0.9904598
## 9       0.1                 3             10     150 0.9974179 0.9967340
##    AccuracySD     KappaSD
## 1 0.007445123 0.009479860
## 4 0.005811340 0.007363580
## 7 0.002180500 0.002753041
## 2 0.008072280 0.010331460
## 5 0.001602403 0.002032181
## 8 0.001883989 0.002383054
## 3 0.008613351 0.010989106
## 6 0.002235010 0.002827177
## 9 0.001658198 0.002097462
predGBM <- predict(modelGBM, subTestingCor) 
cfMatGBM <- confusionMatrix(subTestingCor$classe, predGBM)
cfMatGBM$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9977569      0.9971627      0.9959901      0.9988798      0.2846656 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
cfMatGBM$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    1  948    0    0    0
##          C    0    1  853    1    0
##          D    0    0    5  798    1
##          E    0    0    0    2  899

The model from the GBM method shows very high accuracy on the train set (99.74%) as well as on the test set (99.78%). Since these values are so close, I assume that there is no overfitting on the train set. One can notice that the test accuracy is slightly better than the train accuracy; however, the difference is negligible, as it corresponds to fewer than two additional correct predictions on the test set and could be induced by the smaller number of observations. The computation time is 579 s.
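
To make the “fewer than two predictions” point concrete, the accuracy gap can be converted into a number of observations (a small sketch using objects defined above):

accGap <- cfMatGBM$overall["Accuracy"] - max(modelGBM$results$Accuracy)
accGap * nrow(subTestingCor) # ~1.7, i.e. fewer than two additional correct predictions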

3. Random forest

set.seed(123)
modelRF <- train(x, y, method = "rf", trControl = control)
modelRF$results 
##   mtry  Accuracy     Kappa  AccuracySD     KappaSD
## 1    2 0.9952433 0.9939829 0.002150399 0.002720445
## 2   22 0.9986407 0.9982806 0.001749896 0.002213406
## 3   43 0.9967383 0.9958746 0.001743711 0.002205343
predRF <- predict(modelRF, subTestingCor)
cfMatRF <- confusionMatrix(subTestingCor$classe, predRF)
rm(control)
cfMatRF$overall
##       Accuracy          Kappa  AccuracyLower  AccuracyUpper   AccuracyNull 
##      0.9989804      0.9987104      0.9976223      0.9996689      0.2844617 
## AccuracyPValue  McnemarPValue 
##      0.0000000            NaN
cfMatRF$table
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    0    0    0    0
##          B    0  948    1    0    0
##          C    0    1  854    0    0
##          D    0    0    3  801    0
##          E    0    0    0    0  901

The obtained model shows very high accuracies. As with the GBM method, the train and test accuracies are once again very close, with the latter slightly better than the former. There is thus no overfitting on the train set.

Still, to rule out doubts about the resampling method, I computed a new model using repeatedcv on the same data set. The accuracy obtained with 5 repeats (99.88%) is virtually identical to the one obtained previously.
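
For reference, a minimal sketch of that check, assuming 5 folds and 5 repeats (the object names are illustrative):

set.seed(123)
controlRep <- trainControl(method = "repeatedcv", number = 5, repeats = 5, allowParallel = TRUE)
modelRFRep <- train(x, y, method = "rf", trControl = controlRep)
max(modelRFRep$results$Accuracy) # ~0.9988, in line with the k-fold estimate above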

Comparison between the different methods

results <- resamples(list(RPART=modelRPART, GBM=modelGBM, RF=modelRF))
results$timings 
##       Everything FinalModel Prediction
## RPART      49.65       1.23         NA
## GBM       578.84      95.15         NA
## RF       1234.17     170.50         NA
bwplot(results)

Apart from the RPART method, the models obtained with the GBM and RF methods fit the test data very well. The standard deviations of the accuracies are also very small, which suggests that the accuracy estimates are stable. The large difference in processing time between the GBM and RF methods (more than double) raises a performance question: is a time-consuming method worth such a small improvement? For this project, one can certainly afford to use the RF method to predict the quiz cases. However, once the data becomes very big, it is important to weigh all these aspects against the acceptable margin of error.
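
Since allowParallel = TRUE only takes effect when a parallel backend is registered, the RF training time can be reduced along the lines of the performance notes in [3]; a minimal sketch, assuming the doParallel package is installed (the object names are illustrative):

library(doParallel)
cl <- makeCluster(detectCores() - 1) # leave one core for the OS
registerDoParallel(cl)
controlPar <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelRFPar <- train(x, y, method = "rf", trControl = controlPar) # same model, trained in parallel
stopCluster(cl)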

Predictor importance in Random Forest

importanceRF <- varImp(modelRF, scale = FALSE)
plot(importanceRF, top = 10)

It is very interesting to notice that the most important predictors in the RF model are raw_timestamp_part_1 and num_window. Since the data was recorded by body sensors using a sliding time window with different durations, it is therefore necessary to look at the observations according to their corresponding time stamp and window number. While this behavior is intuitive for a human being, it is not obvious for a machine. Fortunately, the RF algorithm learned enough from the data to spot this fact.
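
The same ranking can be read numerically (a small sketch; for method = "rf", varImp returns a single Overall importance column):

impDF <- importanceRF$importance
head(impDF[order(-impDF$Overall), , drop = FALSE], 10) # raw_timestamp_part_1 and num_window lead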

Application to the data set of the quiz

predict(modelRF, testingCor) 
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# 100% accuracy
#rm(list = ls())

Bibliography


  1. Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013, March). Qualitative activity recognition of weight lifting exercises. In Proceedings of the 4th Augmented Human International Conference (pp. 116-123). ACM.

  2. Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London. Series A, containing papers of a mathematical or physical character, 187, 253-318.

  3. https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md

  4. http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/