Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, I use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).
The training data for this project is available here: pml-training.csv
The test data is available here: pml-testing.csv
For reproducibility, the following libraries should be installed and loaded in order to accomplish all processing steps included in this report (caret loads the method-specific packages rpart, gbm, and randomForest on demand):
library(caret)
library(rattle)
When it comes to machine learning, the quality of the data is a crucial element. Incomplete, irrelevant, and inaccurate data sets are all sources of errors that will inevitably be incorporated into any machine learning analysis, giving the adage "garbage in, garbage out" its full meaning. For an analysis with such quality demands, one should thus spend enough time wrangling the data carefully.
Fortunately, the tidying of the data was already done [1] (a data summary is given here).
Nonetheless, a first look at its content shows a substantial number of missing and/or irrelevant entries ("NA", "", "#DIV/0!") whose notation needs to be normalized.
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv", "pml-training.csv")
#download.file("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv", "pml-testing.csv")
preTest <- read.csv("pml-testing.csv", na.strings = c("NA", "", "#DIV/0!"))
preTrain <- read.csv("pml-training.csv", na.strings = c("NA", "", "#DIV/0!"))
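A quick sanity check of the raw data before any processing:
# The two files share the same predictors; the training set carries the
# target "classe" where the test set carries "problem_id"
dim(preTrain)
dim(preTest)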
The data is then processed as follows:
the 1st and 6th columns (respectively the observation number and "new_window") are irrelevant in this context and are thus excluded,
the columns with 90% or more "NA" values are also excluded,
when two predictors present a high pairwise correlation (in terms of Pearson's correlation coefficient [2]), only one is kept (the one with the smaller mean absolute correlation with the remaining predictors); the cut-off value of the correlation coefficient is usually chosen above 0.75, but since we use variables recorded by body sensors that could mutually influence each other, I choose the slightly higher value of 0.8,
the predictors with near-zero variance should be excluded; however, all of them were already excluded in the previous steps,
we subset the training data into training and testing sets and set up the training run with the (x, y) syntax [3].
# 1. Exclude the 1st and 6th columns
preTest <- preTest[, -c(1, 6)]
preTrain <- preTrain[, -c(1, 6)]
# 2. Exclude the columns with too many NAs (cut off at 90%)
training <- preTrain[, colSums(is.na(preTrain)) / nrow(preTrain) < 0.9]
cIndex <- which(!(colnames(preTrain) %in% colnames(training)))
testing <- preTest[, -cIndex]
# 3. Exclude the predictors that are highly correlated
analyseDataTrain <- sapply(training[,-58], as.numeric) # column 58 is the target variable
corMat <- cor(analyseDataTrain)
highCor <- findCorrelation(corMat, cutoff=0.8)
trainingCor <- training[, -highCor]
testingCor <- testing[, -highCor]
# 4. Exclude the predictors with very small variance
#nsv <- nearZeroVar(trainingCor) # in this case nsv is empty
# 5. Subset the data
set.seed(123)
subsets <- createDataPartition(y=trainingCor$classe, p=0.75, list=FALSE)
subTrainingCor <- trainingCor[subsets, ]
subTestingCor <- trainingCor[-subsets, ]
x <- subTrainingCor[, -44] # column 44 is the target variable
y <- subTrainingCor[, 44]
rm(preTest, preTrain,testing, training, cIndex, analyseDataTrain, corMat, highCor, subsets)
| Computer | Configuration |
|---|---|
| Samsung Series 5 Ultra | Operating system: Windows 10 (64-bit); Processor: Intel Core i5-3337U @ 1.80 GHz (up to 2.7 GHz, 2 cores, 4 threads); RAM: 8 GB; Disk: 512 GB SSD |
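Since the trainControl() call below sets allowParallel = TRUE, caret can distribute the resampling across cores, provided a parallel backend is registered first. A minimal sketch, following the approach referenced in [3] (leaving one core free is a common convention, not part of the original report):
# Register a parallel backend so that allowParallel = TRUE takes effect
library(parallel)
library(doParallel)
cluster <- makeCluster(detectCores() - 1)  # leave one core for the OS
registerDoParallel(cluster)
# ... run the train() calls, then release the workers:
# stopCluster(cluster); registerDoSEQ()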
The trainControl was set to use K-fold cross-validation, as it represents a robust method to estimate the model's accuracy. The choice of k = 5 has been empirically shown to avoid both high bias and high variance when estimating the test error rate [4].
In this experiment, 6 participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (class A), throwing the elbows to the front (class B), lifting the dumbbell only halfway (class C), lowering the dumbbell only halfway (class D), and throwing the hips to the front (class E) [1].
While class A corresponds to the specified execution of the exercise, the other 4 classes correspond to common mistakes.
As we try to predict these behaviors through the predictor variables, the target variable, "classe", is thus a factor with 5 levels corresponding to the five preceding cases.
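As a quick check, the levels and their distribution can be inspected directly (using the y vector defined above):
# The five class labels and their distribution in the training subset
levels(y)
table(y)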
To illustrate the difference in accuracy according to the method used, three different machine learning methods are compared: recursive partitioning (rpart), gradient boosted trees (gbm), and random forest (rf).
set.seed(123)
control = trainControl(method = "cv", number = 5, allowParallel = TRUE)
modelRPART <- train(x, y, method = "rpart", trControl = control)
modelRPART$results
## cp Accuracy Kappa AccuracySD KappaSD
## 1 0.03398842 0.5565960 0.4378718 0.009052306 0.01130354
## 2 0.03792842 0.5119644 0.3733120 0.036654830 0.05876857
## 3 0.06728852 0.3606495 0.1255237 0.104492903 0.17188225
predRPART <- predict(modelRPART, subTestingCor)
cfMatRPART <- confusionMatrix(subTestingCor$classe, predRPART)
cfMatRPART$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 5.309951e-01 4.053248e-01 5.169111e-01 5.450422e-01 4.783850e-01
## AccuracyPValue McnemarPValue
## 9.423710e-14 0.000000e+00
cfMatRPART$table
## Reference
## Prediction A B C D E
## A 1180 31 184 0 0
## B 158 317 472 0 2
## C 16 32 806 0 1
## D 71 152 521 0 60
## E 69 168 363 0 301
fancyRpartPlot(modelRPART$finalModel)
The model obtained has a low accuracy of 55.66% and cannot be used to predict the results of the quiz.
set.seed(123)
modelGBM <- train(x, y, method = "gbm", trControl = control)
modelGBM$results
## shrinkage interaction.depth n.minobsinnode n.trees Accuracy Kappa
## 1 0.1 1 10 50 0.7842776 0.7262820
## 4 0.1 2 10 50 0.9287265 0.9097723
## 7 0.1 3 10 50 0.9660959 0.9570896
## 2 0.1 1 10 100 0.8660808 0.8303878
## 5 0.1 2 10 100 0.9807720 0.9756724
## 8 0.1 3 10 100 0.9938168 0.9921793
## 3 0.1 1 10 150 0.9033830 0.8776612
## 6 0.1 2 10 150 0.9924579 0.9904598
## 9 0.1 3 10 150 0.9974179 0.9967340
## AccuracySD KappaSD
## 1 0.007445123 0.009479860
## 4 0.005811340 0.007363580
## 7 0.002180500 0.002753041
## 2 0.008072280 0.010331460
## 5 0.001602403 0.002032181
## 8 0.001883989 0.002383054
## 3 0.008613351 0.010989106
## 6 0.002235010 0.002827177
## 9 0.001658198 0.002097462
predGBM <- predict(modelGBM, subTestingCor)
cfMatGBM <- confusionMatrix(subTestingCor$classe, predGBM)
cfMatGBM$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9977569 0.9971627 0.9959901 0.9988798 0.2846656
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
cfMatGBM$table
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 1 948 0 0 0
## C 0 1 853 1 0
## D 0 0 5 798 1
## E 0 0 0 2 899
The model from the GBM method presents very high accuracy on the train set (99.74%) as well as on the test set (99.78%). Since these values are so close, I assume there is no overfitting on the train set. One can notice that the test accuracy is slightly better than the train one; nonetheless, the difference is in fact too small to matter, as it corresponds to fewer than two additional good predictions on the test set and could be induced by the smaller number of observations. The computational time is 579 s.
set.seed(123)
modelRF <- train(x, y, method = "rf", trControl = control)
modelRF$results
## mtry Accuracy Kappa AccuracySD KappaSD
## 1 2 0.9952433 0.9939829 0.002150399 0.002720445
## 2 22 0.9986407 0.9982806 0.001749896 0.002213406
## 3 43 0.9967383 0.9958746 0.001743711 0.002205343
predRF <- predict(modelRF, subTestingCor)
cfMatRF <- confusionMatrix(subTestingCor$classe, predRF)
rm(control)
cfMatRF$overall
## Accuracy Kappa AccuracyLower AccuracyUpper AccuracyNull
## 0.9989804 0.9987104 0.9976223 0.9996689 0.2844617
## AccuracyPValue McnemarPValue
## 0.0000000 NaN
cfMatRF$table
## Reference
## Prediction A B C D E
## A 1395 0 0 0 0
## B 0 948 1 0 0
## C 0 1 854 0 0
## D 0 0 3 801 0
## E 0 0 0 0 901
The obtained model shows very high accuracies. As with the GBM method, the train and test accuracies are once again very close, with the latter slightly better than the former. There is thus no sign of overfitting on the train set.
Yet, to rule out any doubt about the resampling method, I computed a new model using repeatedcv on the same data set. The accuracy obtained with 5 repeats (99.88%) is virtually identical to the one obtained previously.
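That check is not part of the timed comparison below; a minimal sketch, assuming 5 folds and 5 repeats:
# Repeated K-fold cross-validation as a sanity check on the resampling
set.seed(123)
controlRep <- trainControl(method = "repeatedcv", number = 5, repeats = 5, allowParallel = TRUE)
modelRFRep <- train(x, y, method = "rf", trControl = controlRep)
max(modelRFRep$results$Accuracy)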
results <- resamples(list(RPART=modelRPART, GBM=modelGBM, RF=modelRF))
results$timings
## Everything FinalModel Prediction
## RPART 49.65 1.23 NA
## GBM 578.84 95.15 NA
## RF 1234.17 170.50 NA
bwplot(results)
Apart from the RPART method, the models obtained with the GBM and RF methods fit the test data very well. The standard deviations of the accuracies are also very small, which suggests that the models perform consistently across folds. The big difference in processing time between the GBM and RF methods (more than double) raises the question of performance when using a time-consuming method for only a small improvement. Of course, for this project, one can afford to use the RF method to predict the cases from the quiz. However, once the data becomes very big, it is important to weigh all these different aspects, taking into consideration the acceptable margin of error.
importanceRF <- varImp(modelRF, scale = FALSE)
plot(importanceRF, top= 10)
It is very interesting to notice that the most important predictors used in the RF method are raw_timestamp_part_1 and num_window. Since the data was recorded by body sensors using a sliding time window with different durations, it is therefore necessary to look at the observations according to their corresponding time stamp and window number. While this behavior is intuitive for a human being, it is not obvious for a machine. Fortunately, the RF algorithm learned enough from the data to spot this fact.
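For a numeric counterpart of the importance plot, the raw scores can be listed as well:
# Top 10 predictors by raw importance score
imp <- importanceRF$importance
head(imp[order(imp$Overall, decreasing = TRUE), , drop = FALSE], 10)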
predict(modelRF, testingCor)
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
# 100% accuracy
#rm(list = ls())
[1] Velloso, E., Bulling, A., Gellersen, H., Ugulino, W., & Fuks, H. (2013, March). Qualitative activity recognition of weight lifting exercises. In Proceedings of the 4th Augmented Human International Conference (pp. 116-123). ACM.
[2] Pearson, K. (1896). Mathematical contributions to the theory of evolution. III. Regression, heredity, and panmixia. Philosophical Transactions of the Royal Society of London, Series A, 187, 253-318.
[3] https://github.com/lgreski/datasciencectacontent/blob/master/markdown/pml-randomForestPerformance.md
[4] http://www.sthda.com/english/articles/38-regression-model-validation/157-cross-validation-essentials-in-r/