One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set.
Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.
Will be performed by subsampling the training data set randomly without replacement into 2 subsamples: subTraining data (70% of the original Training data set) and subTesting data (30%). Our models will be fitted on the subTraining data set, and tested on the subTesting data. Once the most accurate model is choosen, it will be tested on the original Testing data set.
The pseudo-random number generator seed was set at 1 for all code. For this project load this library.
##Seed
set.seed(1)
##Libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
library(RColorBrewer)
library(ggplot2)
The training data for this project are available here:
The test data are available here:
The data for this project come from this source: link. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
library(data.table)
## Data training
url_data <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
data <- fread(url_data,
na.strings = c("NA,",""))
## Data testing
url_validation <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
validation <- fread(url_validation,
na.strings = c("NA", ""))
look at the structure of training data and delete columns with all missing values or irrelevant data.
## Training and testing
MissingValues <- sapply(data, function (x) any(is.na(x) | x == ""))
Predictor <- !MissingValues & grepl("belt|[^(fore)]arm|dumbbell|forearm",
names(MissingValues))
predCandidates <- names(MissingValues)[Predictor]
varToInclude <- c("classe", predCandidates)
data <- data[,varToInclude, with = FALSE]
paste0("dimension data pre cleaning: ",dim(data))
## [1] "dimension data pre cleaning: 19622" "dimension data pre cleaning: 120"
data[,.N,classe]
myDataNZV <- nearZeroVar(data, saveMetrics=TRUE)
data <- subset(data, select = which(myDataNZV$nzv == FALSE))
paste0("dimensione data post cleaning: ",dim(data))
## [1] "dimensione data post cleaning: 19622"
## [2] "dimensione data post cleaning: 53"
Partition the data set in two data sets
inTrain <- createDataPartition(y = data$classe, p = 0.7,
list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
dim(training); dim(testing)
## [1] 13737 53
## [1] 5885 53
We train two classification-orientated models on the training set, using random forests (rf in R’s caret package), boosted trees (gbm). We train the models using 2-fold cross-validation. In other words, for each model and each candidate combination of tuning parameters (for instance, the tuning parameter for the random forest model is mtry, the number of randomly selected predictors), one fifth of the training data is held out and serves as a cross-validation set on which the accuracy is calculated after the model has been trained on the rest of the data. The best model is then fitted on the full training set.
trainCtrl <- trainControl(method = "cv", number = 2)
set.seed(1)
modRf <- train(classe ~ ., method = "rf", data = training,
trControl = trainCtrl)
set.seed(1)
modGbm <- train(classe ~ ., method="gbm", data = training,
trControl = trainCtrl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.1337
## 2 1.5207 -nan 0.1000 0.0902
## 3 1.4607 -nan 0.1000 0.0675
## 4 1.4154 -nan 0.1000 0.0531
## 5 1.3799 -nan 0.1000 0.0497
## 6 1.3468 -nan 0.1000 0.0402
## 7 1.3200 -nan 0.1000 0.0378
## 8 1.2945 -nan 0.1000 0.0314
## 9 1.2735 -nan 0.1000 0.0286
## 10 1.2527 -nan 0.1000 0.0321
## 20 1.0966 -nan 0.1000 0.0179
## 40 0.9152 -nan 0.1000 0.0077
## 60 0.8037 -nan 0.1000 0.0046
## 80 0.7262 -nan 0.1000 0.0040
## 100 0.6625 -nan 0.1000 0.0043
## 120 0.6075 -nan 0.1000 0.0024
## 140 0.5628 -nan 0.1000 0.0020
## 150 0.5448 -nan 0.1000 0.0010
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.1836
## 2 1.4859 -nan 0.1000 0.1320
## 3 1.3990 -nan 0.1000 0.1047
## 4 1.3317 -nan 0.1000 0.0854
## 5 1.2755 -nan 0.1000 0.0675
## 6 1.2323 -nan 0.1000 0.0588
## 7 1.1939 -nan 0.1000 0.0652
## 8 1.1520 -nan 0.1000 0.0481
## 9 1.1208 -nan 0.1000 0.0500
## 10 1.0887 -nan 0.1000 0.0450
## 20 0.8861 -nan 0.1000 0.0216
## 40 0.6637 -nan 0.1000 0.0092
## 60 0.5375 -nan 0.1000 0.0065
## 80 0.4516 -nan 0.1000 0.0039
## 100 0.3826 -nan 0.1000 0.0054
## 120 0.3311 -nan 0.1000 0.0021
## 140 0.2885 -nan 0.1000 0.0009
## 150 0.2721 -nan 0.1000 0.0013
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.2348
## 2 1.4595 -nan 0.1000 0.1680
## 3 1.3534 -nan 0.1000 0.1214
## 4 1.2767 -nan 0.1000 0.1048
## 5 1.2114 -nan 0.1000 0.0926
## 6 1.1519 -nan 0.1000 0.0751
## 7 1.1038 -nan 0.1000 0.0676
## 8 1.0602 -nan 0.1000 0.0715
## 9 1.0141 -nan 0.1000 0.0500
## 10 0.9789 -nan 0.1000 0.0540
## 20 0.7446 -nan 0.1000 0.0235
## 40 0.5074 -nan 0.1000 0.0107
## 60 0.3862 -nan 0.1000 0.0060
## 80 0.3086 -nan 0.1000 0.0053
## 100 0.2506 -nan 0.1000 0.0021
## 120 0.2100 -nan 0.1000 0.0017
## 140 0.1761 -nan 0.1000 0.0007
## 150 0.1622 -nan 0.1000 0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.1303
## 2 1.5222 -nan 0.1000 0.0874
## 3 1.4648 -nan 0.1000 0.0724
## 4 1.4198 -nan 0.1000 0.0573
## 5 1.3830 -nan 0.1000 0.0424
## 6 1.3544 -nan 0.1000 0.0475
## 7 1.3232 -nan 0.1000 0.0377
## 8 1.2981 -nan 0.1000 0.0383
## 9 1.2736 -nan 0.1000 0.0342
## 10 1.2519 -nan 0.1000 0.0317
## 20 1.0915 -nan 0.1000 0.0171
## 40 0.9172 -nan 0.1000 0.0093
## 60 0.8084 -nan 0.1000 0.0055
## 80 0.7319 -nan 0.1000 0.0053
## 100 0.6690 -nan 0.1000 0.0045
## 120 0.6172 -nan 0.1000 0.0025
## 140 0.5717 -nan 0.1000 0.0016
## 150 0.5535 -nan 0.1000 0.0021
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.1818
## 2 1.4880 -nan 0.1000 0.1392
## 3 1.3997 -nan 0.1000 0.1028
## 4 1.3333 -nan 0.1000 0.0863
## 5 1.2767 -nan 0.1000 0.0766
## 6 1.2267 -nan 0.1000 0.0564
## 7 1.1897 -nan 0.1000 0.0560
## 8 1.1534 -nan 0.1000 0.0530
## 9 1.1196 -nan 0.1000 0.0515
## 10 1.0877 -nan 0.1000 0.0382
## 20 0.8869 -nan 0.1000 0.0242
## 40 0.6746 -nan 0.1000 0.0080
## 60 0.5422 -nan 0.1000 0.0058
## 80 0.4502 -nan 0.1000 0.0040
## 100 0.3864 -nan 0.1000 0.0028
## 120 0.3344 -nan 0.1000 0.0023
## 140 0.2945 -nan 0.1000 0.0017
## 150 0.2745 -nan 0.1000 0.0019
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.2424
## 2 1.4576 -nan 0.1000 0.1702
## 3 1.3512 -nan 0.1000 0.1281
## 4 1.2693 -nan 0.1000 0.1037
## 5 1.2031 -nan 0.1000 0.0911
## 6 1.1450 -nan 0.1000 0.0738
## 7 1.0973 -nan 0.1000 0.0737
## 8 1.0500 -nan 0.1000 0.0594
## 9 1.0113 -nan 0.1000 0.0554
## 10 0.9736 -nan 0.1000 0.0519
## 20 0.7418 -nan 0.1000 0.0249
## 40 0.5155 -nan 0.1000 0.0085
## 60 0.3922 -nan 0.1000 0.0077
## 80 0.3109 -nan 0.1000 0.0027
## 100 0.2507 -nan 0.1000 0.0021
## 120 0.2081 -nan 0.1000 0.0013
## 140 0.1732 -nan 0.1000 0.0007
## 150 0.1604 -nan 0.1000 0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 -nan 0.1000 0.2439
## 2 1.4584 -nan 0.1000 0.1618
## 3 1.3565 -nan 0.1000 0.1290
## 4 1.2744 -nan 0.1000 0.1043
## 5 1.2077 -nan 0.1000 0.0963
## 6 1.1482 -nan 0.1000 0.0682
## 7 1.1043 -nan 0.1000 0.0628
## 8 1.0640 -nan 0.1000 0.0695
## 9 1.0210 -nan 0.1000 0.0569
## 10 0.9847 -nan 0.1000 0.0489
## 20 0.7572 -nan 0.1000 0.0248
## 40 0.5256 -nan 0.1000 0.0116
## 60 0.4009 -nan 0.1000 0.0054
## 80 0.3176 -nan 0.1000 0.0042
## 100 0.2613 -nan 0.1000 0.0029
## 120 0.2202 -nan 0.1000 0.0012
## 140 0.1873 -nan 0.1000 0.0016
## 150 0.1734 -nan 0.1000 0.0017
Before we use these models to predict classes on the test set, we compare their resampling distributions (noting that there are 2 resamples for each model as we used 2-fold cross-validation) to get a general idea of the difference in performance between the models, and to estimate the out of sample error.
# collect resamples
results <- resamples(list(RF=modRf, GBM=modGbm))
# summary of the resampling distributions
summary(results)
##
## Call:
## summary.resamples(object = results)
##
## Models: RF, GBM
## Number of resamples: 2
##
## Accuracy
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9819453 0.9832561 0.9845670 0.9845670 0.9858779 0.9871888 0
## GBM 0.9558824 0.9569394 0.9579965 0.9579965 0.9590536 0.9601106 0
##
## Kappa
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## RF 0.9771553 0.9788156 0.9804759 0.9804759 0.9821362 0.9837964 0
## GBM 0.9441862 0.9455198 0.9468535 0.9468535 0.9481871 0.9495208 0
dotplot(results)
We now use the models to predict the class of the activities in the testing data set, and calculate the accuracy (or, equivalently, the out of sample error) of the models on this set.
# random forest
predRfTest <- predict(modRf, newdata=testing)
confusionMatrix(as.factor(testing$classe), predRfTest)$table
## Reference
## Prediction A B C D E
## A 1670 3 1 0 0
## B 5 1133 0 1 0
## C 0 13 1006 7 0
## D 0 0 14 949 1
## E 1 1 0 2 1078
# boosted trees
predGbmTest <- predict(modGbm, newdata=testing)
confusionMatrix(as.factor(testing$classe), predGbmTest)$table
## Reference
## Prediction A B C D E
## A 1638 20 13 1 2
## B 35 1074 28 2 0
## C 0 38 967 17 4
## D 0 3 35 919 7
## E 1 12 5 21 1043
After predicting the classes on the testing set, we obtain:
99.4% accuracy on the testing set and a kappa (κ) value of 0.99, for the model fitted using random forests.
96.3% accuracy and κ= 0.953 for boosted trees.
The accuracy of the random forest model is excellent, and the accuracy of the boosted trees model is also very good.
We then fit a model that combines the two best predictors, namely those produced by the random forest model and the boosted trees model, using a random forest, on the test set.
# fit the combination
predRfGbmTest <- data.frame(predRf = predRfTest, predGbm = predGbmTest,
classe=testing$classe)
modCombRfGbm <- train(classe ~ ., model = "rf", data=predRfGbmTest,
trControl = trainCtrl)
# predict on testing
predCombPredTest <- predict(modCombRfGbm, predRfGbmTest)
We use this final model to predict the classes on the validation set, and display the resulting confusion matrix.
# first predict using the random forest and boosted trees models
predRfValidation <- predict(modRf, newdata=testing)
predGbmValidation <- predict(modGbm, newdata=testing)
# feed these predictions into our final model
predCombRfGbmValidation <- predict(modCombRfGbm,
newdata=data.frame(predRf = predRfValidation,
predGbm = predGbmValidation))
# confusion matrix
cmCombRfGbmValidation <- confusionMatrix(as.factor(testing$classe),
predCombRfGbmValidation)
cmCombRfGbmValidation$table
## Reference
## Prediction A B C D E
## A 1670 3 1 0 0
## B 5 1133 0 1 0
## C 0 13 1006 7 0
## D 0 0 14 949 1
## E 1 1 0 2 1078
# accuracy, out of sample error, kappa on validation set
accCombRfGbmValidation <- cmCombRfGbmValidation$overall[['Accuracy']]
oseCombRfGbmValidation <- 1-accCombRfGbmValidation
kappaCombRfGbmValidation <- cmCombRfGbmValidation$overall[['Kappa']]
To make the final predictions, we first load the test cases, removing non-available values as we did for the training data file. We then use the previously fitted random forest and boosted trees models to make our first set of predictions, which we then pass to our model that combines the two predictors.
predTestCasesRf <- predict(modRf, newdata=validation)
predTestCasesGbm <- predict(modGbm, newdata=validation)
# prediction
predTestCasesCombModFit <- predict(modCombRfGbm,
newdata = data.frame(
predRf = predTestCasesRf,
predGbm = predTestCasesGbm))
predTestCasesCombModFit
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E