Summary

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. The goal of the project is to predict the manner in which they did the exercise. This is the “classe” variable in the training set.

Protocol

Model

Participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions:

Class A: exactly according to the specification,
Class B: throwing the elbows to the front,
Class C: lifting the dumbbell only halfway,
Class D: lowering the dumbbell only halfway,
Class E: throwing the hips to the front.

Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes.

Cross-Validation

Will be performed by subsampling the training data set randomly without replacement into 2 subsamples: subTraining data (70% of the original Training data set) and subTesting data (30%). Our models will be fitted on the subTraining data set, and tested on the subTesting data. Once the most accurate model is choosen, it will be tested on the original Testing data set.

Reproducibility and Library

The pseudo-random number generator seed was set at 1 for all code. For this project load this library.

##Seed
set.seed(1)

##Libraries
library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(rpart)
library(rattle)

## Rattle: A free graphical interface for data science with R.
## Version 5.3.0 Copyright (c) 2006-2018 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

## 
## Attaching package: 'rattle'

## The following object is masked from 'package:randomForest':
## 
##     importance

library(RColorBrewer)
library(ggplot2)

Data

The training data for this project are available here:

training

The test data are available here:

testing

The data for this project come from this source: link. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.

Getting Data

library(data.table)

## Data training
url_data <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
data <- fread(url_data,
              na.strings = c("NA,",""))

## Data testing
url_validation <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
validation <- fread(url_validation, 
                    na.strings = c("NA", ""))

Cleaning Data

look at the structure of training data and delete columns with all missing values or irrelevant data.

## Training and testing
MissingValues <- sapply(data, function (x) any(is.na(x) | x == ""))
Predictor <- !MissingValues & grepl("belt|[^(fore)]arm|dumbbell|forearm",
                                    names(MissingValues))
predCandidates <- names(MissingValues)[Predictor]
varToInclude <- c("classe", predCandidates)
data <- data[,varToInclude, with = FALSE]
paste0("dimension data pre cleaning: ",dim(data))

## [1] "dimension data pre cleaning: 19622" "dimension data pre cleaning: 120"

data[,.N,classe]

myDataNZV <- nearZeroVar(data, saveMetrics=TRUE)
data <- subset(data, select = which(myDataNZV$nzv == FALSE))
paste0("dimensione data post cleaning: ",dim(data))

## [1] "dimensione data post cleaning: 19622"
## [2] "dimensione data post cleaning: 53"

Partition the data set

Partition the data set in two data sets

training 70%
testing 30%

inTrain <- createDataPartition(y = data$classe, p = 0.7, 
                               list = FALSE)
training <- data[inTrain, ]
testing <- data[-inTrain, ]
dim(training); dim(testing)

## [1] 13737    53

## [1] 5885   53

Prediction model

We train two classification-orientated models on the training set, using random forests (rf in R’s caret package), boosted trees (gbm). We train the models using 2-fold cross-validation. In other words, for each model and each candidate combination of tuning parameters (for instance, the tuning parameter for the random forest model is mtry, the number of randomly selected predictors), one fifth of the training data is held out and serves as a cross-validation set on which the accuracy is calculated after the model has been trained on the rest of the data. The best model is then fitted on the full training set.

trainCtrl <- trainControl(method = "cv", number = 2)

set.seed(1)
modRf <- train(classe ~ ., method = "rf", data = training,
               trControl = trainCtrl)
set.seed(1)
modGbm <- train(classe ~ ., method="gbm", data = training,
                trControl = trainCtrl)

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.1337
##      2        1.5207            -nan     0.1000    0.0902
##      3        1.4607            -nan     0.1000    0.0675
##      4        1.4154            -nan     0.1000    0.0531
##      5        1.3799            -nan     0.1000    0.0497
##      6        1.3468            -nan     0.1000    0.0402
##      7        1.3200            -nan     0.1000    0.0378
##      8        1.2945            -nan     0.1000    0.0314
##      9        1.2735            -nan     0.1000    0.0286
##     10        1.2527            -nan     0.1000    0.0321
##     20        1.0966            -nan     0.1000    0.0179
##     40        0.9152            -nan     0.1000    0.0077
##     60        0.8037            -nan     0.1000    0.0046
##     80        0.7262            -nan     0.1000    0.0040
##    100        0.6625            -nan     0.1000    0.0043
##    120        0.6075            -nan     0.1000    0.0024
##    140        0.5628            -nan     0.1000    0.0020
##    150        0.5448            -nan     0.1000    0.0010
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.1836
##      2        1.4859            -nan     0.1000    0.1320
##      3        1.3990            -nan     0.1000    0.1047
##      4        1.3317            -nan     0.1000    0.0854
##      5        1.2755            -nan     0.1000    0.0675
##      6        1.2323            -nan     0.1000    0.0588
##      7        1.1939            -nan     0.1000    0.0652
##      8        1.1520            -nan     0.1000    0.0481
##      9        1.1208            -nan     0.1000    0.0500
##     10        1.0887            -nan     0.1000    0.0450
##     20        0.8861            -nan     0.1000    0.0216
##     40        0.6637            -nan     0.1000    0.0092
##     60        0.5375            -nan     0.1000    0.0065
##     80        0.4516            -nan     0.1000    0.0039
##    100        0.3826            -nan     0.1000    0.0054
##    120        0.3311            -nan     0.1000    0.0021
##    140        0.2885            -nan     0.1000    0.0009
##    150        0.2721            -nan     0.1000    0.0013
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.2348
##      2        1.4595            -nan     0.1000    0.1680
##      3        1.3534            -nan     0.1000    0.1214
##      4        1.2767            -nan     0.1000    0.1048
##      5        1.2114            -nan     0.1000    0.0926
##      6        1.1519            -nan     0.1000    0.0751
##      7        1.1038            -nan     0.1000    0.0676
##      8        1.0602            -nan     0.1000    0.0715
##      9        1.0141            -nan     0.1000    0.0500
##     10        0.9789            -nan     0.1000    0.0540
##     20        0.7446            -nan     0.1000    0.0235
##     40        0.5074            -nan     0.1000    0.0107
##     60        0.3862            -nan     0.1000    0.0060
##     80        0.3086            -nan     0.1000    0.0053
##    100        0.2506            -nan     0.1000    0.0021
##    120        0.2100            -nan     0.1000    0.0017
##    140        0.1761            -nan     0.1000    0.0007
##    150        0.1622            -nan     0.1000    0.0004
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.1303
##      2        1.5222            -nan     0.1000    0.0874
##      3        1.4648            -nan     0.1000    0.0724
##      4        1.4198            -nan     0.1000    0.0573
##      5        1.3830            -nan     0.1000    0.0424
##      6        1.3544            -nan     0.1000    0.0475
##      7        1.3232            -nan     0.1000    0.0377
##      8        1.2981            -nan     0.1000    0.0383
##      9        1.2736            -nan     0.1000    0.0342
##     10        1.2519            -nan     0.1000    0.0317
##     20        1.0915            -nan     0.1000    0.0171
##     40        0.9172            -nan     0.1000    0.0093
##     60        0.8084            -nan     0.1000    0.0055
##     80        0.7319            -nan     0.1000    0.0053
##    100        0.6690            -nan     0.1000    0.0045
##    120        0.6172            -nan     0.1000    0.0025
##    140        0.5717            -nan     0.1000    0.0016
##    150        0.5535            -nan     0.1000    0.0021
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.1818
##      2        1.4880            -nan     0.1000    0.1392
##      3        1.3997            -nan     0.1000    0.1028
##      4        1.3333            -nan     0.1000    0.0863
##      5        1.2767            -nan     0.1000    0.0766
##      6        1.2267            -nan     0.1000    0.0564
##      7        1.1897            -nan     0.1000    0.0560
##      8        1.1534            -nan     0.1000    0.0530
##      9        1.1196            -nan     0.1000    0.0515
##     10        1.0877            -nan     0.1000    0.0382
##     20        0.8869            -nan     0.1000    0.0242
##     40        0.6746            -nan     0.1000    0.0080
##     60        0.5422            -nan     0.1000    0.0058
##     80        0.4502            -nan     0.1000    0.0040
##    100        0.3864            -nan     0.1000    0.0028
##    120        0.3344            -nan     0.1000    0.0023
##    140        0.2945            -nan     0.1000    0.0017
##    150        0.2745            -nan     0.1000    0.0019
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.2424
##      2        1.4576            -nan     0.1000    0.1702
##      3        1.3512            -nan     0.1000    0.1281
##      4        1.2693            -nan     0.1000    0.1037
##      5        1.2031            -nan     0.1000    0.0911
##      6        1.1450            -nan     0.1000    0.0738
##      7        1.0973            -nan     0.1000    0.0737
##      8        1.0500            -nan     0.1000    0.0594
##      9        1.0113            -nan     0.1000    0.0554
##     10        0.9736            -nan     0.1000    0.0519
##     20        0.7418            -nan     0.1000    0.0249
##     40        0.5155            -nan     0.1000    0.0085
##     60        0.3922            -nan     0.1000    0.0077
##     80        0.3109            -nan     0.1000    0.0027
##    100        0.2507            -nan     0.1000    0.0021
##    120        0.2081            -nan     0.1000    0.0013
##    140        0.1732            -nan     0.1000    0.0007
##    150        0.1604            -nan     0.1000    0.0014
## 
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094            -nan     0.1000    0.2439
##      2        1.4584            -nan     0.1000    0.1618
##      3        1.3565            -nan     0.1000    0.1290
##      4        1.2744            -nan     0.1000    0.1043
##      5        1.2077            -nan     0.1000    0.0963
##      6        1.1482            -nan     0.1000    0.0682
##      7        1.1043            -nan     0.1000    0.0628
##      8        1.0640            -nan     0.1000    0.0695
##      9        1.0210            -nan     0.1000    0.0569
##     10        0.9847            -nan     0.1000    0.0489
##     20        0.7572            -nan     0.1000    0.0248
##     40        0.5256            -nan     0.1000    0.0116
##     60        0.4009            -nan     0.1000    0.0054
##     80        0.3176            -nan     0.1000    0.0042
##    100        0.2613            -nan     0.1000    0.0029
##    120        0.2202            -nan     0.1000    0.0012
##    140        0.1873            -nan     0.1000    0.0016
##    150        0.1734            -nan     0.1000    0.0017

Before we use these models to predict classes on the test set, we compare their resampling distributions (noting that there are 2 resamples for each model as we used 2-fold cross-validation) to get a general idea of the difference in performance between the models, and to estimate the out of sample error.

# collect resamples
results <- resamples(list(RF=modRf, GBM=modGbm))

# summary of the resampling distributions
summary(results)

## 
## Call:
## summary.resamples(object = results)
## 
## Models: RF, GBM 
## Number of resamples: 2 
## 
## Accuracy 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9819453 0.9832561 0.9845670 0.9845670 0.9858779 0.9871888    0
## GBM 0.9558824 0.9569394 0.9579965 0.9579965 0.9590536 0.9601106    0
## 
## Kappa 
##          Min.   1st Qu.    Median      Mean   3rd Qu.      Max. NA's
## RF  0.9771553 0.9788156 0.9804759 0.9804759 0.9821362 0.9837964    0
## GBM 0.9441862 0.9455198 0.9468535 0.9468535 0.9481871 0.9495208    0

dotplot(results)

We now use the models to predict the class of the activities in the testing data set, and calculate the accuracy (or, equivalently, the out of sample error) of the models on this set.

# random forest
predRfTest <- predict(modRf, newdata=testing)
confusionMatrix(as.factor(testing$classe), predRfTest)$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1670    3    1    0    0
##          B    5 1133    0    1    0
##          C    0   13 1006    7    0
##          D    0    0   14  949    1
##          E    1    1    0    2 1078

# boosted trees
predGbmTest <- predict(modGbm, newdata=testing)
confusionMatrix(as.factor(testing$classe), predGbmTest)$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1638   20   13    1    2
##          B   35 1074   28    2    0
##          C    0   38  967   17    4
##          D    0    3   35  919    7
##          E    1   12    5   21 1043

After predicting the classes on the testing set, we obtain:

99.4% accuracy on the testing set and a kappa (κ) value of 0.99, for the model fitted using random forests.
96.3% accuracy and κ= 0.953 for boosted trees.

The accuracy of the random forest model is excellent, and the accuracy of the boosted trees model is also very good.

We then fit a model that combines the two best predictors, namely those produced by the random forest model and the boosted trees model, using a random forest, on the test set.

# fit the combination
predRfGbmTest <- data.frame(predRf = predRfTest, predGbm = predGbmTest,
                            classe=testing$classe)
modCombRfGbm <- train(classe ~ ., model = "rf", data=predRfGbmTest, 
                 trControl = trainCtrl)
# predict on testing
predCombPredTest <- predict(modCombRfGbm, predRfGbmTest)

We use this final model to predict the classes on the validation set, and display the resulting confusion matrix.

# first predict using the random forest and boosted trees models
predRfValidation <- predict(modRf, newdata=testing)
predGbmValidation <- predict(modGbm, newdata=testing)

# feed these predictions into our final model
predCombRfGbmValidation <- predict(modCombRfGbm, 
                              newdata=data.frame(predRf = predRfValidation, 
                                                 predGbm = predGbmValidation))

# confusion matrix
cmCombRfGbmValidation <- confusionMatrix(as.factor(testing$classe), 
                                         predCombRfGbmValidation)
cmCombRfGbmValidation$table

##           Reference
## Prediction    A    B    C    D    E
##          A 1670    3    1    0    0
##          B    5 1133    0    1    0
##          C    0   13 1006    7    0
##          D    0    0   14  949    1
##          E    1    1    0    2 1078

# accuracy, out of sample error, kappa on validation set
accCombRfGbmValidation <- cmCombRfGbmValidation$overall[['Accuracy']]
oseCombRfGbmValidation <- 1-accCombRfGbmValidation
kappaCombRfGbmValidation  <- cmCombRfGbmValidation$overall[['Kappa']]

Predictions

To make the final predictions, we first load the test cases, removing non-available values as we did for the training data file. We then use the previously fitted random forest and boosted trees models to make our first set of predictions, which we then pass to our model that combines the two predictors.

predTestCasesRf <- predict(modRf, newdata=validation)
predTestCasesGbm <- predict(modGbm, newdata=validation)

# prediction
predTestCasesCombModFit <- predict(modCombRfGbm, 
                                   newdata = data.frame(
                                     predRf = predTestCasesRf, 
                                     predGbm = predTestCasesGbm))
predTestCasesCombModFit

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

Assignment

Matteo Gambera

9 aprile 2020