In this document, we predict how six participants perform various types of exercises, as described in the Background section, by modelling the “classe” variable in the training dataset. The resulting machine learning algorithm is then applied to the test dataset, and the predictions are submitted to the online Course Project Prediction Quiz.
From the dataset’s authors’ website we learn how the data was gathered. An excerpt reads:
“Six young health participants were asked to perform one set of 10 repetitions of the Unilateral Dumbbell Biceps Curl in five different fashions: exactly according to the specification (Class A), throwing the elbows to the front (Class B), lifting the dumbbell only halfway (Class C), lowering the dumbbell only halfway (Class D) and throwing the hips to the front (Class E).
Class A corresponds to the specified execution of the exercise, while the other 4 classes correspond to common mistakes. Participants were supervised by an experienced weight lifter to make sure the execution complied to the manner they were supposed to simulate. The exercises were performed by six male participants aged between 20-28 years, with little weight lifting experience. We made sure that all participants could easily simulate the mistakes in a safe and controlled manner by using a relatively light dumbbell (1.25kg)."
Full source:
Ugulino, W.; Cardador, D.; Vega, K.; Velloso, E.; Milidiu, R.; Fuks, H. Wearable Computing: Accelerometers’ Data Classification of Body Postures and Movements. Proceedings of the 21st Brazilian Symposium on Artificial Intelligence (SBIA 2012), Advances in Artificial Intelligence. Lecture Notes in Computer Science, pp. 52-61. Curitiba, PR: Springer Berlin/Heidelberg, 2012. ISBN 978-3-642-34458-9. DOI: 10.1007/978-3-642-34459-6_6.
The environment is first cleared of any previously assigned variables, the appropriate libraries are loaded into RStudio, and a seed is set for reproducibility.
rm(list=ls())  # clear the workspace of any previous objects
library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
library(corrplot)
library(gbm)
set.seed(291)  # fix the random seed for reproducibility
The datasets are then downloaded, and the training set is divided with a 70/30 split into a training subset and a testing subset. The downloaded test set is left untouched, reserved only for the predictions for the aforementioned quiz.
UrlTrain<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
UrlTest<-"http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
train <- read.csv(url(UrlTrain))
test <- read.csv(url(UrlTest))
InTrain <- createDataPartition(train$classe, p=0.7, list=FALSE)  # stratified 70/30 split on classe
TrainSet <- train[InTrain, ]
TestSet <- train[-InTrain, ]
dim(TrainSet)
## [1] 13737 160
dim(TestSet)
## [1] 5885 160
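As a quick sanity check (our own addition), createDataPartition() performs a stratified split, so the class proportions should be nearly identical in the two subsets:
round(prop.table(table(TrainSet$classe)), 3)  # class proportions in the training subset
round(prop.table(table(TestSet$classe)), 3)   # should closely match the line above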
Given the dimensions of the datasets (160 variables), we decide to clean them up by removing variables with:
1. Near Zero Variance
2. Mostly NA values
3. Identification values only (the first five columns)
nzv <- nearZeroVar(TrainSet)  # indices of near-zero-variance predictors
TrainSet <- TrainSet[, -nzv]
TestSet <- TestSet[, -nzv]
dim(TrainSet)
## [1] 13737 102
dim(TestSet)
## [1] 5885 102
NAs <- sapply(TrainSet, function(x) mean(is.na(x))) > 0.95  # flag columns that are more than 95% NA
TrainSet <- TrainSet[, !NAs]
TestSet <- TestSet[, !NAs]
dim(TrainSet)
## [1] 13737 59
dim(TestSet)
## [1] 5885 59
TrainSet <- TrainSet[, -(1:5)]  # drop the identification columns (row index, user name, timestamps)
TestSet <- TestSet[, -(1:5)]
dim(TrainSet)
## [1] 13737 54
dim(TestSet)
## [1] 5885 54
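A final check confirms that no missing values survive the clean-up (for this dataset the remaining columns happen to be complete):
sum(is.na(TrainSet))  # expected to be 0 after dropping the mostly-NA columns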
With more manageable datasets, we now perform a correlation analysis between the variables to see whether any strong relationships stand out.
CorAn <- cor(TrainSet[, -54])  # correlation matrix of the 53 predictors (classe excluded)
corrplot(CorAn, order = "FPC", method = "color", type = "lower",
tl.cex = 0.8, tl.col = rgb(0, 0, 0))
The more correlated two variables are, the more saturated the colour in the matrix. If we exclude the trivial correlations (e.g. accel_belt_z with itself, along the diagonal), there aren’t many other correlations of note, so we keep all the remaining predictors.
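To quantify this rather than eyeball the plot, caret’s findCorrelation() can list the predictors above a chosen cutoff; a minimal sketch, where the 0.75 cutoff is an arbitrary choice of ours:
HighCor <- findCorrelation(CorAn, cutoff = 0.75)  # column indices with |r| above the cutoff
names(TrainSet)[HighCor]                          # names of the highly correlated predictors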
For this assignment, we will build models with three different methods: Random Forest, Decision Tree, and Generalised Boosted Model (GBM). The three models will then be run on the testing subset, and the one with the highest accuracy will be used on the test dataset for the quiz. We will also include a confusion matrix at the end of each model to help visualise its accuracy.
First, we fit the Random Forest model, using 3-fold cross-validation to choose the tuning parameter.
set.seed(291)
ControlRF <- trainControl(method="cv", number=3, verboseIter=FALSE)
ModFitRF <- train(classe ~ ., data=TrainSet, method="rf",
trControl=ControlRF)
ModFitRF$finalModel
##
## Call:
## randomForest(x = x, y = y, mtry = param$mtry)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 27
##
## OOB estimate of error rate: 0.24%
## Confusion matrix:
## A B C D E class.error
## A 3905 0 0 0 1 0.0002560164
## B 7 2648 3 0 0 0.0037622272
## C 0 7 2389 0 0 0.0029215359
## D 0 0 9 2243 0 0.0039964476
## E 0 1 0 5 2519 0.0023762376
Then, we run the model on the testing subset.
PredRF <- predict(ModFitRF, newdata=TestSet)
ConfMatRF <- confusionMatrix(PredRF, TestSet$classe)
ConfMatRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 1 0 0 0
## B 0 1137 1 0 0
## C 0 0 1025 6 0
## D 0 1 0 958 8
## E 0 0 0 0 1074
##
## Overall Statistics
##
## Accuracy : 0.9971
## 95% CI : (0.9954, 0.9983)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9963
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 0.9982 0.9990 0.9938 0.9926
## Specificity 0.9998 0.9998 0.9988 0.9982 1.0000
## Pos Pred Value 0.9994 0.9991 0.9942 0.9907 1.0000
## Neg Pred Value 1.0000 0.9996 0.9998 0.9988 0.9983
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1932 0.1742 0.1628 0.1825
## Detection Prevalence 0.2846 0.1934 0.1752 0.1643 0.1825
## Balanced Accuracy 0.9999 0.9990 0.9989 0.9960 0.9963
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatRF$table, col = ConfMatRF$byClass,
main = paste("Random Forest - Accuracy =",
round(ConfMatRF$overall['Accuracy'], 4)))
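To see which sensor readings drive the Random Forest’s predictions, caret’s varImp() ranks the predictors; a minimal sketch (showing the top 20 is our arbitrary choice):
ImpRF <- varImp(ModFitRF)  # variable importance from the fitted caret model
plot(ImpRF, top = 20)      # plot the 20 most influential predictors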
Next, we fit the Decision Tree model.
set.seed(291)
ModFitDT <- rpart(classe ~ ., data=TrainSet, method="class")
fancyRpartPlot(ModFitDT)
Then, we run the model on the testing subset.
PredDT <- predict(ModFitDT, newdata=TestSet, type="class")
ConfMatDT <- confusionMatrix(PredDT, TestSet$classe)
ConfMatDT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1499 253 39 72 21
## B 32 633 32 17 17
## C 13 122 888 99 20
## D 86 121 57 686 149
## E 44 10 10 90 875
##
## Overall Statistics
##
## Accuracy : 0.7784
## 95% CI : (0.7676, 0.789)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7189
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8955 0.5558 0.8655 0.7116 0.8087
## Specificity 0.9086 0.9794 0.9477 0.9161 0.9679
## Pos Pred Value 0.7956 0.8659 0.7776 0.6242 0.8503
## Neg Pred Value 0.9563 0.9018 0.9709 0.9419 0.9574
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2547 0.1076 0.1509 0.1166 0.1487
## Detection Prevalence 0.3201 0.1242 0.1941 0.1867 0.1749
## Balanced Accuracy 0.9020 0.7676 0.9066 0.8138 0.8883
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatDT$table, col = ConfMatDT$byClass,
main = paste("Decision Tree - Accuracy =",
round(ConfMatDT$overall['Accuracy'], 4)))
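The tree’s accuracy might improve slightly with pruning at the complexity parameter that minimises the cross-validated error; a minimal sketch using rpart’s own cp table (ModFitDTPruned is a hypothetical name of ours):
printcp(ModFitDT)  # cross-validated error for each complexity parameter (cp)
BestCP <- ModFitDT$cptable[which.min(ModFitDT$cptable[, "xerror"]), "CP"]
ModFitDTPruned <- prune(ModFitDT, cp = BestCP)  # prune at the cp minimising xerror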
Lastly, we fit the Generalised Boosted Model, using 5-fold cross-validation.
set.seed(291)
ControlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
ModFitGBM <- train(classe ~ ., data=TrainSet, method = "gbm",
trControl = ControlGBM, verbose = FALSE)
ModFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 53 had non-zero influence.
Then, we run the model on the testing subset.
PredGBM <- predict(ModFitGBM, newdata=TestSet)
ConfMatGBM <- confusionMatrix(PredGBM, TestSet$classe)
ConfMatGBM
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1669 9 0 1 0
## B 5 1123 7 2 0
## C 0 5 1016 15 1
## D 0 2 2 946 20
## E 0 0 1 0 1061
##
## Overall Statistics
##
## Accuracy : 0.9881
## 95% CI : (0.985, 0.9907)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.985
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9970 0.9860 0.9903 0.9813 0.9806
## Specificity 0.9976 0.9971 0.9957 0.9951 0.9998
## Pos Pred Value 0.9940 0.9877 0.9797 0.9753 0.9991
## Neg Pred Value 0.9988 0.9966 0.9979 0.9963 0.9956
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2836 0.1908 0.1726 0.1607 0.1803
## Detection Prevalence 0.2853 0.1932 0.1762 0.1648 0.1805
## Balanced Accuracy 0.9973 0.9915 0.9930 0.9882 0.9902
Finally, we plot the confusion matrix in a more visually pleasing (and intuitive) way.
plot(ConfMatGBM$table, col = ConfMatGBM$byClass,
main = paste("GBM - Accuracy =", round(ConfMatGBM$overall['Accuracy'], 4)))
The accuracies of the three models on the testing subset are:
1. Random Forest: 0.9971
2. Decision Tree: 0.7784
3. Generalised Boosted Model: 0.9881
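The same comparison can be pulled directly from the confusion matrix objects; a minimal sketch:
Accuracies <- c(RF  = ConfMatRF$overall[["Accuracy"]],
                DT  = ConfMatDT$overall[["Accuracy"]],
                GBM = ConfMatGBM$overall[["Accuracy"]])
round(sort(Accuracies, decreasing = TRUE), 4)  # best model first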
Since the Random Forest model is the most accurate, with an expected out-of-sample error of roughly 1 - 0.9971 = 0.0029 (0.29%), we apply it to the test dataset to predict the answers needed for the aforementioned quiz.
PredictTest <- predict(ModFitRF, newdata=test)
PredictTest
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
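For submission, the predictions can be written to individual text files, one per quiz question; a minimal sketch with a hypothetical helper (WritePredictions and the one-file-per-question format are our own assumptions, not part of the course tooling):
WritePredictions <- function(preds) {
  # write each prediction to its own file, e.g. problem_id_1.txt
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
WritePredictions(PredictTest)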