Practical Machine Learning

Background

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, your goal will be to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants. They were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here (see the section on the Weight Lifting Exercise Dataset).

      The goal of this project is to predict the manner in which six participants did their exercise.

Preparing Data

## First we define a function to compute the proportion of missing data for each
## variable, then we delete variables with more than 60% missing data

percent_miss <- function(x) {
      sum(is.na(x))/length(x)
}

too.miss <- as.numeric(apply(training, 2, percent_miss) > 0.6)
training <- training[, too.miss == 0]

## removing near-zero-variance predictors (nearZeroVar() is from caret)
library(caret)
nZeroVar <- nearZeroVar(training, saveMetrics = TRUE)
training <- training[, nZeroVar$nzv == FALSE]
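
The missing-data filter above can be tried on a small made-up data frame. This is only a sketch in base R: the data frame `toy` and its columns are hypothetical, and `colMeans(is.na(...))` is used as an equivalent shortcut for the `percent_miss` helper.

```r
## Toy data frame (hypothetical): column b is 75% missing, column c is 25% missing
toy <- data.frame(
  a = c(1, 2, 3, 4),
  b = c(NA, NA, NA, 4),
  c = c("x", NA, "y", "z"),
  stringsAsFactors = FALSE
)

## Proportion of missing values per column
miss_prop <- colMeans(is.na(toy))
print(miss_prop)

## Keep only the columns with at most 60% missing values
toy_kept <- toy[, miss_prop <= 0.6, drop = FALSE]
print(names(toy_kept))
```

Computing the proportions column by column avoids `apply()`, which first converts the whole data frame to a matrix.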

## removing other variables that are not useful as predictors
training$X <- NULL 
training$cvtd_timestamp <- NULL


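## defining a training and a test set (about 70% / 30%)
library(dplyr)

index <- rbinom(nrow(training), 1, prob = 0.7)

## the test set is built first, from the full data, so that the index still
## lines up with the rows and the two sets do not overlap
testing <- training[index == 0, ]
training <- training[index == 1, ]

## removing missing values from the outcome in the test set
testing <- testing %>%
      filter(!is.na(classe))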
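
When a single index vector is used for the split, both subsets need to be drawn from the same full data frame before either is overwritten; otherwise the index no longer lines up with the rows. A minimal sketch on made-up data, where `toy`, the seed, and the `id` column are assumptions for illustration:

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

## Hypothetical data: 100 rows with an id and a five-level outcome
toy <- data.frame(id = 1:100,
                  classe = factor(sample(LETTERS[1:5], 100, replace = TRUE)))

## One Bernoulli draw per row: 1 = training, 0 = testing
index <- rbinom(nrow(toy), 1, prob = 0.7)

## Both subsets are taken from the full data frame
toy_train <- toy[index == 1, ]
toy_test  <- toy[index == 0, ]

## The two sets share no rows and together cover the data
length(intersect(toy_train$id, toy_test$id))   # 0
nrow(toy_train) + nrow(toy_test) == nrow(toy)  # TRUE
```

Checking that the intersection of row identifiers is empty is a cheap guard against leakage between the training and test sets.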

Decision tree

library(rpart)

mod1 <- rpart(classe ~., data = training, 
              method = "class", control = rpart.control(cp = 0.0001))

p1 <- predict(mod1, testing, type = "class")
confusionMatrix(p1, testing$classe)

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1147    1    0    0    3
         B    4  817    9    1    2
         C    0    3  684    1    1
         D    0    0    1  687    5
         E    0    0    1    3  752

Overall Statistics
                                          
               Accuracy : 0.9915          
                 95% CI : (0.9882, 0.9941)
    No Information Rate : 0.2792          
    P-Value [Acc > NIR] : < 2.2e-16       
                                          
                  Kappa : 0.9893          
 Mcnemar's Test P-Value : NA              

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9965   0.9951   0.9842   0.9928   0.9856
Specificity            0.9987   0.9952   0.9985   0.9983   0.9988
Pos Pred Value         0.9965   0.9808   0.9927   0.9913   0.9947
Neg Pred Value         0.9987   0.9988   0.9968   0.9985   0.9967
Prevalence             0.2792   0.1992   0.1686   0.1679   0.1851
Detection Rate         0.2783   0.1982   0.1659   0.1667   0.1824
Detection Prevalence   0.2792   0.2021   0.1672   0.1681   0.1834
Balanced Accuracy      0.9976   0.9951   0.9914   0.9955   0.9922
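
The headline statistics can be recomputed by hand from the confusion matrix printed above, which helps when reading the output. The sketch below copies the counts from that output into base R; the variable names are ours, not part of caret.

```r
## Confusion matrix copied from the output above
## (rows = prediction, columns = reference)
cm <- matrix(c(1147,   1,   0,   0,   3,
                  4, 817,   9,   1,   2,
                  0,   3, 684,   1,   1,
                  0,   0,   1, 687,   5,
                  0,   0,   1,   3, 752),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

n <- sum(cm)

## Accuracy: share of cases on the diagonal
accuracy <- sum(diag(cm)) / n

## Sensitivity per class: correct predictions over the reference (column) totals
sensitivity <- diag(cm) / colSums(cm)

## Cohen's kappa: agreement corrected for the agreement expected by chance
expected  <- sum(rowSums(cm) * colSums(cm)) / n^2
kappa_hat <- (accuracy - expected) / (1 - expected)

round(accuracy, 4)     # 0.9915
round(sensitivity, 4)
round(kappa_hat, 4)    # 0.9893
```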

Support Vector Machine

library(e1071)

mod2 <- svm(classe ~., data = training, scale = TRUE)
p2 <- predict(mod2, testing, type = "class")

confusionMatrix(p2, testing$classe)

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1134   43    2    1    0
         B    5  758   24    0    3
         C   11   18  664   46   16
         D    0    0    5  645   18
         E    1    2    0    0  726

Overall Statistics
                                         
               Accuracy : 0.9527         
                 95% CI : (0.9458, 0.959)
    No Information Rate : 0.2792         
    P-Value [Acc > NIR] : < 2.2e-16      
                                         
                  Kappa : 0.9402         
 Mcnemar's Test P-Value : NA             

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            0.9852   0.9233   0.9554   0.9321   0.9515
Specificity            0.9845   0.9903   0.9734   0.9933   0.9991
Pos Pred Value         0.9610   0.9595   0.8795   0.9656   0.9959
Neg Pred Value         0.9942   0.9811   0.9908   0.9864   0.9891
Prevalence             0.2792   0.1992   0.1686   0.1679   0.1851
Detection Rate         0.2751   0.1839   0.1611   0.1565   0.1761
Detection Prevalence   0.2863   0.1917   0.1832   0.1621   0.1769
Balanced Accuracy      0.9849   0.9568   0.9644   0.9627   0.9753
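
With `scale = TRUE`, `svm()` standardizes the variables before fitting, which matters when predictors are measured on very different scales. The effect of that kind of standardization can be sketched with base R's `scale()` on made-up data; the data frame and its column names are hypothetical, and this illustrates the transformation only, not the internals of `svm()`.

```r
set.seed(1)  # arbitrary seed for reproducibility

## Two hypothetical predictors on very different scales
toy <- data.frame(accel = rnorm(50, mean = 500, sd = 120),
                  gyro  = rnorm(50, mean = 0.2, sd = 0.05))

## Center each column to mean 0 and rescale it to standard deviation 1
scaled <- scale(toy)

round(colMeans(scaled), 10)  # column means after scaling
apply(scaled, 2, sd)         # column standard deviations after scaling
```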

Random Forest

library(randomForest)

mod3 <- randomForest(classe ~., data = training, ntree = 200, mtry = 4)

p3 <- predict(mod3, testing)

confusionMatrix(p3, testing$classe)

Confusion Matrix and Statistics

          Reference
Prediction    A    B    C    D    E
         A 1151    0    0    0    0
         B    0  821    0    0    0
         C    0    0  695    0    0
         D    0    0    0  692    0
         E    0    0    0    0  763

Overall Statistics
                                     
               Accuracy : 1          
                 95% CI : (0.9991, 1)
    No Information Rate : 0.2792     
    P-Value [Acc > NIR] : < 2.2e-16  
                                     
                  Kappa : 1          
 Mcnemar's Test P-Value : NA         

Statistics by Class:

                     Class: A Class: B Class: C Class: D Class: E
Sensitivity            1.0000   1.0000   1.0000   1.0000   1.0000
Specificity            1.0000   1.0000   1.0000   1.0000   1.0000
Pos Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
Neg Pred Value         1.0000   1.0000   1.0000   1.0000   1.0000
Prevalence             0.2792   0.1992   0.1686   0.1679   0.1851
Detection Rate         0.2792   0.1992   0.1686   0.1679   0.1851
Detection Prevalence   0.2792   0.1992   0.1686   0.1679   0.1851
Balanced Accuracy      1.0000   1.0000   1.0000   1.0000   1.0000

plot(mod3)

As the random forest reaches an accuracy of 1 on the test set, we retain it and apply it to the validation set.

Our final predictions for the 20 samples in the validation set are the following:
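
A perfect hold-out accuracy is worth cross-checking with a second estimate, for example k-fold cross-validation. The sketch below is not the analysis from this report: it uses the built-in `iris` data and an `rpart` tree as stand-ins, and the seed and the number of folds are arbitrary assumptions.

```r
library(rpart)

set.seed(42)  # arbitrary seed for a reproducible fold assignment
k <- 5

## Assign each row of iris to one of k folds at random
folds <- sample(rep(1:k, length.out = nrow(iris)))

## For each fold: fit on the other folds, predict the held-out fold
fold_acc <- sapply(1:k, function(i) {
  fit  <- rpart(Species ~ ., data = iris[folds != i, ], method = "class")
  pred <- predict(fit, iris[folds == i, ], type = "class")
  mean(pred == iris$Species[folds == i])
})

fold_acc        # accuracy on each held-out fold
mean(fold_acc)  # cross-validated accuracy
```

Because every row is held out exactly once, the averaged accuracy does not depend on a single random split.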

validation <- read.csv("pml-testing.csv", header = TRUE,
                       na.strings=c("NA","#DIV/0!",""))

knitr::kable(t(predict(mod3, validation)))

1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16	17	18	19	20
B	A	B	A	A	E	D	B	A	A	B	C	B	A	E	E	A	B	B	B
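
The `na.strings` argument in the `read.csv` call above tells R which text markers to read as missing values. A minimal sketch with a temporary file follows; the file contents and column names are made up for illustration.

```r
## Write a small hypothetical CSV containing the three missing-value markers
tmp <- tempfile(fileext = ".csv")
writeLines(c("id,roll_belt,kurtosis_roll_belt",
             "1,1.41,NA",
             "2,1.42,#DIV/0!",
             "3,1.48,",
             "4,1.45,-1.2"),
           tmp)

## All three markers are read as NA, so the column stays numeric
toy <- read.csv(tmp, header = TRUE, na.strings = c("NA", "#DIV/0!", ""))

str(toy)
colSums(is.na(toy))

unlink(tmp)  # remove the temporary file
```

Without `"#DIV/0!"` in `na.strings`, that column would be read as text rather than as numbers.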

Final Project

Fabio Paderi

17/1/2018
