Introduction

Using devices such as the Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified-self movement: a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks.

Participants in this project were asked to perform barbell lifts correctly and incorrectly in five different ways. The goal of this project is to predict the manner in which they did the exercise, which is recorded in the “classe” variable of the training set.

Load Libraries and Datasets

library(knitr)
library(caret)
library(rpart)
library(rpart.plot)
library(rattle)
library(randomForest)
# Load the training and test datasets
training <- read.csv("pml-training.csv")
testing  <- read.csv("pml-testing.csv")
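
A quick look at the dimensions and the outcome distribution can confirm the files loaded as expected (a minimal sketch; output omitted):

dim(training)           # raw training data
dim(testing)            # final test data (20 cases for the quiz)
table(training$classe)  # distribution of the outcome variable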

Data Processing and Cleaning

First, we partition the training data into a training set (70%) and a test set (30%). The provided test dataset (pml-testing.csv) remains untouched and is used only for the final predictions.

set.seed(11111)  # make the partition reproducible
train_partition <- createDataPartition(training$classe, p = 0.7, list = FALSE)
train_set <- training[train_partition, ]
test_set  <- training[-train_partition, ]
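
Because createDataPartition stratifies on classe, the class proportions in both partitions should closely match the original data. A quick check (sketch; output omitted):

# Compare class proportions across the two partitions
round(prop.table(table(train_set$classe)), 3)
round(prop.table(table(test_set$classe)), 3)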

Next, we remove variables that are mostly NA (more than 95% missing values):

# Flag columns that are more than 95% NA, then drop them from both sets
NAs <- sapply(train_set, function(x) mean(is.na(x))) > 0.95
train_set <- train_set[, !NAs]
test_set  <- test_set[, !NAs]
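
To see how aggressive this filter is, the number of dropped columns can be counted (sketch; output omitted):

sum(NAs)        # number of mostly-NA columns flagged
dim(train_set)  # columns remaining after the drop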

Next, we remove the variables with near-zero variance:

# Identify and drop near-zero-variance predictors
nearZeroVariance <- nearZeroVar(train_set)
train_set <- train_set[, -nearZeroVariance]
test_set  <- test_set[, -nearZeroVariance]

Finally, we remove columns 1 to 5, which are identification-only variables (row index, user name, and timestamps):

# Drop the identification-only columns
train_set <- train_set[, -(1:5)]
test_set  <- test_set[, -(1:5)]
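
After these cleaning steps, the training set should be down to 53 predictors plus the classe outcome, consistent with the model summaries below. A quick verification (sketch; output omitted):

dim(train_set)  # expect 54 columns: 53 predictors + classe
dim(test_set)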

Building the Prediction Models

Two models will be used to predict the ‘classe’ variable: a random forest and a generalized boosted model (GBM). The model with the higher accuracy on the held-out test set will then be used for the quiz portion of the assignment.

Random Forest Model

set.seed(11111)
# 3-fold cross-validation for the random forest
controlRandForest <- trainControl(method = "cv", number = 3, verboseIter = FALSE)
modelFitRandForest <- train(classe ~ ., data = train_set, method = "rf",
                            trControl = controlRandForest)
modelFitRandForest$finalModel
## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.25%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 3903    2    0    0    1 0.0007680492
## B    7 2647    3    1    0 0.0041384500
## C    0    5 2390    1    0 0.0025041736
## D    0    0    6 2245    1 0.0031083481
## E    0    1    0    6 2518 0.0027722772
predictRandForest <- predict(modelFitRandForest, newdata = test_set)
confusionMatrixRandForest <- confusionMatrix(predictRandForest, test_set$classe)
confusionMatrixRandForest
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1674    7    0    0    0
##          B    0 1132    2    0    0
##          C    0    0 1024    6    0
##          D    0    0    0  958    0
##          E    0    0    0    0 1082
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9975          
##                  95% CI : (0.9958, 0.9986)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9968          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9939   0.9981   0.9938   1.0000
## Specificity            0.9983   0.9996   0.9988   1.0000   1.0000
## Pos Pred Value         0.9958   0.9982   0.9942   1.0000   1.0000
## Neg Pred Value         1.0000   0.9985   0.9996   0.9988   1.0000
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2845   0.1924   0.1740   0.1628   0.1839
## Detection Prevalence   0.2856   0.1927   0.1750   0.1628   0.1839
## Balanced Accuracy      0.9992   0.9967   0.9984   0.9969   1.0000
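
Though not required for the model comparison, it can be instructive to inspect which predictors the forest relies on most. A minimal sketch using caret’s varImp (the top-20 cutoff is an arbitrary illustrative choice):

# Rank predictors by importance as measured by the fitted random forest
rfImportance <- varImp(modelFitRandForest)
plot(rfImportance, top = 20)  # plot the 20 most influential predictors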

Generalized Boosted Model

set.seed(11111)
# 5-fold cross-validation (one repeat) for the boosted model
controlGBM <- trainControl(method = "repeatedcv", number = 5, repeats = 1)
modelFitGBM <- train(classe ~ ., data = train_set, method = "gbm",
                     trControl = controlGBM, verbose = FALSE)
modelFitGBM$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 53 predictors of which 41 had non-zero influence.
predictGBM <- predict(modelFitGBM, newdata = test_set)
confusionMatrixGBM <- confusionMatrix(predictGBM, test_set$classe)
confusionMatrixGBM
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1669   20    0    0    0
##          B    4 1095    4    1    2
##          C    0   24 1021   15    2
##          D    1    0    0  947   10
##          E    0    0    1    1 1068
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9856          
##                  95% CI : (0.9822, 0.9884)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9817          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9970   0.9614   0.9951   0.9824   0.9871
## Specificity            0.9953   0.9977   0.9916   0.9978   0.9996
## Pos Pred Value         0.9882   0.9901   0.9614   0.9885   0.9981
## Neg Pred Value         0.9988   0.9908   0.9990   0.9965   0.9971
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2836   0.1861   0.1735   0.1609   0.1815
## Detection Prevalence   0.2870   0.1879   0.1805   0.1628   0.1818
## Balanced Accuracy      0.9961   0.9795   0.9933   0.9901   0.9933
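
Before settling on a model, the two held-out accuracies can be compared directly from the confusion-matrix objects computed above (a minimal sketch):

# Extract and compare the overall accuracy of each model
accRF  <- unname(confusionMatrixRandForest$overall["Accuracy"])
accGBM <- unname(confusionMatrixGBM$overall["Accuracy"])
round(c(RandomForest = accRF, GBM = accGBM), 4)  # 0.9975 vs 0.9856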

Predicting Test Dataset Results

From the results above, the Random Forest model (accuracy 0.9975) is more accurate than the Generalized Boosted Model (accuracy 0.9856), so its expected out-of-sample error is about 1 - 0.9975 = 0.25%. Using the Random Forest model to predict the test results:

predictTest <- predict(modelFitRandForest, newdata = testing)
predictTest
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
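
If individual submission files are needed for the quiz, a small helper such as the following could be used; the writePredictions name and the problem_id_N.txt naming scheme are assumptions, not part of the original analysis:

# Hypothetical helper: write each of the 20 predictions to its own file
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(predictTest)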