Practical Machine Learning Project

Abstract

This project develops a model that uses data from a fitness device to predict how well a bicep curl was performed. There are 5 classes of performance: (Class A) exactly according to the specification, (Class B) throwing the elbows to the front, (Class C) lifting the dumbbell only halfway, (Class D) lowering the dumbbell only halfway, and (Class E) throwing the hips to the front. The data was graciously provided by this source: http://groupware.les.inf.puc-rio.br/har

Data

The data is downloaded and loaded, with the strings "NA", "#DIV/0!" and the empty string treated as missing values.

url.train <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
url.test <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
download.file(url.train, "./pmlTraining.csv")
download.file(url.test, "./pmlTesting.csv")
training <- read.csv("./pmlTraining.csv", na.strings=c("NA","#DIV/0!",""))
testing <- read.csv("./pmlTesting.csv", na.strings=c("NA","#DIV/0!",""))
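
To illustrate what the na.strings argument does, here is a minimal sketch on a tiny inline CSV. The column values are made up for illustration and are not taken from the real files. Spreadsheet artefacts such as #DIV/0! and empty fields are read as NA, so the affected columns stay numeric.

```r
# Toy CSV with the same kinds of missing-value markers as the real files
# (values are invented for illustration)
csv_text <- "roll,yaw,classe
1.5,#DIV/0!,A
NA,0.2,B
,0.7,C"

toy <- read.csv(text = csv_text, na.strings = c("NA", "#DIV/0!", ""))

str(toy)             # roll and yaw are read as numeric columns
colSums(is.na(toy))  # number of missing values per column
```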

Partition

The labelled data is split 70/30 into train.build, used to fit the models, and test.build, a hold-out set used to validate them and estimate the out of sample error.

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

set.seed(613603)
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
train.build <- training[inTrain,]
test.build <- training[-inTrain,]
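
As a small sketch of what createDataPartition does, the example below runs the same call on the built-in iris data, which stands in for the training set here (Species plays the role of classe; the names idx, build and valid are illustrative). The split is stratified on the outcome, so each class keeps roughly the same share in both pieces.

```r
library(caret)

set.seed(1)
# iris stands in for the training data; Species plays the role of classe
idx   <- createDataPartition(y = iris$Species, p = 0.7, list = FALSE)
build <- iris[idx, ]
valid <- iris[-idx, ]

# Class proportions are (nearly) identical in both pieces
prop.table(table(build$Species))
prop.table(table(valid$Species))
```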

Cleaning the Data

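Variables with near zero variance are identified and removed first, since modelling this data is problematic if they are left in. Next, the first six columns are removed, because they are irrelevant to the prediction and will not improve it. Lastly, columns in which more than 90% of the values are NA are removed. Each filter is derived from train.build and applied identically to test.build and testing.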

## Remove columns with near zero variance
nsv <- nearZeroVar(train.build)
train.build <- train.build[, -nsv]
test.build <- test.build[, -nsv]
testing <- testing[, -nsv]

## Remove first 6 columns that don't make sense for this model
train.build <- train.build[, -(1:6)]
test.build <- test.build[, -(1:6)]
testing <- testing[, -(1:6)]

## Remove columns with mostly NAs
isna <- is.na(train.build)
Cmeans <- colMeans(isna)
train.build <- train.build[Cmeans <= .9]
test.build <- test.build[Cmeans <= .9]
testing <- testing[Cmeans <= .9]
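
The NA filter can be sketched on a toy data frame as follows. The data and the names train_df, new_df, na_frac and keep are invented for illustration. The point is that the keep mask is computed on the training data only and then reused, so every data set ends up with the same columns.

```r
# Toy data: column b is entirely missing, column c is half missing
train_df <- data.frame(a = 1:10,
                       b = NA_real_,
                       c = c(1:5, rep(NA, 5)))
new_df   <- data.frame(a = 11:12, b = c(0.1, 0.2), c = c(7, NA))

na_frac <- colMeans(is.na(train_df))  # fraction of NAs per column
keep    <- na_frac <= 0.9             # decided on the training data only

train_df <- train_df[, keep, drop = FALSE]
new_df   <- new_df[, keep, drop = FALSE]  # same mask, so the columns stay aligned

names(train_df)
names(new_df)
```

The same reasoning applies to the nearZeroVar() step. One caveat worth knowing: if nearZeroVar() were to return an empty vector, negative indexing with it would drop every column, so a guard such as if (length(nsv) > 0) is a sensible precaution.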

Modelling

Many types of models were attempted, but only the successful ones are presented here. We begin by fitting a decision tree.

library(rpart)
set.seed(31834)
rpartFit <- rpart(classe ~ ., method = "class", data = train.build)
predict.Rpart <- predict(rpartFit, test.build, type = "class")
confusionMatrix(predict.Rpart, test.build$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1549  223   21  105   42
##          B   33  639   43   19   69
##          C   43  106  826  148  142
##          D   17   86   63  610   50
##          E   32   85   73   82  779
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7482          
##                  95% CI : (0.7369, 0.7592)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6798          
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9253   0.5610   0.8051   0.6328   0.7200
## Specificity            0.9071   0.9654   0.9097   0.9561   0.9434
## Pos Pred Value         0.7985   0.7958   0.6530   0.7385   0.7412
## Neg Pred Value         0.9683   0.9016   0.9567   0.9300   0.9373
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2632   0.1086   0.1404   0.1037   0.1324
## Detection Prevalence   0.3297   0.1364   0.2150   0.1404   0.1786
## Balanced Accuracy      0.9162   0.7632   0.8574   0.7944   0.8317
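
The headline figures in this output can be checked by hand. The sketch below re-enters the decision tree confusion matrix shown above and recomputes the overall accuracy, the per-class sensitivity and the positive predictive value with base R. The variable names are illustrative.

```r
# Decision tree confusion matrix copied from the output above
# (rows = predicted class, columns = reference class)
cm <- matrix(c(1549, 223,  21, 105,  42,
                 33, 639,  43,  19,  69,
                 43, 106, 826, 148, 142,
                 17,  86,  63, 610,  50,
                 32,  85,  73,  82, 779),
             nrow = 5, byrow = TRUE,
             dimnames = list(Prediction = LETTERS[1:5],
                             Reference  = LETTERS[1:5]))

accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)   # per class: correct / actual members
pos_pred    <- diag(cm) / rowSums(cm)   # per class: correct / predicted members

round(accuracy, 4)     # 0.7482, matching the Accuracy line
round(sensitivity, 4)
round(pos_pred, 4)
```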

library(rattle)

## Rattle: A free graphical interface for data mining with R.
## Version 4.1.0 Copyright (c) 2006-2015 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

suppressWarnings(fancyRpartPlot(rpartFit))

Next we will try a random forest model.

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

set.seed(67484)
rfFit <- randomForest(classe ~., data = train.build)
predict.RF <- predict(rfFit, test.build)
confusionMatrix(predict.RF, test.build$classe)

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1673    6    0    0    0
##          B    0 1133    6    0    0
##          C    1    0 1018   10    0
##          D    0    0    2  954    2
##          E    0    0    0    0 1080
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9954         
##                  95% CI : (0.9933, 0.997)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9942         
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9994   0.9947   0.9922   0.9896   0.9982
## Specificity            0.9986   0.9987   0.9977   0.9992   1.0000
## Pos Pred Value         0.9964   0.9947   0.9893   0.9958   1.0000
## Neg Pred Value         0.9998   0.9987   0.9984   0.9980   0.9996
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2843   0.1925   0.1730   0.1621   0.1835
## Detection Prevalence   0.2853   0.1935   0.1749   0.1628   0.1835
## Balanced Accuracy      0.9990   0.9967   0.9950   0.9944   0.9991

plot(rfFit)

The plot shows that the Random Forest error falls significantly after approximately 50 trees.
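
The numbers behind such a plot can also be read directly from the fitted object. The following is a minimal sketch on the built-in iris data (a stand-in for the training set, with ntree = 200 chosen arbitrarily), using the err.rate component of a randomForest classification fit.

```r
library(randomForest)

set.seed(1)
# iris stands in for the training data
rf <- randomForest(Species ~ ., data = iris, ntree = 200)

# err.rate has one row per forest size; the "OOB" column is the
# out-of-bag error of the forest built from the first k trees
oob <- rf$err.rate[, "OOB"]
oob[c(1, 10, 50, 100, 200)]

plot(rf)  # draws the same curves: the OOB error plus one line per class
```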

NOTE: We also attempted Linear Discriminant Analysis, K-Nearest Neighbor and Gradient Boosted Machine models. However, all of these were resource heavy and provided little or no improvement over the random forest.

Finally, we combine the predictions of both models using a random forest, in an attempt to increase accuracy.

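set.seed(983346)
## Stack the two sets of test.build predictions and fit a random forest on them
combDF <- data.frame(predict.RF, predict.Rpart, classe = test.build$classe)
combFit <- train(classe ~ ., method = "rf", data = combDF)
## Predict from combDF, which holds the predictors the combined model was trained on
predict.comb <- predict(combFit, combDF)
confusionMatrix(predict.comb, test.build$classe)$overall[1]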

##  Accuracy 
## 0.9954121

Accuracy is not improved, however.
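
Note that the combined model is fitted and scored on the same rows of test.build, so its accuracy is not a true out of sample figure. One possible variant, which is not what was done in this report, is a three-way split: fit the base models on one part, fit the combiner on a second part, and score it on a third. The sketch below shows this on the built-in iris data. The names base, blend, hold and stack_features are hypothetical.

```r
library(rpart)
library(randomForest)

set.seed(1)
# Three-way split of a toy data set (iris stands in for the real data)
part  <- sample(rep(1:3, length.out = nrow(iris)))
base  <- iris[part == 1, ]  # used to fit the base models
blend <- iris[part == 2, ]  # used to fit the combiner
hold  <- iris[part == 3, ]  # used for the final evaluation only

tree <- rpart(Species ~ ., data = base, method = "class")
rf   <- randomForest(Species ~ ., data = base)

# Hypothetical helper: turn base-model predictions into features
stack_features <- function(d) {
  data.frame(tree = predict(tree, d, type = "class"),
             rf   = predict(rf, d))
}

blend_df <- cbind(stack_features(blend), Species = blend$Species)
combiner <- randomForest(Species ~ ., data = blend_df)

# Accuracy on rows that neither the base models nor the combiner have seen
hold_pred <- predict(combiner, stack_features(hold))
mean(hold_pred == hold$Species)
```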

Cross Validation and Out of Sample Error Analysis

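In the above models, we used the hold-out set test.build to validate the models and estimate our out of sample error. The out of sample error rate is 25.18% for the decision tree and 0.46% for both the random forest and combined models. Note that the combined model was fitted on test.build itself, so its figure is not a true out of sample estimate.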
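
The error rates follow directly from the accuracies reported by confusionMatrix(), as 1 minus the accuracy. A short worked check (variable names are illustrative):

```r
# Accuracies reported by confusionMatrix() in the sections above
accuracy <- c(decision_tree = 0.7482,
              random_forest = 0.9954,
              combined      = 0.9954121)

# Error rate in percent = 100 * (1 - accuracy)
error_pct <- round(100 * (1 - accuracy), 2)
error_pct
```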

Final Prediction

The random forest therefore appears to give the best predictions: it has the lowest out of sample error rate, combining it with the decision tree brings no improvement, and it uses the least resources among the models with comparable accuracy. We therefore use it to compute our final predictions.

predict.Final <- predict(rfFit, testing)
predict.Final

##  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 
##  B  A  B  A  A  E  D  B  A  A  B  C  B  A  E  E  A  B  B  B 
## Levels: A B C D E