Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
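If the two CSV files are not already in the working directory, they can be fetched first; a minimal sketch (the local file names are an assumption):

# Download the data files if they are not already present (local file names assumed)
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testUrl, destfile = "pml-testing.csv")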
trainData <- read.csv("pml-training.csv", sep = ",", header = TRUE, na.strings = c('#DIV/0', '', 'NA'))
testData <- read.csv("pml-testing.csv", sep = ",", header = TRUE, na.strings = c('#DIV/0', '', 'NA'))
Let’s take a look at the dimensions of the training and testing datasets, and also check the outcome variable classe to see how it is distributed in the training data.
dim(trainData)
## [1] 19622 160
dim(testData)
## [1] 20 160
summary(trainData$classe)
## A B C D E
## 5580 3797 3422 3216 3607
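The five classes are reasonably balanced, with class A (the correct execution of the exercise) the most frequent. A quick proportion check makes this easier to see (a small sketch):

# Show the relative frequency of each classe level
round(prop.table(table(trainData$classe)), 3)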
# Many columns are almost entirely NA; remove every column that contains any NA values
trainData <- trainData[, colSums(is.na(trainData)) == 0]
testData <- testData[, colSums(is.na(testData)) == 0]
# Remove the first 7 columns (row index, user name, timestamps, and window
# indicators), which are identifiers rather than useful predictors
trainData <- trainData[, -c(1:7)]
testData <- testData[, -c(1:7)]
# recheck data dimension
dim(trainData)
## [1] 19622 53
dim(testData)
## [1] 20 53
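Since the NA-heavy columns were dropped from each dataset independently, it is worth confirming that the two cleaned datasets share the same predictor columns; a quick sketch (the only expected difference is the last column, classe in training versus problem_id in testing):

# Columns present in one cleaned dataset but not the other; only the
# outcome column (classe) and the test identifier (problem_id) should differ
setdiff(names(trainData), names(testData))
setdiff(names(testData), names(trainData))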
The training dataset will be split into two parts: 70% will be used to train the models and the remaining 30% will be held out as a validation set for prediction. Below we fit Decision Tree (DT), Random Forest (RF), and Linear Discriminant Analysis (LDA) models to see how they perform.
# Set the seed for reproducibility
set.seed(4321)
# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.4.3
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
# Split the training dataset into training (70%) and validation (30%) sets
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
cvTrainData <- trainData[inTrain,]
cvTestData <- trainData[-inTrain,]
#dim(cvTrainData)
A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node. Let us use this algorithm to predict classe on the validation data.
# Fit a Decision Tree (DT) model on the training split and plot the tree
fitDT <- rpart(classe ~ ., data = cvTrainData, method = "class")
fancyRpartPlot(fitDT)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Predict the outcome variable (classe) on the validation dataset
predDT <- predict(fitDT, cvTestData, type = "class")
# Create a confusion matrix for the predictions on the validation dataset
confDT <- confusionMatrix(predDT, cvTestData$classe)
confDT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1495 182 17 64 24
## B 57 725 88 89 94
## C 44 109 836 131 144
## D 51 72 57 614 51
## E 27 51 28 66 769
##
## Overall Statistics
##
## Accuracy : 0.7543
## 95% CI : (0.7431, 0.7652)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6885
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8931 0.6365 0.8148 0.6369 0.7107
## Specificity 0.9318 0.9309 0.9119 0.9531 0.9642
## Pos Pred Value 0.8389 0.6885 0.6614 0.7266 0.8172
## Neg Pred Value 0.9564 0.9143 0.9589 0.9306 0.9367
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2540 0.1232 0.1421 0.1043 0.1307
## Detection Prevalence 0.3028 0.1789 0.2148 0.1436 0.1599
## Balanced Accuracy 0.9125 0.7837 0.8634 0.7950 0.8375
# Plot the confusion matrix
plot(confDT$table, col = confDT$byClass, main = "Decision Tree Confusion Matrix")
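The estimated out-of-sample error for the decision tree is one minus the validation accuracy, which can be pulled straight from the confusion matrix object:

# Estimated out-of-sample error for the decision tree (about 0.25 here)
1 - confDT$overall["Accuracy"]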
Random forests (or random decision forests) are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees’ habit of overfitting to their training set.
# Fit a Random Forest (RF) model on the training split
# (a caret alternative: fitRF <- train(classe ~ ., data = cvTrainData, method = "rf"))
fitRF <- randomForest(classe ~ ., data = cvTrainData)
plot(fitRF)
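As a quick diagnostic, randomForest can also rank the predictors by importance; a minimal sketch using varImpPlot (showing 10 variables is an arbitrary choice):

# Plot the 10 most important predictors as ranked by the random forest
varImpPlot(fitRF, n.var = 10, main = "Top 10 Predictors")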
# Predict the outcome variable on the validation dataset
predRF <- predict(fitRF, cvTestData)
# Create a confusion matrix for the predictions on the validation dataset
confRF <- confusionMatrix(predRF, cvTestData$classe)
confRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9994, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Plot the confusion matrix
plot(confRF$table, col = confRF$byClass, main = "Random Forest Confusion Matrix")
Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique.
# Fit a Linear Discriminant Analysis (LDA) model on the training split
fitLDA <- train(classe ~ ., data = cvTrainData, method = "lda")
# Predict with the LDA model on the validation dataset
predLDA <- predict(fitLDA, cvTestData)
# Create a confusion matrix for the LDA predictions on the validation dataset
confLDA <- confusionMatrix(predLDA, cvTestData$classe)
confLDA
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1375 176 93 57 47
## B 37 730 95 36 183
## C 137 139 680 110 95
## D 122 45 129 725 107
## E 3 49 29 36 650
##
## Overall Statistics
##
## Accuracy : 0.7069
## 95% CI : (0.6951, 0.7185)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6291
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8214 0.6409 0.6628 0.7521 0.6007
## Specificity 0.9114 0.9260 0.9010 0.9181 0.9756
## Pos Pred Value 0.7866 0.6753 0.5857 0.6427 0.8475
## Neg Pred Value 0.9277 0.9149 0.9268 0.9498 0.9156
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2336 0.1240 0.1155 0.1232 0.1105
## Detection Prevalence 0.2970 0.1837 0.1973 0.1917 0.1303
## Balanced Accuracy 0.8664 0.7835 0.7819 0.8351 0.7882
# Plot the confusion matrix
plot(confLDA$table, col = confLDA$byClass, main = "Linear Discriminant Analysis Confusion Matrix")
The Random Forest (RF) model performs best by a wide margin, with 100% accuracy (95% CI lower bound 99.94%) on the validation set. The Linear Discriminant Analysis (LDA) and Decision Tree (DT) models perform noticeably worse, with approximately 70.7% and 75.4% accuracy, respectively.
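The validation accuracies can also be collected from the three confusion matrix objects for a side-by-side comparison (a small sketch):

# Tabulate validation-set accuracy for each model
data.frame(Model = c("Decision Tree", "Random Forest", "LDA"),
           Accuracy = c(confDT$overall["Accuracy"],
                        confRF$overall["Accuracy"],
                        confLDA$overall["Accuracy"]))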
We have found the Random Forest (RF) model to be the best-fitting model, so it will be used to make the out-of-sample predictions, in this case on the testing dataset of 20 new samples.
predict(fitRF,testData)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
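If each of the 20 predictions needs to be written out as its own text file (as the original course submission required), a small helper along these lines can be used; the function name and file naming scheme here are assumptions, not part of the original analysis:

# Hypothetical helper: write each prediction to its own submission file
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(as.character(preds[i]), file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(predict(fitRF, testData))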