Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement – a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants who were asked to perform barbell lifts correctly and incorrectly in 5 different ways.
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://web.archive.org/web/20161224072740/http:/groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose please cite them as they have been very generous in allowing their data to be used for this kind of assignment.
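If the two CSV files are not already in the working directory, they can be fetched first; a minimal sketch (the local file names are an assumption):

# Download the data files if they are not already present (local file names assumed)
trainUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
testUrl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(testUrl, destfile = "pml-testing.csv")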
trainData <- read.csv("pml-training.csv", sep = ",", header = TRUE, na.strings = c('#DIV/0', '', 'NA'))
testData <- read.csv("pml-testing.csv", sep = ",", header = TRUE, na.strings = c('#DIV/0', '', 'NA'))
Let’s take a look at the dimensions of the training and testing datasets, and also check the outcome variable classe to see how it is distributed in the training data.
dim(trainData)
## [1] 19622 160
dim(testData)
## [1] 20 160
summary(trainData$classe)
## A B C D E
## 5580 3797 3422 3216 3607
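The five classes are reasonably balanced, with class A (the correct execution of the exercise) the most frequent. A quick proportion check makes this easier to see (a small sketch):

# Show the relative frequency of each classe level
round(prop.table(table(trainData$classe)), 3)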
# Many columns are almost entirely NA; remove every column that contains any NA values
trainData <- trainData[, colSums(is.na(trainData)) == 0]
testData <- testData[, colSums(is.na(testData)) == 0]
# Remove the first 7 columns (row index, user name, timestamps, and window
# indicators), which are identifiers rather than useful predictors
trainData <- trainData[, -c(1:7)]
testData <- testData[, -c(1:7)]
# recheck data dimension
dim(trainData)
## [1] 19622 53
dim(testData)
## [1] 20 53
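Since the NA-heavy columns were dropped from each dataset independently, it is worth confirming that the two cleaned datasets share the same predictor columns; a quick sketch (the only expected difference is the last column, classe in training versus problem_id in testing):

# Columns present in one cleaned dataset but not the other; only the
# outcome column (classe) and the test identifier (problem_id) should differ
setdiff(names(trainData), names(testData))
setdiff(names(testData), names(trainData))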
The training dataset will be split into two parts: 70% will be used to train the models and the remaining 30% will be held out as a validation set for prediction. Below we fit Decision Tree (DT), Random Forest (RF), and Linear Discriminant Analysis (LDA) models to see how they perform.
# Set the seed for reproducibility
set.seed(4321)
# Load necessary libraries
library(caret)
## Warning: package 'caret' was built under R version 3.4.3
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.4.3
library(randomForest)
## Warning: package 'randomForest' was built under R version 3.4.3
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
## Warning: package 'rpart.plot' was built under R version 3.4.3
library(rattle)
## Warning: package 'rattle' was built under R version 3.4.3
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
# Split the training dataset into training (70%) and validation (30%) sets
inTrain <- createDataPartition(trainData$classe, p = 0.7, list = FALSE)
cvTrainData <- trainData[inTrain,]
cvTestData <- trainData[-inTrain,]
#dim(cvTrainData)
A decision tree is a flow-chart-like structure, where each internal (non-leaf) node denotes a test on an attribute, each branch represents the outcome of a test, and each leaf (or terminal) node holds a class label. The topmost node in a tree is the root node. Let us use this algorithm to predict classe on the validation data.
# Fit a Decision Tree (DT) model on the training split and plot the tree
fitDT <- rpart(classe ~ ., data = cvTrainData, method = "class")
fancyRpartPlot(fitDT)
## Warning: labs do not fit even at cex 0.15, there may be some overplotting
# Predict the outcome variable (classe) on the validation dataset
predDT <- predict(fitDT, cvTestData, type = "class")
# Create a confusion matrix for the predictions on the validation dataset
confDT <- confusionMatrix(predDT, cvTestData$classe)
confDT
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1495 182 17 64 24
## B 57 725 88 89 94
## C 44 109 836 131 144
## D 51 72 57 614 51
## E 27 51 28 66 769
##
## Overall Statistics
##
## Accuracy : 0.7543
## 95% CI : (0.7431, 0.7652)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6885
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8931 0.6365 0.8148 0.6369 0.7107
## Specificity 0.9318 0.9309 0.9119 0.9531 0.9642
## Pos Pred Value 0.8389 0.6885 0.6614 0.7266 0.8172
## Neg Pred Value 0.9564 0.9143 0.9589 0.9306 0.9367
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2540 0.1232 0.1421 0.1043 0.1307
## Detection Prevalence 0.3028 0.1789 0.2148 0.1436 0.1599
## Balanced Accuracy 0.9125 0.7837 0.8634 0.7950 0.8375
# Plot the confusion matrix
plot(confDT$table, col = confDT$byClass, main = "Decision Tree Confusion Matrix")
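The estimated out-of-sample error for the decision tree is one minus the validation accuracy, which can be pulled straight from the confusion matrix object:

# Estimated out-of-sample error for the decision tree (about 0.25 here)
1 - confDT$overall["Accuracy"]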
Random forests (or random decision forests) are an ensemble learning method for classification, regression, and other tasks that operates by constructing a multitude of decision trees at training time and outputting the class that is the mode of the classes (classification) or the mean prediction (regression) of the individual trees. Random forests correct for decision trees’ habit of overfitting to their training set.
# Fit a Random Forest (RF) model on the training split
# (a caret alternative: fitRF <- train(classe ~ ., data = cvTrainData, method = "rf"))
fitRF <- randomForest(classe ~ ., data = cvTrainData)
plot(fitRF)
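As a quick diagnostic, randomForest can also rank the predictors by importance; a minimal sketch using varImpPlot (showing 10 variables is an arbitrary choice):

# Plot the 10 most important predictors as ranked by the random forest
varImpPlot(fitRF, n.var = 10, main = "Top 10 Predictors")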
# Predict the outcome variable on the validation dataset
predRF <- predict(fitRF, cvTestData)
# Create a confusion matrix for the predictions on the validation dataset
confRF <- confusionMatrix(predRF, cvTestData$classe)
confRF
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1674 0 0 0 0
## B 0 1139 0 0 0
## C 0 0 1026 0 0
## D 0 0 0 964 0
## E 0 0 0 0 1082
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9994, 1)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 1.0000 1.0000 1.0000 1.0000 1.0000
## Specificity 1.0000 1.0000 1.0000 1.0000 1.0000
## Pos Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Neg Pred Value 1.0000 1.0000 1.0000 1.0000 1.0000
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Balanced Accuracy 1.0000 1.0000 1.0000 1.0000 1.0000
# Plot the confusion matrix
plot(confRF$table, col = confRF$byClass, main = "Random Forest Confusion Matrix")
Logistic regression is a classification algorithm traditionally limited to only two-class classification problems. If you have more than two classes then Linear Discriminant Analysis is the preferred linear classification technique.
# Fit a Linear Discriminant Analysis (LDA) model on the training split
fitLDA <- train(classe ~ ., data = cvTrainData, method = "lda")
# Predict with the LDA model on the validation dataset
predLDA <- predict(fitLDA, cvTestData)
# Create a confusion matrix for the LDA predictions on the validation dataset
confLDA <- confusionMatrix(predLDA, cvTestData$classe)
confLDA
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1375 176 93 57 47
## B 37 730 95 36 183
## C 137 139 680 110 95
## D 122 45 129 725 107
## E 3 49 29 36 650
##
## Overall Statistics
##
## Accuracy : 0.7069
## 95% CI : (0.6951, 0.7185)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.6291
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.8214 0.6409 0.6628 0.7521 0.6007
## Specificity 0.9114 0.9260 0.9010 0.9181 0.9756
## Pos Pred Value 0.7866 0.6753 0.5857 0.6427 0.8475
## Neg Pred Value 0.9277 0.9149 0.9268 0.9498 0.9156
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2336 0.1240 0.1155 0.1232 0.1105
## Detection Prevalence 0.2970 0.1837 0.1973 0.1917 0.1303
## Balanced Accuracy 0.8664 0.7835 0.7819 0.8351 0.7882
# Plot the confusion matrix
plot(confLDA$table, col = confLDA$byClass, main = "Linear Discriminant Analysis Confusion Matrix")
The Random Forest (RF) model performs best by a wide margin, with 100% accuracy (95% CI lower bound 99.94%) on the validation set. The Linear Discriminant Analysis (LDA) and Decision Tree (DT) models perform noticeably worse, with approximately 70.7% and 75.4% accuracy, respectively.
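The validation accuracies can also be collected from the three confusion matrix objects for a side-by-side comparison (a small sketch):

# Tabulate validation-set accuracy for each model
data.frame(Model = c("Decision Tree", "Random Forest", "LDA"),
           Accuracy = c(confDT$overall["Accuracy"],
                        confRF$overall["Accuracy"],
                        confLDA$overall["Accuracy"]))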
We have found the Random Forest (RF) model to be the best-fitting model, so it will be used to make the out-of-sample predictions, in this case on the testing dataset of 20 new samples.
predict(fitRF,testData)
## 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20
## B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
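If each of the 20 predictions needs to be written out as its own text file (as the original course submission required), a small helper along these lines can be used; the function name and file naming scheme here are assumptions, not part of the original analysis:

# Hypothetical helper: write each prediction to its own submission file
writePredictions <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(as.character(preds[i]), file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
writePredictions(predict(fitRF, testData))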