Course 4: Project: Exercise Machine Learning

INTRODUCTION

This is the Coursera Practical Machine Learning course project. The goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants to predict how an exercise was performed, which is recorded in the classe variable. We are provided with both training and testing data.

PROJECT INITIALIZATION

The packages required by the project are loaded first.

library(ggplot2)
library(GGally)

## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2

library(caret)

## Loading required package: lattice

library(rpart)
library(rattle)

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.4.0 Copyright (c) 2006-2020 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

library(randomForest)

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:rattle':
## 
##     importance

## The following object is masked from 'package:ggplot2':
## 
##     margin

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following object is masked from 'package:randomForest':
## 
##     combine

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(gbm)

## Loaded gbm 2.1.8

DATA COLLECTION

The following step reads the data from the source URLs into R data frames for further analysis. The data provided consist of a training file and a testing file.

train_input <-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv",
                    header = TRUE)
test_input <-read.csv("https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv",
                   header = TRUE)
dim(train_input)

## [1] 19622   160

dim(test_input)

## [1]  20 160

CLEANSING DATA

Here we tidy the data by removing every column that contains at least one missing value (NA). We then drop the first seven columns, which are not used as predictors.

train_data <- train_input[, colSums(is.na(train_input)) == 0] # to remove col with NAs
test_data <- test_input[, colSums(is.na(test_input)) == 0]
dim(train_data)

## [1] 19622    93

dim(test_data)

## [1] 20 60

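# Drop the first seven columns (row index, user name, timestamps, window
# markers), which are bookkeeping fields rather than sensor measurements
train_set <- train_data[, -c(1:7)]
valid_set <- test_data[, -c(1:7)]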
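
The two cleaned data sets keep different numbers of columns (93 versus 60) even though both files start with 160. One plausible reason, stated here as an assumption about the raw files rather than something shown in the output above, is that some training columns hold empty strings or spreadsheet error markers such as "#DIV/0!", which is.na() does not treat as missing. The sketch below uses a small made-up CSV (csv_text and its column names are hypothetical) to show how the na.strings argument of read.csv changes which columns survive the same filter.

```r
# Hypothetical miniature of the raw data: 'y' is all NA, 'z' mixes a
# blank, a spreadsheet error marker, and one real number.
csv_text <- "id,x,y,z
1,0.5,NA,
2,0.7,NA,#DIV/0!
3,0.9,NA,1.2"

# Default parsing: only the literal string NA is read as missing.
raw <- read.csv(text = csv_text, header = TRUE)
raw_kept <- raw[, colSums(is.na(raw)) == 0, drop = FALSE]

# Stricter parsing: blanks and "#DIV/0!" are also read as missing.
strict <- read.csv(text = csv_text, header = TRUE,
                   na.strings = c("NA", "", "#DIV/0!"))
strict_kept <- strict[, colSums(is.na(strict)) == 0, drop = FALSE]

names(raw_kept)     # 'z' survives because "" is not NA
names(strict_kept)  # 'z' is dropped along with 'y'
```

If the same effect applies to the real files, passing na.strings when reading pml-training.csv would remove those sparse columns at this stage instead of leaving them for the near-zero-variance filter later on.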

PREDICTION ANALYSIS

For prediction we split train_set into a training set (80% of the rows) and a testing set (the remaining 20%). We use createDataPartition from the R package caret to create the partition.

set.seed(1590)
intrain <- createDataPartition(y = train_set$classe,
                               p = 0.8, 
                               list = FALSE)
trainData <- train_set[intrain,]
testData <- train_set[-intrain,]
dim(trainData)

## [1] 15699    86

dim(testData)

## [1] 3923   86
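
createDataPartition samples within each level of the outcome, so the class mix is roughly the same in both pieces. The base R sketch below illustrates that idea on a made-up outcome vector; the object names and class sizes are hypothetical, and this is a simplified illustration, not caret's actual implementation.

```r
set.seed(1590)

# Hypothetical outcome vector with unequal class sizes.
classe <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class separately.
rows_by_class <- split(seq_along(classe), classe)
train_idx <- sort(unlist(
  lapply(rows_by_class, function(rows) {
    rows[sample.int(length(rows), size = round(0.8 * length(rows)))]
  }),
  use.names = FALSE
))

train_y <- classe[train_idx]
test_y  <- classe[-train_idx]

# Class proportions are preserved in both pieces.
prop.table(table(train_y))
prop.table(table(test_y))
```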

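# Remove near-zero-variance predictors, identified on the training split only
cols <- nearZeroVar(x = trainData)
if (length(cols) > 0) {  # guard: indexing with -integer(0) would drop every column
  trainData <- trainData[, -cols]
  testData <- testData[, -cols]
}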
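
nearZeroVar flags a predictor when its most common value dominates the second most common one and it has few distinct values; the caret defaults are a frequency ratio above 95/5 and fewer than 10% unique values. The sketch below recomputes those two statistics in base R on made-up columns. The helper nzv_stats and the toy data are hypothetical and only approximate what caret does (for example, caret handles constant columns separately).

```r
# Hypothetical helper: the two statistics behind the near-zero-variance rule.
nzv_stats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Ratio of the most frequent value's count to the runner-up's count.
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  # Distinct values as a percentage of the number of rows.
  pct_unique <- 100 * length(counts) / length(x)
  c(freq_ratio = freq_ratio, pct_unique = pct_unique)
}

toy <- data.frame(
  varied          = seq(0.1, 10, length.out = 100),  # 100 distinct values
  nearly_constant = c(rep(0, 98), 1, 2)              # 98 zeros, two others
)

stats <- sapply(toy, nzv_stats)
stats

# Apply the default cutoffs: freqCut = 95/5 and uniqueCut = 10.
flagged <- names(which(stats["freq_ratio", ] > 95 / 5 &
                         stats["pct_unique", ] < 10))
flagged
```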

Method 1: Random Forest

Here we fit a random forest to the training split using caret's default resampling settings.

rf_mdl <- train(classe ~ .,
                data = trainData,
                method = "rf")
rf_mdl$finalModel

## 
## Call:
##  randomForest(x = x, y = y, mtry = param$mtry) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 27
## 
##         OOB estimate of  error rate: 0.59%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4462    2    0    0    0 0.0004480287
## B   18 3009   10    1    0 0.0095457538
## C    0   14 2715    9    0 0.0084002922
## D    0    2   22 2546    3 0.0104935873
## E    0    2    2    7 2875 0.0038115038
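
The out-of-bag (OOB) error rate and the class.error column can both be recovered from the confusion matrix printed above. The sketch below re-enters those counts by hand (rows are the actual class, columns the predicted class) and recomputes the two quantities; the object names are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the OOB confusion matrix above.
oob <- matrix(c(4462,    2,    0,    0,    0,
                  18, 3009,   10,    1,    0,
                   0,   14, 2715,    9,    0,
                   0,    2,   22, 2546,    3,
                   0,    2,    2,    7, 2875),
              nrow = 5, byrow = TRUE,
              dimnames = list(actual = classes, predicted = classes))

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: share of all observations off the diagonal.
oob_error <- 1 - sum(diag(oob)) / sum(oob)

class_error
round(100 * oob_error, 2)  # percentage, comparable to the 0.59% above
```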

Next, we evaluate the model on the held-out testData.

rf_pred <- predict(object = rf_mdl, newdata = testData)
rf_cm <- confusionMatrix(rf_pred, as.factor(testData$classe))
rf_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1111    7    0    0    0
##          B    1  752    1    1    0
##          C    3    0  680    5    4
##          D    0    0    3  637    4
##          E    1    0    0    0  713
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9924          
##                  95% CI : (0.9891, 0.9948)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9903          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9955   0.9908   0.9942   0.9907   0.9889
## Specificity            0.9975   0.9991   0.9963   0.9979   0.9997
## Pos Pred Value         0.9937   0.9960   0.9827   0.9891   0.9986
## Neg Pred Value         0.9982   0.9978   0.9988   0.9982   0.9975
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2832   0.1917   0.1733   0.1624   0.1817
## Detection Prevalence   0.2850   0.1925   0.1764   0.1642   0.1820
## Balanced Accuracy      0.9965   0.9949   0.9952   0.9943   0.9943
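
For readers who want to see where the headline figures come from, the sketch below rebuilds the random forest confusion matrix above by hand (rows are predictions, columns are reference labels) and recomputes accuracy, the no-information rate, and two of the per-class statistics. The object names are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the random forest confusion matrix above.
cm <- matrix(c(1111,   7,   0,   0,   0,
                  1, 752,   1,   1,   0,
                  3,   0, 680,   5,   4,
                  0,   0,   3, 637,   4,
                  1,   0,   0,   0, 713),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = classes, reference = classes))

accuracy    <- sum(diag(cm)) / sum(cm)    # correct predictions / all cases
nir         <- max(colSums(cm)) / sum(cm) # always guessing the largest class
sensitivity <- diag(cm) / colSums(cm)     # correct / actual members of class
pos_pred    <- diag(cm) / rowSums(cm)     # correct / predicted as that class

round(accuracy, 4)
round(nir, 4)
round(sensitivity, 4)
round(pos_pred, 4)
```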

Method 2: GBM

Here we fit a gradient boosting model (GBM), tuned with 3-fold cross-validation.

set.seed(1111)
gbm_trctrl <- trainControl(method = "repeatedcv",
                           repeats = 1, 
                           number = 3)
gbm_mdl <- train(classe ~ .,
                 data = trainData,
                 method = "gbm",
                 trControl = gbm_trctrl)
gbm_mdl$finalModel
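
The trainControl call above asks for repeated cross-validation with number = 3 and repeats = 1, which amounts to a single round of 3-fold cross-validation. The base R sketch below shows the mechanics on 12 made-up row indices: each row is held out exactly once and used for fitting in the other two folds. The names are hypothetical, and unlike caret this simple version does not stratify the folds by class.

```r
set.seed(1111)

n_rows  <- 12  # hypothetical number of training rows
n_folds <- 3

# Assign each row to one of the folds at random, in equal numbers.
fold_id <- sample(rep(seq_len(n_folds), length.out = n_rows))

for (f in seq_len(n_folds)) {
  held_out <- which(fold_id == f)  # rows used to score the model
  fit_rows <- which(fold_id != f)  # rows used to fit the model
  cat("Fold", f, ": fit on", length(fit_rows),
      "rows, evaluate on", length(held_out), "rows\n")
}
```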

Next, we evaluate the model on the held-out testData.

gbm_pred <- predict(object = gbm_mdl, newdata = testData)
gbm_cm <- confusionMatrix(gbm_pred, as.factor(testData$classe))
gbm_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1091   28    0    0    2
##          B   12  704   26    3    6
##          C   12   25  644   22    6
##          D    1    2   12  613    9
##          E    0    0    2    5  698
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9559         
##                  95% CI : (0.949, 0.9621)
##     No Information Rate : 0.2845         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9442         
##                                          
##  Mcnemar's Test P-Value : 0.0002073      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9776   0.9275   0.9415   0.9533   0.9681
## Specificity            0.9893   0.9851   0.9799   0.9927   0.9978
## Pos Pred Value         0.9732   0.9374   0.9083   0.9623   0.9901
## Neg Pred Value         0.9911   0.9827   0.9876   0.9909   0.9929
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2781   0.1795   0.1642   0.1563   0.1779
## Detection Prevalence   0.2858   0.1914   0.1807   0.1624   0.1797
## Balanced Accuracy      0.9835   0.9563   0.9607   0.9730   0.9830
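
The Kappa statistic reported above compares the observed accuracy with the agreement expected by chance given the row and column totals. The sketch below recomputes it from the GBM confusion matrix, entered by hand; the object names are illustrative.

```r
classes <- c("A", "B", "C", "D", "E")

# Counts copied from the GBM confusion matrix above.
cm <- matrix(c(1091,  28,   0,   0,   2,
                 12, 704,  26,   3,   6,
                 12,  25, 644,  22,   6,
                  1,   2,  12, 613,   9,
                  0,   0,   2,   5, 698),
             nrow = 5, byrow = TRUE,
             dimnames = list(prediction = classes, reference = classes))

n        <- sum(cm)
observed <- sum(diag(cm)) / n                     # observed agreement
expected <- sum(rowSums(cm) * colSums(cm)) / n^2  # agreement by chance
kappa    <- (observed - expected) / (1 - expected)

round(observed, 4)
round(kappa, 4)
```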

Method 3: Decision Tree

Here we fit a single classification tree with rpart and plot it.

set.seed(3333)
dt_mdl <- rpart(classe ~., data = trainData, method = "class")
fancyRpartPlot(dt_mdl)

## Warning: labs do not fit even at cex 0.15, there may be some overplotting

Next, we evaluate the model on the held-out testData.

dt_pred <- predict(dt_mdl, newdata = testData, type = "class")
dt_cm <- confusionMatrix(dt_pred, as.factor(testData$classe))
dt_cm

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 969 129  15  46  11
##          B  35 435  36  50  61
##          C  39  82 538  93  75
##          D  31  48  46 395  32
##          E  42  65  49  59 542
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7339          
##                  95% CI : (0.7197, 0.7477)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.6629          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.8683   0.5731   0.7865   0.6143   0.7517
## Specificity            0.9284   0.9425   0.9108   0.9521   0.9329
## Pos Pred Value         0.8282   0.7050   0.6505   0.7156   0.7160
## Neg Pred Value         0.9466   0.9020   0.9528   0.9264   0.9435
## Prevalence             0.2845   0.1935   0.1744   0.1639   0.1838
## Detection Rate         0.2470   0.1109   0.1371   0.1007   0.1382
## Detection Prevalence   0.2982   0.1573   0.2108   0.1407   0.1930
## Balanced Accuracy      0.8983   0.7578   0.8487   0.7832   0.8423
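
The 95% CI line in the output above can be approximated from the counts alone. The sketch below assumes the interval is an exact binomial one, which is what base R's binom.test returns; the number of correct predictions is the sum of the diagonal of the decision tree confusion matrix above.

```r
# Diagonal of the decision tree confusion matrix above.
correct <- sum(c(969, 435, 538, 395, 542))
total   <- 3923  # rows in testData

accuracy <- correct / total

# Exact (Clopper-Pearson) 95% interval for a binomial proportion.
ci <- binom.test(correct, total, conf.level = 0.95)$conf.int

round(accuracy, 4)
round(ci, 4)
```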

The overall accuracy of each model on testData is collected into one data frame for comparison:

acc_cm <- data.frame(rf_cm$overall[1], gbm_cm$overall[1], dt_cm$overall[1])
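
As an alternative layout, the same comparison can be held in a named, sorted table that also shows the implied error rate. The sketch below uses the accuracy values reported in this document as literals, so it runs without the fitted models; summary_tbl and its column names are hypothetical.

```r
# Accuracy on testData, copied from the results reported in this document.
accuracy <- c(RandomForest = 0.9923528,
              Gbm          = 0.9559011,
              DecisionTree = 0.7338771)

summary_tbl <- data.frame(
  model    = names(accuracy),
  accuracy = unname(accuracy),
  error    = 1 - unname(accuracy)  # estimated out-of-sample error
)

# Best model first.
summary_tbl <- summary_tbl[order(-summary_tbl$accuracy), ]
summary_tbl
```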

RUNNING WITH TEST DATA

Each fitted model is now applied to the validation set (valid_set, built from pml-testing.csv) to produce predictions for its 20 cases.

rf_pred_valid <- predict(rf_mdl, newdata = valid_set)

gbm_pred_valid <- predict(gbm_mdl, newdata = valid_set)

dt_pred_valid <- predict(dt_mdl, newdata = valid_set, type = "class")

pred_output <- data.frame(rf_pred_valid, 
                          gbm_pred_valid,
                          dt_pred_valid)
headings <- c( "RandomForest", "Gbm", "DecisionTree")
names(pred_output) <- headings

CONCLUSION

From the above analysis we can compare the accuracy of the three models on the held-out testData, which serves as an estimate of their out-of-sample performance. The validation set (valid_set) is used only to generate predictions.

acc_cm

##          rf_cm.overall.1. gbm_cm.overall.1. dt_cm.overall.1.
## Accuracy        0.9923528         0.9559011        0.7338771

Random forest had the highest accuracy on the held-out testData, 0.9923528, which corresponds to an estimated out-of-sample error of about 0.76%.

The predicted output for the validation set is as below:

pred_output

##    RandomForest Gbm DecisionTree
## 1             B   B            B
## 2             A   A            A
## 3             B   B            E
## 4             A   A            D
## 5             A   A            A
## 6             E   E            C
## 7             D   D            D
## 8             B   B            A
## 9             A   A            A
## 10            A   A            A
## 11            B   B            C
## 12            C   C            E
## 13            B   B            C
## 14            A   A            A
## 15            E   E            E
## 16            E   E            E
## 17            A   A            A
## 18            B   B            B
## 19            B   B            B
## 20            B   B            B
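
One quick way to compare the three sets of predictions is to measure how often the models agree. The sketch below re-enters the predictions from the table above by hand and computes pairwise agreement rates; the vector names are illustrative.

```r
# Predictions copied from the table above, cases 1 to 20.
rf  <- c("B", "A", "B", "A", "A", "E", "D", "B", "A", "A",
         "B", "C", "B", "A", "E", "E", "A", "B", "B", "B")
gbm <- rf  # the Gbm column is identical to the RandomForest column
dt  <- c("B", "A", "E", "D", "A", "C", "D", "A", "A", "A",
         "C", "E", "C", "A", "E", "E", "A", "B", "B", "B")

# Share of the 20 cases on which two models give the same class.
agree_rf_gbm <- mean(rf == gbm)
agree_rf_dt  <- mean(rf == dt)

agree_rf_gbm
agree_rf_dt

# Cases where the decision tree departs from the random forest.
which(rf != dt)
```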

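Thus, the random forest was the most effective method, with higher accuracy on the held-out data than both the generalized boosting model and the decision tree. The random forest and GBM predictions agree on all 20 validation cases, while the decision tree differs on several.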