Practical Machine Learning Project

OVERVIEW

The goal of this project is to predict the manner in which 6 participants performed a weight-lifting exercise, as described below. The outcome is the “classe” variable in the training set. The machine learning algorithm described here is applied to the 20 test cases available in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading.

Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement, a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it. This project uses data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).

Environment Setup and Reading Files into R

rm(list=ls())

#Set the working environment. Switch to required directory
setwd("C:/Users/Gracy/Coursera - Data Science Specialization/Course 8 - Practical Machine Learning/Week 4")

fileUrl.train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl.test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"

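# Download the data only if it is not already in the working directory
if (!file.exists("pml-training.csv")) download.file(fileUrl.train, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv")) download.file(fileUrl.test, destfile = "pml-testing.csv")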

library(caret)
library(parallel)
library(doParallel)
library(dplyr)

training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(testing)

## [1]  20 160

dim(training)

## [1] 19622   160
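Copies of this dataset are often reported to contain empty strings and spreadsheet error markers such as #DIV/0! in addition to NA. The following is a minimal sketch, assuming that is true of your copy, of how read.csv() can be told to treat all of these as missing values. The inline CSV and its three columns are made up for illustration, and applying this option to the real files may change the column counts shown in this report.

```r
# Hypothetical three-row CSV mixing the missing-value markers described above
csv_text <- 'roll_belt,kurtosis_roll_belt,classe
1.41,#DIV/0!,A
1.42,,A
1.48,NA,B'

# na.strings lists every token that should be read as NA
toy <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

str(toy)
colMeans(is.na(toy))  # fraction of missing values per column
```

A column that consists only of such markers is read as entirely NA, which lets a later missing-value filter catch it.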

Clean up the Data

# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(training)
training <- training[, -NZV]
testing  <- testing[, -NZV]
dim(training)

## [1] 19622   100

dim(testing)

## [1]  20 100
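To make the near-zero-variance step less of a black box, here is a rough base-R sketch of the two criteria that nearZeroVar documents: the ratio of the most common value to the second most common, and the percentage of distinct values. The function name is_near_zero_var and the toy vectors are hypothetical, the thresholds mirror caret's documented defaults (95/5 and 10), and the sketch is not a substitute for the caret implementation.

```r
# Sketch of the two near-zero-variance criteria (not caret's own code)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common value
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as a share of rows
  freq_ratio > freq_cut && pct_unique < unique_cut
}

# Hypothetical columns: one almost constant, one well spread
almost_constant <- c(rep("no", 99), "yes")
well_spread     <- rep(1:50, times = 2)

is_near_zero_var(almost_constant)  # TRUE
is_near_zero_var(well_spread)      # FALSE
```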

# remove variables that are mostly NA
AllNA    <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[, AllNA==FALSE]
testing  <- testing[, AllNA==FALSE]
dim(training)

## [1] 19622    59

dim(testing)

## [1] 20 59

# remove identification only variables (columns 1 to 5)
training <- training[, -(1:5)]
testing  <- testing[, -(1:5)]
dim(training)

## [1] 19622    54

dim(testing)

## [1] 20 54
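Each cleaning step above decides which columns to drop from the training set and then applies the same selection to the testing set. A small sketch of that pattern on made-up data follows; the data frames train_toy and test_toy and their column names are hypothetical, and the sketch assumes both sets list their columns in the same order.

```r
# Hypothetical toy training and testing sets with identical columns
train_toy <- data.frame(id = 1:4, mostly_na = c(NA, NA, NA, NA), x = c(0.1, 0.2, 0.3, 0.4))
test_toy  <- data.frame(id = 5:6, mostly_na = c(1.5, NA), x = c(0.5, 0.6))

# Decide which columns to keep using the training set only
keep <- sapply(train_toy, function(col) mean(is.na(col))) <= 0.95

# Apply the same logical mask to both sets so their columns stay aligned
train_clean <- train_toy[, keep, drop = FALSE]
test_clean  <- test_toy[, keep, drop = FALSE]

names(train_clean)
identical(names(train_clean), names(test_clean))
```

Deriving the mask from the training set alone keeps information from the test cases out of the preprocessing decisions.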

Making an initial fit with all predictors for feature selection

Fitting a random forest with cross-validation refits the underlying model many times, which is slow on a single core. caret's train() function can run these resampling iterations in parallel once a parallel backend is registered; here the parallel and doParallel packages provide that backend.

#Method: Random Forest
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

fit <- train(classe ~ ., method="rf",data=training,trControl = fitControl)

stopCluster(cluster)
registerDoSEQ()
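The create, register, and stop sequence above is repeated for every model in this report. One possible way to make it more robust is a small wrapper that always shuts the cluster down, even when the fitting code fails. The helper name with_cluster and its arguments are hypothetical, and the sketch uses only the parallel package with a trivial task in place of train().

```r
library(parallel)

# Hypothetical helper: run a function with a cluster and always shut the cluster down
with_cluster <- function(fun, n_workers = 2) {
  cluster <- makeCluster(n_workers)
  on.exit(stopCluster(cluster), add = TRUE)  # runs even if fun() throws an error
  fun(cluster)
}

# Usage with a trivial parallel task instead of train()
squares <- with_cluster(function(cl) parLapply(cl, 1:4, function(i) i^2))
unlist(squares)
```

For the caret models, the function passed in would call registerDoParallel() on the cluster before train(), and registerDoSEQ() would follow once the helper returns.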

Feature Selection using Variable Importance method

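varImp() ranks the predictors by their importance in the fitted random forest and prints the 20 most important ones. Those 20 predictors (plus classe in the training set) are the columns retained in the training and testing data.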

var.imp <- varImp(fit)
var.imp

## rf variable importance
## 
##   only 20 most important variables shown (out of 53)
## 
##                      Overall
## num_window           100.000
## roll_belt             63.274
## pitch_forearm         39.252
## yaw_belt              30.506
## magnet_dumbbell_z     28.253
## pitch_belt            27.909
## magnet_dumbbell_y     27.693
## roll_forearm          22.104
## accel_dumbbell_y      12.909
## magnet_dumbbell_x     10.314
## accel_forearm_x        9.668
## roll_dumbbell          9.600
## accel_belt_z           8.762
## total_accel_dumbbell   8.407
## accel_dumbbell_z       7.656
## magnet_belt_z          6.767
## magnet_forearm_z       6.500
## magnet_belt_y          6.389
## magnet_belt_x          5.261
## roll_arm               4.819

# The 20 most important predictors, in the order reported by varImp
var.imp.cols.test <- c("num_window", "roll_belt", "pitch_forearm", "yaw_belt",
                       "magnet_dumbbell_z", "pitch_belt", "magnet_dumbbell_y",
                       "roll_forearm", "accel_dumbbell_y", "magnet_dumbbell_x",
                       "accel_forearm_x", "roll_dumbbell", "accel_belt_z",
                       "total_accel_dumbbell", "accel_dumbbell_z", "magnet_belt_z",
                       "magnet_forearm_z", "magnet_belt_y", "magnet_belt_x",
                       "roll_arm")
# The training set additionally keeps the outcome column
var.imp.cols.train <- c(var.imp.cols.test, "classe")
training <- training[, var.imp.cols.train]
testing <- testing[, var.imp.cols.test]

plot(var.imp)
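The predictor names in this section are typed in by hand from the printed importance table. A hedged alternative is to derive them from the importance scores directly. The sketch below uses a made-up data frame as a stand-in for varImp(fit)$importance, assuming it has one row per predictor and an Overall column as the printed output suggests; the four rows reuse values from that table.

```r
# Hypothetical stand-in for varImp(fit)$importance, using a few values from the printed table
importance <- data.frame(
  Overall = c(63.274, 100.000, 4.819, 39.252),
  row.names = c("roll_belt", "num_window", "roll_arm", "pitch_forearm")
)

# Rank predictors by importance and keep the top n names
n_top <- 3
ranked <- rownames(importance)[order(importance$Overall, decreasing = TRUE)]
top_predictors <- head(ranked, n_top)
top_predictors

# Column sets for the two data frames
cols_test  <- top_predictors
cols_train <- c(top_predictors, "classe")
```

Selecting by name this way avoids transcription errors and makes it easy to change how many predictors are kept.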

Best Fit

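The regsubsets() function of the leaps library performs best-subset selection: for each subset size up to nvmax, it finds the set of predictors with the lowest residual sum of squares (RSS). We find that after the first 15 features the remaining variables are not of much importance, so we proceed to the final fit methods. The code that follows does not subset the training data again, so the final models are still fitted on the 20 predictors selected above.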

#best fit
library(leaps)
# Perform a best fit
bestFit=regsubsets(classe~.,training,nvmax=15)

# Generate a summary of the fit
bfSummary=summary(bestFit)

# Plot the Residual Sum of Squares vs number of variables 
plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features")
# Get the index of the minimum value

a=which.min(bfSummary$rss)
# Mark this in red
points(a,bfSummary$rss[a],col="red",cex=2,pch=20)

#The plot below shows that the lowest RSS occurs with all 15 features included, which is the largest subset size considered (nvmax=15). Larger subsets are not evaluated by this call.
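The best-subset step compares subset sizes by RSS. The sketch below illustrates a general property that matters when reading such a plot: adding a predictor to a linear model never increases the RSS, so the minimum tends to fall at the largest size considered. It uses the built-in mtcars data rather than the project data, and the chosen formulas are arbitrary. Criteria that penalize model size, such as the adjr2, cp, and bic components also returned by summary() on a regsubsets object, are one possible alternative.

```r
# Nested linear models on the built-in mtcars data (not the project data)
formulas <- list(
  mpg ~ wt,
  mpg ~ wt + hp,
  mpg ~ wt + hp + qsec,
  mpg ~ wt + hp + qsec + drat
)

# Residual sum of squares for each model size
rss <- sapply(formulas, function(f) sum(resid(lm(f, data = mtcars))^2))
rss

# RSS does not increase as predictors are added, so which.min() picks the largest model
which.min(rss)
all(diff(rss) <= 0)
```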

Random Forest ( with final set of predictors)

#Method: Random Forest
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

finalfit1 <- train(classe ~ ., method="rf",data=training,trControl = fitControl)

stopCluster(cluster)
registerDoSEQ()

result1 <- confusionMatrix(finalfit1)
result1

## Cross-Validated (5 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.4  0.0  0.0  0.0  0.0
##          B  0.0 19.3  0.0  0.0  0.0
##          C  0.0  0.0 17.4  0.0  0.0
##          D  0.0  0.0  0.0 16.3  0.0
##          E  0.0  0.0  0.0  0.0 18.3
##                             
##  Accuracy (average) : 0.9979

plot(result1$table, col = result1$byClass, main = "Random Forest")

Decision Trees

#Method: Decision Trees
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

finalfit2 <- train(classe ~ ., method="rpart",data=training,trControl = fitControl)

stopCluster(cluster)
registerDoSEQ()
result2 <- confusionMatrix(finalfit2)
result2

## Cross-Validated (5 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 24.7  5.6  4.3  4.4  1.5
##          B  0.8  7.3  0.8  3.2  2.3
##          C  2.8  6.4 12.4  8.4  4.6
##          D  0.0  0.0  0.0  0.0  0.0
##          E  0.1  0.0  0.0  0.4 10.0
##                            
##  Accuracy (average) : 0.544

plot(result2$table, col = result2$byClass, main = "Decision Trees")
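The reported accuracy can be recovered from the confusion matrix itself, which also shows where the decision tree fails. The sketch below copies the cell percentages from the Decision Trees output above into a matrix; the object names are illustrative only.

```r
# Cell percentages copied from the Decision Trees confusion matrix above
classes <- c("A", "B", "C", "D", "E")
cm <- matrix(
  c(24.7, 5.6,  4.3, 4.4,  1.5,
     0.8, 7.3,  0.8, 3.2,  2.3,
     2.8, 6.4, 12.4, 8.4,  4.6,
     0.0, 0.0,  0.0, 0.0,  0.0,
     0.1, 0.0,  0.0, 0.4, 10.0),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

# Accuracy is the share of cases on the diagonal
accuracy <- sum(diag(cm)) / sum(cm)
round(accuracy, 3)  # 0.544

# Per-class recall: diagonal divided by each reference column total
round(diag(cm) / colSums(cm), 2)
```

The row for class D is all zeros, meaning the tree never predicts D, so its recall for that class is 0.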

Gradient Boosting

#Method: Gradient Boost
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)

finalfit3 <- train(classe ~ ., method="gbm",data=training,trControl = fitControl)

## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094             nan     0.1000    0.2352
##      2        1.4593             nan     0.1000    0.1633
##      3        1.3573             nan     0.1000    0.1285
##      4        1.2783             nan     0.1000    0.1135
##      5        1.2078             nan     0.1000    0.0852
##      6        1.1542             nan     0.1000    0.0782
##      7        1.1050             nan     0.1000    0.0746
##      8        1.0590             nan     0.1000    0.0585
##      9        1.0226             nan     0.1000    0.0641
##     10        0.9848             nan     0.1000    0.0604
##     20        0.7126             nan     0.1000    0.0338
##     40        0.4573             nan     0.1000    0.0172
##     60        0.3174             nan     0.1000    0.0118
##     80        0.2305             nan     0.1000    0.0042
##    100        0.1741             nan     0.1000    0.0044
##    120        0.1334             nan     0.1000    0.0021
##    140        0.1054             nan     0.1000    0.0024
##    150        0.0930             nan     0.1000    0.0013

stopCluster(cluster)
registerDoSEQ()
result3 <- confusionMatrix(finalfit3)
result3

## Cross-Validated (5 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##           Reference
## Prediction    A    B    C    D    E
##          A 28.4  0.1  0.0  0.0  0.0
##          B  0.0 19.1  0.1  0.0  0.0
##          C  0.0  0.1 17.3  0.2  0.0
##          D  0.0  0.0  0.1 16.1  0.1
##          E  0.0  0.0  0.0  0.0 18.3
##                             
##  Accuracy (average) : 0.9924

plot(result3$table, col = result3$byClass, main = "Gradient Boost")

The cross-validated accuracies of the three classification methods above are:

Random Forest : 0.9979
Decision Tree : 0.544
GBM : 0.9924
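As a small illustration of the comparison, the three accuracies can be held in a named vector and the best model picked programmatically. The values are the ones reported in this section; the variable names are illustrative only.

```r
# Cross-validated accuracies reported in this section
accuracy <- c("Random Forest" = 0.9979, "Decision Tree" = 0.544, "GBM" = 0.9924)

# Sort from best to worst and name the winner
sort(accuracy, decreasing = TRUE)
best_model <- names(which.max(accuracy))
best_model  # "Random Forest"

# Corresponding estimated error rates
1 - accuracy
```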

Prediction with Best Model - Random Forest

Random Forest achieves the highest cross-validated accuracy of the three models and meets the accuracy required for this project, so it is used to predict the 20 test cases.

prediction <- predict(finalfit1,newdata = testing)
prediction

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
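For submission or record keeping, the predictions can be paired with their test case numbers and saved. This is only a sketch: the column names problem_id and classe in the output file and the use of a temporary file are assumptions, not a format prescribed by the quiz.

```r
# Predictions copied from the output above
prediction <- factor(
  c("B","A","B","A","A","E","D","B","A","A","B","C","B","A","E","E","A","B","B","B"),
  levels = c("A", "B", "C", "D", "E")
)

# Pair each prediction with its test case number (hypothetical column names)
answers <- data.frame(problem_id = seq_along(prediction), classe = prediction)

# Write to a temporary CSV file; replace tempfile() with a real path as needed
out_file <- tempfile(fileext = ".csv")
write.csv(answers, out_file, row.names = FALSE)

table(prediction)  # how often each class is predicted
```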

Out of Sample Error

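The testing set has no classe column (testing$classe is NULL), so the out-of-sample error cannot be computed by comparing the predictions with test labels. It is instead estimated from the 5-fold cross-validated accuracy of the Random Forest model.

# Cross-validated accuracy of the selected Random Forest tuning
outOfSampleError.accuracy <- max(finalfit1$results$Accuracy)
outOfSampleError.accuracy

# Estimated out-of-sample error
outOfSampleError <- 1 - outOfSampleError.accuracy
outOfSampleError

With the cross-validated accuracy of 0.9979 reported above for Random Forest, the estimated out-of-sample error is about 0.0021 (0.21%).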
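To put an error estimate in the context of the 20 test cases, the following worked arithmetic takes the cross-validated Random Forest accuracy of 0.9979 reported earlier as the estimate of out-of-sample accuracy. It assumes the test cases are classified independently and that the cross-validated figure carries over to them, which is a simplification.

```r
# Cross-validated accuracy of the Random Forest model reported earlier
cv_accuracy <- 0.9979
n_test <- 20

# Estimated out-of-sample error rate
oos_error <- 1 - cv_accuracy

# Expected number of misclassified test cases
expected_errors <- n_test * oos_error

# Probability that all 20 predictions are correct, assuming independent cases
p_all_correct <- cv_accuracy^n_test

round(c(oos_error = oos_error, expected_errors = expected_errors,
        p_all_correct = p_all_correct), 4)
```

Under these assumptions the expected number of errors on the 20 cases is well below one, and the chance of getting all 20 right is roughly 0.96.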