Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it.

In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways, and to predict the manner in which they did the exercise. This is the "classe" variable in the training set. The machine learning model described here is applied to the 20 test cases in the test data, and the predictions are submitted in the appropriate format to the Course Project Prediction Quiz for automated grading. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
rm(list=ls())
# Set the working directory (adjust the path for your own environment)
setwd("C:/Users/Gracy/Coursera - Data Science Specialization/Course 8 - Practical Machine Learning/Week 4")
fileUrl.train <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
fileUrl.test <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
#download.file(fileUrl.train,destfile = "pml-training.csv")
#download.file(fileUrl.test,destfile = "pml-testing.csv")
library(caret)
library(parallel)
library(doParallel)
library(dplyr)
training <- read.csv("pml-training.csv")
testing <- read.csv("pml-testing.csv")
dim(testing)
## [1] 20 160
dim(training)
## [1] 19622 160
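Before cleaning the data, it can help to look at the outcome being predicted; a minimal sketch, assuming the training data loaded above:
# Distribution of the five exercise classes in the training set
table(training$classe)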
# remove variables with Nearly Zero Variance
NZV <- nearZeroVar(training)
training <- training[, -NZV]
testing <- testing[, -NZV]
dim(training)
## [1] 19622 100
dim(testing)
## [1] 20 100
# remove variables that are mostly NA
AllNA <- sapply(training, function(x) mean(is.na(x))) > 0.95
training <- training[, AllNA==FALSE]
testing <- testing[, AllNA==FALSE]
dim(training)
## [1] 19622 59
dim(testing)
## [1] 20 59
# remove identification only variables (columns 1 to 5)
training <- training[, -(1:5)]
testing <- testing[, -(1:5)]
dim(training)
## [1] 19622 54
dim(testing)
## [1] 20 54
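At this point the training and testing sets should contain the same predictor columns (the test set carries a problem_id column in place of classe). A quick sanity check, as a sketch:
# Columns present in one data set but not the other; ideally only the
# outcome/identifier columns should differ
setdiff(names(training), names(testing))
setdiff(names(testing), names(training))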
To reduce the processing time of the repeated model fits that train() performs during resampling, caret can run the resampling iterations in parallel. This is enabled with the parallel and doParallel packages, as shown below.
#Method: Random Forest (initial fit on all remaining predictors, used to rank variable importance)
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
fit <- train(classe ~ ., method="rf",data=training,trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
varImp() computes the importance of each predictor in the fitted random forest; the 20 most important variables are listed below, and those columns are the ones retained in the training and testing data.
var.imp <- varImp(fit)
var.imp
## rf variable importance
##
## only 20 most important variables shown (out of 53)
##
## Overall
## num_window 100.000
## roll_belt 63.274
## pitch_forearm 39.252
## yaw_belt 30.506
## magnet_dumbbell_z 28.253
## pitch_belt 27.909
## magnet_dumbbell_y 27.693
## roll_forearm 22.104
## accel_dumbbell_y 12.909
## magnet_dumbbell_x 10.314
## accel_forearm_x 9.668
## roll_dumbbell 9.600
## accel_belt_z 8.762
## total_accel_dumbbell 8.407
## accel_dumbbell_z 7.656
## magnet_belt_z 6.767
## magnet_forearm_z 6.500
## magnet_belt_y 6.389
## magnet_belt_x 5.261
## roll_arm 4.819
var.imp.cols.train <- c("num_window","roll_belt", "pitch_forearm","yaw_belt", "magnet_dumbbell_z","pitch_belt", "magnet_dumbbell_y",
"roll_forearm", "accel_dumbbell_y", "magnet_dumbbell_x","accel_forearm_x", "roll_dumbbell",
"accel_belt_z", "total_accel_dumbbell","accel_dumbbell_z", "magnet_belt_z", "magnet_forearm_z",
"magnet_belt_y", "magnet_belt_x", "roll_arm" , "classe")
var.imp.cols.test <- c("num_window","roll_belt", "pitch_forearm","yaw_belt", "magnet_dumbbell_z","pitch_belt", "magnet_dumbbell_y",
"roll_forearm", "accel_dumbbell_y", "magnet_dumbbell_x","accel_forearm_x", "roll_dumbbell",
"accel_belt_z", "total_accel_dumbbell","accel_dumbbell_z", "magnet_belt_z", "magnet_forearm_z",
"magnet_belt_y", "magnet_belt_x", "roll_arm" )
training <- training[,c(var.imp.cols.train)]
testing <- testing[,c(var.imp.cols.test)]
plot(var.imp)
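Rather than hard-coding the column names, the same top predictors could be extracted from the varImp object programmatically. A minimal sketch, assuming the var.imp object computed above (the subsetting lines are left commented out because the data were already subset above):
# Rank predictors by overall importance and keep the names of the top 20
imp <- var.imp$importance
top20 <- rownames(imp)[order(imp$Overall, decreasing = TRUE)][1:20]
top20
# training <- training[, c(top20, "classe")]
# testing  <- testing[, top20]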
The regsubsets() function from the leaps package performs best-subset selection, scoring candidate predictor subsets by residual sum of squares (RSS). We find that beyond roughly the first 15 features the remaining variables contribute little, so we proceed to the final model fits with this reduced predictor set.
#best fit
library(leaps)
# Perform a best fit
bestFit=regsubsets(classe~.,training,nvmax=15)
# Generate a summary of the fit
bfSummary=summary(bestFit)
# Plot the Residual Sum of Squares vs number of variables
plot(bfSummary$rss,xlab="Number of Variables",ylab="RSS",type="l",main="Best fit RSS vs No of features")
# Get the index of the minimum value
a=which.min(bfSummary$rss)
# Mark this in red
points(a,bfSummary$rss[a],col="red",cex=2,pch=20)
# The plot shows that the lowest RSS is reached with all 15 candidate features included; the decrease in RSS levels off as the number of features approaches 15.
#Method: Random Forest
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
finalfit1 <- train(classe ~ ., method="rf",data=training,trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
result1 <- confusionMatrix(finalfit1)
result1
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.4 0.0 0.0 0.0 0.0
## B 0.0 19.3 0.0 0.0 0.0
## C 0.0 0.0 17.4 0.0 0.0
## D 0.0 0.0 0.0 16.3 0.0
## E 0.0 0.0 0.0 0.0 18.3
##
## Accuracy (average) : 0.9979
plot(result1$table, col = result1$byClass, main = "Random Forest")
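It can also be useful to see how the cross-validated accuracy varied over the mtry values that train() searched, and which value was selected; a quick sketch:
# Accuracy across the tuning grid and the chosen mtry value
plot(finalfit1)
finalfit1$bestTune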
#Method: Decision Trees
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
finalfit2 <- train(classe ~ ., method="rpart",data=training,trControl = fitControl)
stopCluster(cluster)
registerDoSEQ()
result2 <- confusionMatrix(finalfit2)
result2
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 24.7 5.6 4.3 4.4 1.5
## B 0.8 7.3 0.8 3.2 2.3
## C 2.8 6.4 12.4 8.4 4.6
## D 0.0 0.0 0.0 0.0 0.0
## E 0.1 0.0 0.0 0.4 10.0
##
## Accuracy (average) : 0.544
plot(result2$table, col = result2$byClass, main = "Decision Trees")
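The single classification tree can be drawn to see why its accuracy is so much lower than the random forest; a sketch, assuming the rpart.plot package is installed:
library(rpart.plot)
# Draw the tree underlying the caret rpart fit
rpart.plot(finalfit2$finalModel)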
#Method: Gradient Boost
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
fitControl <- trainControl(method = "cv", number = 5, allowParallel = TRUE)
finalfit3 <- train(classe ~ ., method="gbm",data=training,trControl = fitControl)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2352
## 2 1.4593 nan 0.1000 0.1633
## 3 1.3573 nan 0.1000 0.1285
## 4 1.2783 nan 0.1000 0.1135
## 5 1.2078 nan 0.1000 0.0852
## 6 1.1542 nan 0.1000 0.0782
## 7 1.1050 nan 0.1000 0.0746
## 8 1.0590 nan 0.1000 0.0585
## 9 1.0226 nan 0.1000 0.0641
## 10 0.9848 nan 0.1000 0.0604
## 20 0.7126 nan 0.1000 0.0338
## 40 0.4573 nan 0.1000 0.0172
## 60 0.3174 nan 0.1000 0.0118
## 80 0.2305 nan 0.1000 0.0042
## 100 0.1741 nan 0.1000 0.0044
## 120 0.1334 nan 0.1000 0.0021
## 140 0.1054 nan 0.1000 0.0024
## 150 0.0930 nan 0.1000 0.0013
stopCluster(cluster)
registerDoSEQ()
result3 <- confusionMatrix(finalfit3)
result3
## Cross-Validated (5 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction A B C D E
## A 28.4 0.1 0.0 0.0 0.0
## B 0.0 19.1 0.1 0.0 0.0
## C 0.0 0.1 17.3 0.2 0.0
## D 0.0 0.0 0.1 16.1 0.1
## E 0.0 0.0 0.0 0.0 18.3
##
## Accuracy (average) : 0.9924
plot(result3$table, col = result3$byClass, main = "Gradient Boost")
The cross-validated accuracy of the three models is:

Random Forest: 0.9979
Decision Tree: 0.544
GBM: 0.9924

Random Forest achieves the highest accuracy and meets the requirement for this project, so that model is used to predict on the test cases.
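As a cross-check, caret's resamples() helper could be used to compare the three fits across their cross-validation folds in one place; a minimal sketch, assuming the three model objects above (the comparison is only strictly fair if the models were trained on the same fold indices, e.g. by setting a common seed or index):
# Collect and summarize the resampling results of the three models
resamps <- resamples(list(RF = finalfit1, CART = finalfit2, GBM = finalfit3))
summary(resamps)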
prediction <- predict(finalfit1,newdata = testing)
prediction
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
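For the quiz submission, each prediction could be written to its own text file; a sketch using a hypothetical helper (the function name and file naming convention are illustrative, not part of the original analysis):
# Write one file per test case: problem_id_1.txt, problem_id_2.txt, ...
write_prediction_files <- function(preds) {
  for (i in seq_along(preds)) {
    write.table(preds[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
write_prediction_files(prediction)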
# The testing set has no "classe" labels, so the out-of-sample error cannot be
# computed by comparing predictions against the truth. It is instead estimated
# from the cross-validated accuracy of the random forest model reported above.
outOfSampleError <- 1 - 0.9979
outOfSampleError
## [1] 0.0021