Executive Summary:
Initially loaded the data and removed the first 7 columns of the dataset (‘pt’), which are not required, then removed the columns with more than 60% NAs using ‘cleanData’.
Created ‘trainset’ to train the models and ‘validset’ to validate them, before applying the algorithms to ‘testset’. ‘Decision Trees’, ‘GBM’ & ‘RF’ methods were used for the analysis. Of these, GBM and RF are close, with the highest accuracies on ‘validset’ (both > 95%). Used 10-fold cross-validation for all 3 methods, passed via the ‘trControl’ argument of the ‘train’ function.
Used parallel processing for faster training. Finally cleaned NAs in the test data the same way as for the ‘pt’ dataset and predicted with Decision Trees, GBM & RF on ‘testset’. GBM and RF gave output with the same accuracy, so the error rate is 100 - 99.54 = 0.46, i.e. 0.46%.
Applied to the 20 test cases, the predictions were 100% correct: B A B A A E D B A A B C B A E E A B B B.
Loading required libraries for data analysis
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(rpart.plot)
## Loading required package: rpart
No exploratory analysis is presented, since the main focus is on machine learning.
pt = read.csv("pml-training.csv", na.strings = c("NA","NaN","NULL","#DIV/0!",""," ")) # treat Excel division errors and blank fields as NA
#pt = read.csv("pmltrain.csv")
ptbackup = pt
dim(pt)
## [1] 19622 160
#glimpse(pt)
classe = factor(pt$classe) # keep the response as a factor before the predictors are coerced to numeric
class(classe)
## [1] "factor"
pt = pt[,-ncol(pt)] # drop the response column 'classe' (saved separately above)
pt = apply(pt, 2, as.numeric) # convert all predictor variables to numeric
## Warning in apply(pt, 2, as.numeric): NAs introduced by coercion
## (warning repeated for each character column coerced to numeric)
#apply(pt,2,typeof)
pt = pt[,-(1:7)] # drop the first 7 columns (up to num_window): identifiers and timestamps not needed for modelling
dim(pt)
## [1] 19622 152
#library(dplyr)
#glimpse(pt) # verify the required variables are intact
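The ‘cleanData’ helper used next is not defined in this report. A minimal sketch, assuming it simply drops the columns whose NA fraction exceeds 60% as described in the executive summary:
cleanData = function(x, threshold = 0.6) { # assumed definition, not shown in the original report
    naFrac = colMeans(is.na(x)) # fraction of NAs in each column
    x[, naFrac <= threshold, drop = FALSE] # keep columns at or below the threshold
}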
pt = cleanData(pt)
pt = data.frame(pt)
pt$classe = classe # equivalently: pt = mutate(pt, classe = classe)
class(pt$classe)
## [1] "factor"
#glimpse(pt)
dim(pt)
## [1] 19622 53
Training models using rpart, gbm and random forests, and comparing accuracies
There are no missing values, so no imputation is required. Created ‘trainset’ and ‘validset’, and used ‘parallel’ for multi-core processing. Trained 3 models on ‘trainset’ and applied all three techniques: 1) ‘Decision Trees’, with ~50% accuracy on ‘validset’; 2) the ‘GBM’ method, with >96% accuracy on ‘validset’; and 3) the ‘Random Forest’ (RF) method, with >99% accuracy on ‘validset’. Looking at accuracy, both the GBM and RF methods are suitable.
Used 10-fold cross-validation and allowed parallel processing via the ‘trControl’ parameter of the ‘train’ function.
#library(caret)
#pt = pt[sample(n),]
sum(is.na(pt)) # confirm no missing values remain after cleaning
## [1] 0
inTrain = createDataPartition(y = pt$classe, p = 0.9, list = FALSE) # stratified 90/10 split
trainset = pt[inTrain,]
validset = pt[-inTrain,]
library(parallel) # parallelizing for faster output
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)
trctrl = trainControl(method = "cv", number = 10, allowParallel = TRUE) # 10-fold cross-validation
rpart_model = train(classe~., data = trainset, method = "rpart", trControl = trctrl) # decision tree
gbm_model = train(classe~., data = trainset, method = "gbm", trControl = trctrl, verbose = TRUE) # gradient boosting
## Loading required package: gbm
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
##
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
##
## arrange, count, desc, failwith, id, mutate, rename, summarise,
## summarize
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2310
## 2 1.4617 nan 0.1000 0.1632
## 3 1.3600 nan 0.1000 0.1302
## 4 1.2791 nan 0.1000 0.1121
## 5 1.2101 nan 0.1000 0.0877
## 6 1.1548 nan 0.1000 0.0783
## 7 1.1058 nan 0.1000 0.0611
## 8 1.0668 nan 0.1000 0.0571
## 9 1.0301 nan 0.1000 0.0612
## 10 0.9932 nan 0.1000 0.0484
## 20 0.7618 nan 0.1000 0.0222
## 40 0.5427 nan 0.1000 0.0169
## 60 0.4120 nan 0.1000 0.0068
## 80 0.3316 nan 0.1000 0.0058
## 100 0.2727 nan 0.1000 0.0031
## 120 0.2275 nan 0.1000 0.0028
## 140 0.1936 nan 0.1000 0.0014
## 150 0.1808 nan 0.1000 0.0020
rf_model = train(classe~., data = trainset, method = "rf", trControl = trctrl) # random forest
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
##
## combine
## The following object is masked from 'package:ggplot2':
##
## margin
stopCluster(cluster)
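After stopping the cluster, it is good practice to re-register the sequential backend so later train() calls do not look for the stopped cluster (a small addition, not in the original):
registerDoSEQ() # from 'foreach': return to single-threaded processing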
prp(rpart_model$finalModel) # plot the fitted decision tree

rpart_pred = predict(rpart_model,validset)
confusionMatrix(rpart_pred,validset$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 510 156 171 147 36
## B 11 137 6 63 64
## C 37 86 165 111 102
## D 0 0 0 0 0
## E 0 0 0 0 158
##
## Overall Statistics
##
## Accuracy : 0.4949
## 95% CI : (0.4725, 0.5173)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3395
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9140 0.3615 0.48246 0.0000 0.43889
## Specificity 0.6362 0.9089 0.79234 1.0000 1.00000
## Pos Pred Value 0.5000 0.4875 0.32934 NaN 1.00000
## Neg Pred Value 0.9489 0.8559 0.87868 0.8362 0.88790
## Prevalence 0.2847 0.1934 0.17449 0.1638 0.18367
## Detection Rate 0.2602 0.0699 0.08418 0.0000 0.08061
## Detection Prevalence 0.5204 0.1434 0.25561 0.0000 0.08061
## Balanced Accuracy 0.7751 0.6352 0.63740 0.5000 0.71944
gbm_pred = predict(gbm_model,validset)
confusionMatrix(gbm_pred,validset$classe)
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 551 11 0 0 0
## B 6 358 8 0 4
## C 1 10 333 13 3
## D 0 0 1 307 3
## E 0 0 0 1 350
##
## Overall Statistics
##
## Accuracy : 0.9689
## 95% CI : (0.9602, 0.9761)
## No Information Rate : 0.2847
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9606
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9875 0.9446 0.9737 0.9564 0.9722
## Specificity 0.9922 0.9886 0.9833 0.9976 0.9994
## Pos Pred Value 0.9804 0.9521 0.9250 0.9871 0.9972
## Neg Pred Value 0.9950 0.9867 0.9944 0.9915 0.9938
## Prevalence 0.2847 0.1934 0.1745 0.1638 0.1837
## Detection Rate 0.2811 0.1827 0.1699 0.1566 0.1786
## Detection Prevalence 0.2867 0.1918 0.1837 0.1587 0.1791
## Balanced Accuracy 0.9898 0.9666 0.9785 0.9770 0.9858
rf_pred = predict(rf_model, validset)
confAcc = confusionMatrix(rf_pred, validset$classe)
errorRateRF = 1 - confAcc$overall[1] # estimated out-of-sample error on 'validset'
library(scales) # for percent()
percent(errorRateRF)
## [1] "0.459%"