Executive Summary:

Initially, columns with more than 60% NAs were removed from the dataset using the 'cleanData' function, after loading the data and dropping the first 7 columns of the training dataset ('pt'), which are not required.

Created a 'trainset' to train the models and a 'validset' to validate them before applying the algorithms to the 'testset'. Decision trees, GBM and RF methods were used for the analysis. Of these, GBM and RF are close, each with the highest accuracy on 'validset' at > 95%. Cross-validation with 10 folds was used for all 3 methods, supplied via the 'trControl' argument of the 'train' function.

Parallel processing was used for faster training. Finally, NAs were cleaned from the test set the same way as for the 'pt' dataset, and predictions were made with decision trees, GBM and RF on the 'testset'. GBM and RF gave the same predictions. With 99.54% validation accuracy for RF, the estimated out-of-sample error rate is 100% - 99.54% = 0.46%.

Applied to the 20 test cases, the output was 100% accurate: B A B A A E D B A A B C B A E E A B B B.

Loading required libraries for data analysis

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(rpart.plot)
## Loading required package: rpart

Clean Dataset Function

The 'cleanData' function removes NA-heavy columns and checks for columns with near-zero variance. apply() counts the NAs in each column of the dataset; 'rm60col' then flags the columns of 'pmldf' in which more than 60% of the values are NA so they can be dropped. All 52 remaining predictor columns have non-zero variance.

cleanData = function(pmldf) {
  
  dfNA = data.frame(apply(is.na(pmldf), 2, sum)) # count NAs per column; keep as a data frame for further use
  rm60col = dfNA / nrow(pmldf) > 0.6             # flag columns with more than 60% NAs
  
  pmldf = pmldf[, !rm60col]                      # drop the flagged columns
  dim(pmldf)
  #glimpse(pmldf)
  apply(is.na(pmldf), 2, sum) # check again for any missing values
  
  rmNzvCol = nearZeroVar(pmldf, saveMetrics = TRUE) # all FALSE indicates no near-zero-variance variables in the dataset
  rmNzvCol
  #nrow(rmNzvCol)
  dim(pmldf)
  
  return(data.frame(pmldf))
}
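Note that the function only inspects the near-zero-variance metrics; since every column comes back FALSE here, nothing is dropped on that basis. If some columns did show near-zero variance, a minimal sketch of removing them inside the function (using nearZeroVar's default index form) would be:

nzvIdx = nearZeroVar(pmldf)                      # integer indices of near-zero-variance columns
if (length(nzvIdx) > 0) pmldf = pmldf[, -nzvIdx] # drop them before returning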

Loading data and initial processing

Load the pml training dataset, set aside the 'classe' variable, and also drop the first 7 columns, as they are not essential for further analysis. Convert all remaining variables to numeric, call the 'cleanData' function, and then add the 'classe' variable back.

No exploratory analysis is shown, as the main focus is on machine learning.

pt = read.csv("pml-training.csv", na.strings = c("NA","NaN","NULL","#DIV/0!",""," ")) # "#DIV/0!" is the spreadsheet divide-error string that appears in this dataset
#pt = read.csv("pmltrain.csv")
ptbackup = pt

dim(pt)
## [1] 19622   160
#glimpse(pt)
classe = factor(pt$classe)
class(classe)
## [1] "factor"
pt = pt[,-ncol(pt)] # remove the response variable column
pt = apply(pt, 2, as.numeric) # convert all remaining variables to numeric
## Warning in apply(pt, 2, as.numeric): NAs introduced by coercion
## (warning repeated once per coerced column)
#apply(pt,2,typeof)
pt = pt[,-(1:7)]  # drop the columns up to num_window; not required for further analysis
dim(pt)
## [1] 19622   152
#library(dplyr)
#glimpse(pt) # verify the required variables are intact
pt = cleanData(pt)
pt = data.frame(pt)

pt$classe = classe # equivalently: pt = mutate(pt, classe = classe)
class(pt$classe)
## [1] "factor"
#glimpse(pt)
dim(pt)
## [1] 19622    53

Training models with rpart, GBM and random forests and comparing accuracies

No missing values remain, so no imputation is required. Create 'trainset' and 'validset', and use 'parallel'/'doParallel' for multi-core processing. Three models were trained on 'trainset' and evaluated on 'validset': 1) decision trees ('rpart'), with about 50% accuracy; 2) GBM, with > 96% accuracy; and 3) random forests ('RF'), with > 99% accuracy. On accuracy, both the GBM and RF methods are suitable.

Cross-validation with 10 folds, with parallel processing allowed, was specified through the 'trControl' parameter of the 'train' function.
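One detail worth making explicit: createDataPartition() and the cross-validation folds are drawn at random, so repeated runs differ slightly. A minimal sketch (the seed value is arbitrary) to make the run repeatable:

set.seed(1234) # arbitrary seed; fixes the data partition (and the cv folds when training runs sequentially)

With allowParallel = TRUE, fully reproducible fold assignments additionally require the 'seeds' argument of trainControl().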

#library(caret)
#pt = pt[sample(nrow(pt)),] # optional: shuffle rows
sum(is.na(pt))
## [1] 0
inTrain = createDataPartition(y=pt$classe,p=0.9,list = FALSE) 

trainset = pt[inTrain,]
validset = pt[-inTrain,]

library(parallel) # parallelizing for faster output
library(doParallel)
## Loading required package: foreach
## Loading required package: iterators
cluster <- makeCluster(detectCores() - 1) # convention to leave 1 core for OS
registerDoParallel(cluster)

trctrl = trainControl(method = "cv", number = 10, allowParallel = TRUE )
rpart_model = train(classe~., data = trainset, method = "rpart", trControl = trctrl)
gbm_model = train(classe~., data = trainset, method = "gbm", trControl = trctrl,verbose= TRUE)
## Loading required package: gbm
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loaded gbm 2.1.1
## Loading required package: plyr
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## Iter   TrainDeviance   ValidDeviance   StepSize   Improve
##      1        1.6094             nan     0.1000    0.2310
##      2        1.4617             nan     0.1000    0.1632
##      3        1.3600             nan     0.1000    0.1302
##      4        1.2791             nan     0.1000    0.1121
##      5        1.2101             nan     0.1000    0.0877
##      6        1.1548             nan     0.1000    0.0783
##      7        1.1058             nan     0.1000    0.0611
##      8        1.0668             nan     0.1000    0.0571
##      9        1.0301             nan     0.1000    0.0612
##     10        0.9932             nan     0.1000    0.0484
##     20        0.7618             nan     0.1000    0.0222
##     40        0.5427             nan     0.1000    0.0169
##     60        0.4120             nan     0.1000    0.0068
##     80        0.3316             nan     0.1000    0.0058
##    100        0.2727             nan     0.1000    0.0031
##    120        0.2275             nan     0.1000    0.0028
##    140        0.1936             nan     0.1000    0.0014
##    150        0.1808             nan     0.1000    0.0020
rf_model = train(classe~., data = trainset, method = "rf", trControl = trctrl)
## Loading required package: randomForest
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:dplyr':
## 
##     combine
## The following object is masked from 'package:ggplot2':
## 
##     margin
stopCluster(cluster)

prp(rpart_model$finalModel)

rpart_pred = predict(rpart_model,validset)
confusionMatrix(rpart_pred,validset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 510 156 171 147  36
##          B  11 137   6  63  64
##          C  37  86 165 111 102
##          D   0   0   0   0   0
##          E   0   0   0   0 158
## 
## Overall Statistics
##                                           
##                Accuracy : 0.4949          
##                  95% CI : (0.4725, 0.5173)
##     No Information Rate : 0.2847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3395          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9140   0.3615  0.48246   0.0000  0.43889
## Specificity            0.6362   0.9089  0.79234   1.0000  1.00000
## Pos Pred Value         0.5000   0.4875  0.32934      NaN  1.00000
## Neg Pred Value         0.9489   0.8559  0.87868   0.8362  0.88790
## Prevalence             0.2847   0.1934  0.17449   0.1638  0.18367
## Detection Rate         0.2602   0.0699  0.08418   0.0000  0.08061
## Detection Prevalence   0.5204   0.1434  0.25561   0.0000  0.08061
## Balanced Accuracy      0.7751   0.6352  0.63740   0.5000  0.71944
gbm_pred = predict(gbm_model,validset)
confusionMatrix(gbm_pred,validset$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   A   B   C   D   E
##          A 551  11   0   0   0
##          B   6 358   8   0   4
##          C   1  10 333  13   3
##          D   0   0   1 307   3
##          E   0   0   0   1 350
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9689          
##                  95% CI : (0.9602, 0.9761)
##     No Information Rate : 0.2847          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9606          
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9875   0.9446   0.9737   0.9564   0.9722
## Specificity            0.9922   0.9886   0.9833   0.9976   0.9994
## Pos Pred Value         0.9804   0.9521   0.9250   0.9871   0.9972
## Neg Pred Value         0.9950   0.9867   0.9944   0.9915   0.9938
## Prevalence             0.2847   0.1934   0.1745   0.1638   0.1837
## Detection Rate         0.2811   0.1827   0.1699   0.1566   0.1786
## Detection Prevalence   0.2867   0.1918   0.1837   0.1587   0.1791
## Balanced Accuracy      0.9898   0.9666   0.9785   0.9770   0.9858
rf_pred= predict(rf_model,validset)
confAcc=confusionMatrix(rf_pred,validset$classe)

errorRateRF = 1-confAcc$overall[1]
library(scales)
percent(errorRateRF)
## [1] "0.459%"

Load the test set, preprocess the data and predict on the pml-testing set

Load the pml test set, remove the first 7 columns, and apply the 'cleanData' function to remove the NA columns. 'problem_id' is set aside as it is not needed for prediction. Both GBM and RF give the same output on the test set, with 100% accurate predictions (all 20 questions in the assignment answered correctly).

testset = read.csv("pml-testing.csv")
dim(testset)
## [1]  20 160
#glimpse(testset)

testset = testset[,-c(1:7)]

testset1 = cleanData(testset) # remove NA columns from the test set
dim(testset1)
## [1] 20 53
Pred_cases= testset1$problem_id
testset1$problem_id = NULL
rpart.test_pred = predict(rpart_model, newdata = testset1)
rpart.test_pred
##  [1] C A C A A C C A A A C C C A C A A A A C
## Levels: A B C D E
gbm.test_pred = predict(gbm_model, newdata = testset1)
gbm.test_pred
##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E
rf.test_pred = predict(rf_model, newdata = testset1)
data.frame(Pred_cases, rf.test_pred)
##    Pred_cases rf.test_pred
## 1           1            B
## 2           2            A
## 3           3            B
## 4           4            A
## 5           5            A
## 6           6            E
## 7           7            D
## 8           8            B
## 9           9            A
## 10         10            A
## 11         11            B
## 12         12            C
## 13         13            B
## 14         14            A
## 15         15            E
## 16         16            E
## 17         17            A
## 18         18            B
## 19         19            B
## 20         20            B
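For completeness, the course assignment expects one text file per predicted case. A minimal sketch of writing the random-forest predictions in that format (a hedged helper following the pattern in the course submission instructions; the problem_id_N.txt filenames are an assumption):

pml_write_files = function(x) {
  for (i in seq_along(x)) {
    write.table(x[i], file = paste0("problem_id_", i, ".txt"),
                quote = FALSE, row.names = FALSE, col.names = FALSE)
  }
}
pml_write_files(as.character(rf.test_pred))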