Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
In order to reproduce these results, a certain set of packages needs to be installed, and the pseudo-random seed must be set to the same value used here (set.seed(431) before each model fit below). *Note: to install the caret package in R, run install.packages("caret").
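As a convenience, every package used in this report could be installed in one step; a minimal sketch (assuming none of them are installed yet; ranger and e1071 are pulled in by caret when the random forest model below is fitted):
# install the packages used in this report (run once)
install.packages(c("caret", "gbm", "ipred", "xgboost", "ranger", "e1071"))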
The following libraries were used for this project and should be installed and loaded in the working environment.
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(gbm)
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
library(ipred)
library(xgboost)
The training data set can be found at the following URL:
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
The testing data set can be found at the following URL:
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Load the Data
The train and test data are loaded in the following steps. The data can be read directly from the URLs with the url function, or from a local folder; in the latter case the working directory must be set appropriately. Cells containing "NA", "#DIV/0!", or the empty string are treated as NA.
#train <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
#test <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
setwd("E:/SS/Coursera Data Science Specialization/Practical Machine Learning/Course Project")
train <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
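If the CSV files are not yet present in the working directory, they can be fetched once from the URLs defined above; a minimal sketch using base R's download.file():
# download the data files once if they are not already available locally
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl,  destfile = "pml-testing.csv")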
Snapshot of Data
A snapshot of the first 10 variables can be viewed with the summary function.
summary(train[,c(1:10)])
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1 adelmo :3892 Min. :1.322e+09 Min. : 294
## 1st Qu.: 4906 carlitos:3112 1st Qu.:1.323e+09 1st Qu.:252912
## Median : 9812 charles :3536 Median :1.323e+09 Median :496380
## Mean : 9812 eurico :3070 Mean :1.323e+09 Mean :500656
## 3rd Qu.:14717 jeremy :3402 3rd Qu.:1.323e+09 3rd Qu.:751891
## Max. :19622 pedro :2610 Max. :1.323e+09 Max. :998801
##
## cvtd_timestamp new_window num_window roll_belt
## 28/11/2011 14:14: 1498 no :19216 Min. : 1.0 Min. :-28.90
## 05/12/2011 11:24: 1497 yes: 406 1st Qu.:222.0 1st Qu.: 1.10
## 30/11/2011 17:11: 1440 Median :424.0 Median :113.00
## 05/12/2011 11:25: 1425 Mean :430.6 Mean : 64.41
## 02/12/2011 14:57: 1380 3rd Qu.:644.0 3rd Qu.:123.00
## 02/12/2011 13:34: 1375 Max. :864.0 Max. :162.00
## (Other) :11007
## pitch_belt yaw_belt
## Min. :-55.8000 Min. :-180.00
## 1st Qu.: 1.7600 1st Qu.: -88.30
## Median : 5.2800 Median : -13.00
## Mean : 0.3053 Mean : -11.21
## 3rd Qu.: 14.9000 3rd Qu.: 12.90
## Max. : 60.3000 Max. : 179.00
##
summary(test[,c(1,10)])
## X yaw_belt
## Min. : 1.00 Min. :-93.70
## 1st Qu.: 5.75 1st Qu.:-88.62
## Median :10.50 Median :-87.85
## Mean :10.50 Mean :-59.30
## 3rd Qu.:15.25 3rd Qu.:-63.50
## Max. :20.00 Max. :162.00
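The overall dimensions of both data sets and the distribution of the outcome variable classe can also be checked quickly; a short sketch:
dim(train)           # 19622 observations of 160 variables
dim(test)            # 20 test cases, 160 variables
table(train$classe)  # distribution of the five activity classes A-E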
Let's clean the data in the following steps. The same modifications/transformations should be applied to both the train and test data.
Remove the columns with NA values
Let's remove the columns containing NA values, as most of them have over 19,000 of the 19,622 observations in the train dataset recorded as NA.
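This claim can be verified by counting the missing values per column; a quick sketch (columns are either complete or almost entirely NA):
# number of columns grouped by how many NA values they contain
table(colSums(is.na(train)))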
options(warn=-1)
# flag columns that contain any NA values
ind <- character(160)
for(i in 1:160){
  ind[i] <- ifelse(any(is.na(train[,i])), "yes", "no")
}
sel1 <- names(train)[which(ind=="no")]
train1 <- train[,sel1]
test1 <- test[,sel1[1:59]]  ## sel1[60] is classe, which is not present in the test data
Remove NearZeroVariance Variables
Near-zero-variance variables carry almost no information, so they can be identified and removed with caret's nearZeroVar function.
# Selecting NearZeroVariance Variables
NZVar <- nearZeroVar(train1, saveMetrics=TRUE)
sel2 <- row.names(NZVar)[which(NZVar$nzv==FALSE)]
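The variables flagged by nearZeroVar can be listed directly; a quick check (here this should flag only new_window, since sel2 keeps 59 of train1's 60 columns):
# variables flagged as near zero variance and dropped in the next step
rownames(NZVar)[NZVar$nzv]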
Selecting the remaining variables
# Select the remaining variables
train2 <- train1[, sel2]
test2 <- test1[, sel2[-59]]  ## sel2[59] is classe, which is not present in the test data
Remove the ID variable
The ID variable X in the first column is just the row index and should be removed from the analysis.
# Remove the id variable in first column
train3 <- train2[,-1]
test3 <- test2[,-1]
Convert train3$cvtd_timestamp, a factor with 20 levels, to numeric.
# converting train3$cvtd_timestamp to numeric
train3$cvtd_timestamp <- as.numeric(train3$cvtd_timestamp)
test3$cvtd_timestamp <- as.numeric(test3$cvtd_timestamp)
Cross-validation is performed by randomly subsampling the training data set. This is achieved with the trainControl function in caret, using the arguments method="cv" and number=5, which gives 5-fold cross-validation: the model is fitted on 4 folds of the data and tested on the remaining fold, so each fold in turn acts as the test set. This gives a more reliable estimate of out-of-sample performance.
The summary of the fitted model object shows the cross-validated accuracy, which estimates out-of-sample performance. Accuracy is the proportion of held-out data correctly predicted by the model, and the out-of-sample error is 1 - accuracy. Since the model is refitted several times during cross-validation, once for each candidate tuning-parameter setting, an accuracy is reported for each candidate; the best one is selected as the final model and used for prediction.
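Since the same 5-fold scheme is used for every model below, the control object could be defined once and reused; a minimal sketch (the chunks below simply repeat the trainControl call inline):
# one resampling specification, reusable by every train() call
fitControl <- trainControl(method = "cv", number = 5)
# e.g. train(classe ~ ., data = train3, method = "ranger", trControl = fitControl)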
Random forest and boosting algorithms are known for their ability to detect the features that are important for classification; variable importance for the fitted models can be inspected with caret's varImp (see the sketch after the boosting model below).
Model Random Forest
The random forest algorithm is used here, via the caret package's "ranger" method. The fitted model is then used for prediction on the test3 data.
# model Random Forest method "ranger"
set.seed(431)
modRF <- train(classe ~ ., data=train3, method="ranger", trControl=trainControl(method="cv",number=5))
## Loading required package: e1071
## Loading required package: ranger
# to see summary of model object and final model
modRF
## Random Forest
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9974008 0.9967122
## 31 0.9995923 0.9994843
## 61 0.9991846 0.9989686
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 31.
modRF$finalModel
## Ranger result
##
## Call:
## ranger(.outcome ~ ., data = x, mtry = param$mtry, write.forest = TRUE, probability = classProbs, ...)
##
## Type: Classification
## Number of trees: 500
## Sample size: 19622
## Number of independent variables: 61
## Mtry: 31
## Target node size: 1
## Variable importance mode: none
## OOB prediction error: 0.03 %
# prediction
predRF <- predict(modRF, newdata=test3)
Model Boosting
A boosting algorithm is used here, via the caret package's "gbm" method (stochastic gradient boosting). The fitted model is then used for prediction on the test3 data.
# model gbm method "gbm"
set.seed(431)
modgbm <- train(classe ~ ., data=train3, method="gbm", trControl=trainControl(method="cv",number=5))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1304
## 2 1.5232 nan 0.1000 0.0859
## 3 1.4660 nan 0.1000 0.0660
## 4 1.4224 nan 0.1000 0.0602
## 5 1.3824 nan 0.1000 0.0474
## 6 1.3516 nan 0.1000 0.0461
## 7 1.3228 nan 0.1000 0.0421
## 8 1.2964 nan 0.1000 0.0341
## 9 1.2743 nan 0.1000 0.0361
## 10 1.2494 nan 0.1000 0.0313
## 20 1.0719 nan 0.1000 0.0216
## 40 0.8683 nan 0.1000 0.0140
## 60 0.7357 nan 0.1000 0.0079
## 80 0.6347 nan 0.1000 0.0060
## 100 0.5540 nan 0.1000 0.0055
## 120 0.4880 nan 0.1000 0.0049
## 140 0.4367 nan 0.1000 0.0029
## 150 0.4132 nan 0.1000 0.0033
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2002
## 2 1.4818 nan 0.1000 0.1396
## 3 1.3928 nan 0.1000 0.1110
## 4 1.3210 nan 0.1000 0.0919
## 5 1.2629 nan 0.1000 0.0810
## 6 1.2114 nan 0.1000 0.0730
## 7 1.1661 nan 0.1000 0.0647
## 8 1.1252 nan 0.1000 0.0647
## 9 1.0857 nan 0.1000 0.0562
## 10 1.0500 nan 0.1000 0.0490
## 20 0.8026 nan 0.1000 0.0307
## 40 0.5166 nan 0.1000 0.0156
## 60 0.3594 nan 0.1000 0.0075
## 80 0.2602 nan 0.1000 0.0085
## 100 0.1920 nan 0.1000 0.0048
## 120 0.1431 nan 0.1000 0.0039
## 140 0.1076 nan 0.1000 0.0016
## 150 0.0954 nan 0.1000 0.0028
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2630
## 2 1.4451 nan 0.1000 0.1752
## 3 1.3340 nan 0.1000 0.1307
## 4 1.2530 nan 0.1000 0.1226
## 5 1.1767 nan 0.1000 0.1070
## 6 1.1113 nan 0.1000 0.0938
## 7 1.0492 nan 0.1000 0.0852
## 8 0.9961 nan 0.1000 0.0778
## 9 0.9476 nan 0.1000 0.0784
## 10 0.9017 nan 0.1000 0.0682
## 20 0.6050 nan 0.1000 0.0352
## 40 0.3201 nan 0.1000 0.0135
## 60 0.1874 nan 0.1000 0.0083
## 80 0.1202 nan 0.1000 0.0038
## 100 0.0812 nan 0.1000 0.0025
## 120 0.0599 nan 0.1000 0.0007
## 140 0.0452 nan 0.1000 0.0007
## 150 0.0394 nan 0.1000 0.0011
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1276
## 2 1.5230 nan 0.1000 0.0893
## 3 1.4634 nan 0.1000 0.0666
## 4 1.4201 nan 0.1000 0.0539
## 5 1.3850 nan 0.1000 0.0547
## 6 1.3502 nan 0.1000 0.0467
## 7 1.3206 nan 0.1000 0.0431
## 8 1.2935 nan 0.1000 0.0347
## 9 1.2715 nan 0.1000 0.0383
## 10 1.2459 nan 0.1000 0.0328
## 20 1.0739 nan 0.1000 0.0179
## 40 0.8661 nan 0.1000 0.0132
## 60 0.7340 nan 0.1000 0.0085
## 80 0.6329 nan 0.1000 0.0081
## 100 0.5524 nan 0.1000 0.0059
## 120 0.4902 nan 0.1000 0.0046
## 140 0.4382 nan 0.1000 0.0037
## 150 0.4149 nan 0.1000 0.0037
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1931
## 2 1.4840 nan 0.1000 0.1374
## 3 1.3964 nan 0.1000 0.1160
## 4 1.3231 nan 0.1000 0.0910
## 5 1.2644 nan 0.1000 0.0800
## 6 1.2141 nan 0.1000 0.0774
## 7 1.1668 nan 0.1000 0.0673
## 8 1.1248 nan 0.1000 0.0605
## 9 1.0872 nan 0.1000 0.0502
## 10 1.0554 nan 0.1000 0.0488
## 20 0.7965 nan 0.1000 0.0276
## 40 0.5108 nan 0.1000 0.0198
## 60 0.3502 nan 0.1000 0.0101
## 80 0.2545 nan 0.1000 0.0056
## 100 0.1862 nan 0.1000 0.0044
## 120 0.1387 nan 0.1000 0.0016
## 140 0.1025 nan 0.1000 0.0014
## 150 0.0912 nan 0.1000 0.0021
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2481
## 2 1.4510 nan 0.1000 0.1799
## 3 1.3386 nan 0.1000 0.1366
## 4 1.2518 nan 0.1000 0.1163
## 5 1.1788 nan 0.1000 0.1021
## 6 1.1146 nan 0.1000 0.0933
## 7 1.0556 nan 0.1000 0.0894
## 8 1.0005 nan 0.1000 0.0756
## 9 0.9542 nan 0.1000 0.0666
## 10 0.9122 nan 0.1000 0.0659
## 20 0.6092 nan 0.1000 0.0332
## 40 0.3142 nan 0.1000 0.0138
## 60 0.1942 nan 0.1000 0.0075
## 80 0.1241 nan 0.1000 0.0034
## 100 0.0849 nan 0.1000 0.0027
## 120 0.0600 nan 0.1000 0.0019
## 140 0.0435 nan 0.1000 0.0011
## 150 0.0374 nan 0.1000 0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1246
## 2 1.5245 nan 0.1000 0.0886
## 3 1.4667 nan 0.1000 0.0677
## 4 1.4223 nan 0.1000 0.0617
## 5 1.3829 nan 0.1000 0.0488
## 6 1.3518 nan 0.1000 0.0434
## 7 1.3237 nan 0.1000 0.0440
## 8 1.2973 nan 0.1000 0.0423
## 9 1.2698 nan 0.1000 0.0330
## 10 1.2488 nan 0.1000 0.0338
## 20 1.0744 nan 0.1000 0.0194
## 40 0.8715 nan 0.1000 0.0113
## 60 0.7365 nan 0.1000 0.0089
## 80 0.6331 nan 0.1000 0.0050
## 100 0.5538 nan 0.1000 0.0055
## 120 0.4903 nan 0.1000 0.0034
## 140 0.4369 nan 0.1000 0.0030
## 150 0.4145 nan 0.1000 0.0034
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1977
## 2 1.4840 nan 0.1000 0.1371
## 3 1.3967 nan 0.1000 0.1090
## 4 1.3263 nan 0.1000 0.1007
## 5 1.2630 nan 0.1000 0.0836
## 6 1.2098 nan 0.1000 0.0755
## 7 1.1635 nan 0.1000 0.0669
## 8 1.1215 nan 0.1000 0.0562
## 9 1.0860 nan 0.1000 0.0522
## 10 1.0529 nan 0.1000 0.0500
## 20 0.7916 nan 0.1000 0.0297
## 40 0.5117 nan 0.1000 0.0205
## 60 0.3524 nan 0.1000 0.0081
## 80 0.2508 nan 0.1000 0.0053
## 100 0.1849 nan 0.1000 0.0048
## 120 0.1387 nan 0.1000 0.0019
## 140 0.1051 nan 0.1000 0.0037
## 150 0.0910 nan 0.1000 0.0013
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2516
## 2 1.4488 nan 0.1000 0.1873
## 3 1.3303 nan 0.1000 0.1463
## 4 1.2398 nan 0.1000 0.1099
## 5 1.1695 nan 0.1000 0.1000
## 6 1.1070 nan 0.1000 0.0780
## 7 1.0573 nan 0.1000 0.0891
## 8 1.0034 nan 0.1000 0.0676
## 9 0.9618 nan 0.1000 0.0708
## 10 0.9184 nan 0.1000 0.0642
## 20 0.6097 nan 0.1000 0.0429
## 40 0.3178 nan 0.1000 0.0134
## 60 0.1925 nan 0.1000 0.0061
## 80 0.1252 nan 0.1000 0.0050
## 100 0.0848 nan 0.1000 0.0027
## 120 0.0591 nan 0.1000 0.0023
## 140 0.0440 nan 0.1000 0.0006
## 150 0.0385 nan 0.1000 0.0007
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1276
## 2 1.5236 nan 0.1000 0.0871
## 3 1.4665 nan 0.1000 0.0674
## 4 1.4223 nan 0.1000 0.0519
## 5 1.3874 nan 0.1000 0.0548
## 6 1.3532 nan 0.1000 0.0463
## 7 1.3243 nan 0.1000 0.0384
## 8 1.2997 nan 0.1000 0.0433
## 9 1.2716 nan 0.1000 0.0351
## 10 1.2495 nan 0.1000 0.0291
## 20 1.0754 nan 0.1000 0.0199
## 40 0.8698 nan 0.1000 0.0102
## 60 0.7374 nan 0.1000 0.0100
## 80 0.6351 nan 0.1000 0.0068
## 100 0.5557 nan 0.1000 0.0049
## 120 0.4909 nan 0.1000 0.0045
## 140 0.4378 nan 0.1000 0.0035
## 150 0.4147 nan 0.1000 0.0030
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1909
## 2 1.4853 nan 0.1000 0.1373
## 3 1.3962 nan 0.1000 0.1117
## 4 1.3255 nan 0.1000 0.0911
## 5 1.2685 nan 0.1000 0.0758
## 6 1.2200 nan 0.1000 0.0798
## 7 1.1709 nan 0.1000 0.0742
## 8 1.1262 nan 0.1000 0.0668
## 9 1.0860 nan 0.1000 0.0575
## 10 1.0507 nan 0.1000 0.0455
## 20 0.7961 nan 0.1000 0.0324
## 40 0.5169 nan 0.1000 0.0250
## 60 0.3548 nan 0.1000 0.0070
## 80 0.2517 nan 0.1000 0.0071
## 100 0.1853 nan 0.1000 0.0040
## 120 0.1394 nan 0.1000 0.0054
## 140 0.1067 nan 0.1000 0.0022
## 150 0.0939 nan 0.1000 0.0019
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2513
## 2 1.4497 nan 0.1000 0.1891
## 3 1.3318 nan 0.1000 0.1284
## 4 1.2496 nan 0.1000 0.1230
## 5 1.1737 nan 0.1000 0.0996
## 6 1.1115 nan 0.1000 0.0964
## 7 1.0485 nan 0.1000 0.0733
## 8 1.0006 nan 0.1000 0.0759
## 9 0.9545 nan 0.1000 0.0627
## 10 0.9149 nan 0.1000 0.0687
## 20 0.6069 nan 0.1000 0.0258
## 40 0.3197 nan 0.1000 0.0168
## 60 0.1935 nan 0.1000 0.0083
## 80 0.1224 nan 0.1000 0.0037
## 100 0.0815 nan 0.1000 0.0016
## 120 0.0595 nan 0.1000 0.0013
## 140 0.0438 nan 0.1000 0.0011
## 150 0.0382 nan 0.1000 0.0010
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1272
## 2 1.5230 nan 0.1000 0.0866
## 3 1.4649 nan 0.1000 0.0686
## 4 1.4194 nan 0.1000 0.0529
## 5 1.3845 nan 0.1000 0.0535
## 6 1.3492 nan 0.1000 0.0463
## 7 1.3198 nan 0.1000 0.0431
## 8 1.2932 nan 0.1000 0.0391
## 9 1.2688 nan 0.1000 0.0384
## 10 1.2437 nan 0.1000 0.0298
## 20 1.0715 nan 0.1000 0.0210
## 40 0.8644 nan 0.1000 0.0109
## 60 0.7311 nan 0.1000 0.0090
## 80 0.6298 nan 0.1000 0.0054
## 100 0.5493 nan 0.1000 0.0047
## 120 0.4869 nan 0.1000 0.0038
## 140 0.4347 nan 0.1000 0.0030
## 150 0.4112 nan 0.1000 0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2002
## 2 1.4811 nan 0.1000 0.1453
## 3 1.3899 nan 0.1000 0.1112
## 4 1.3175 nan 0.1000 0.0874
## 5 1.2614 nan 0.1000 0.0913
## 6 1.2042 nan 0.1000 0.0763
## 7 1.1563 nan 0.1000 0.0618
## 8 1.1173 nan 0.1000 0.0575
## 9 1.0808 nan 0.1000 0.0567
## 10 1.0452 nan 0.1000 0.0478
## 20 0.7848 nan 0.1000 0.0277
## 40 0.5030 nan 0.1000 0.0133
## 60 0.3519 nan 0.1000 0.0095
## 80 0.2543 nan 0.1000 0.0100
## 100 0.1838 nan 0.1000 0.0044
## 120 0.1389 nan 0.1000 0.0018
## 140 0.1060 nan 0.1000 0.0017
## 150 0.0932 nan 0.1000 0.0008
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2562
## 2 1.4459 nan 0.1000 0.1850
## 3 1.3311 nan 0.1000 0.1383
## 4 1.2446 nan 0.1000 0.1066
## 5 1.1767 nan 0.1000 0.1083
## 6 1.1101 nan 0.1000 0.0847
## 7 1.0559 nan 0.1000 0.0935
## 8 0.9983 nan 0.1000 0.0691
## 9 0.9563 nan 0.1000 0.0746
## 10 0.9103 nan 0.1000 0.0596
## 20 0.6069 nan 0.1000 0.0406
## 40 0.3183 nan 0.1000 0.0155
## 60 0.1968 nan 0.1000 0.0087
## 80 0.1210 nan 0.1000 0.0048
## 100 0.0816 nan 0.1000 0.0025
## 120 0.0592 nan 0.1000 0.0012
## 140 0.0436 nan 0.1000 0.0015
## 150 0.0377 nan 0.1000 0.0006
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2549
## 2 1.4498 nan 0.1000 0.1748
## 3 1.3407 nan 0.1000 0.1435
## 4 1.2510 nan 0.1000 0.1192
## 5 1.1765 nan 0.1000 0.0964
## 6 1.1156 nan 0.1000 0.0999
## 7 1.0532 nan 0.1000 0.0777
## 8 1.0056 nan 0.1000 0.0806
## 9 0.9571 nan 0.1000 0.0743
## 10 0.9117 nan 0.1000 0.0592
## 20 0.6073 nan 0.1000 0.0247
## 40 0.3204 nan 0.1000 0.0114
## 60 0.1865 nan 0.1000 0.0037
## 80 0.1204 nan 0.1000 0.0054
## 100 0.0805 nan 0.1000 0.0030
## 120 0.0573 nan 0.1000 0.0009
## 140 0.0414 nan 0.1000 0.0010
## 150 0.0366 nan 0.1000 0.0010
# to see summary of model object and final model
modgbm
## Stochastic Gradient Boosting
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8180115 0.7690522
## 1 100 0.8958828 0.8681486
## 1 150 0.9238616 0.9035631
## 2 50 0.9439408 0.9290420
## 2 100 0.9825704 0.9779489
## 2 150 0.9915909 0.9893633
## 3 50 0.9746716 0.9679485
## 3 100 0.9926612 0.9907172
## 3 150 0.9964834 0.9955522
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
modgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 61 predictors of which 36 had non-zero influence.
# prediction
predgbm <- predict(modgbm,newdata=test3)
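As noted earlier, boosting models can rank predictors by their contribution to the fit; caret's varImp summarises this for the gbm model. A short sketch (output not shown):
# relative influence of predictors in the fitted boosting model
gbmImp <- varImp(modgbm)
gbmImp
plot(gbmImp, top = 20)  # plot the 20 most influential predictors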
Model Bagging
A bagging algorithm is used here, via the caret package's "treebag" method (bagged CART). The fitted model is then used for prediction on the test3 data.
# model bagging method "treebag"
set.seed(431)
modbag <- train(classe ~ ., data=train3, method="treebag", trControl=trainControl(method="cv",number=5))
# to see summary of model object and final model
modbag
## Bagged CART
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results:
##
## Accuracy Kappa
## 0.9990826 0.9988396
##
##
modbag$finalModel
##
## Bagging classification trees with 25 bootstrap replications
# prediction
predbag <- predict(modbag,newdata=test3)
Model Extreme Gradient Boosting
The extreme gradient boosting algorithm is used here, via the caret package's "xgbTree" method. The fitted model is then used for prediction on the test3 data.
# model xgboost method "xgbTree"
set.seed(431)
modxgbTree <- train(classe ~ ., data=train3, method="xgbTree", trControl=trainControl(method="cv",number=5))
# to see summary of model object
modxgbTree
## eXtreme Gradient Boosting
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree nrounds Accuracy Kappa
## 0.3 1 0.6 50 0.8084795 0.7570188
## 0.3 1 0.6 100 0.8537360 0.8147445
## 0.3 1 0.6 150 0.8814603 0.8499619
## 0.3 1 0.8 50 0.7903380 0.7338623
## 0.3 1 0.8 100 0.8319748 0.7871383
## 0.3 1 0.8 150 0.8562335 0.8179227
## 0.3 2 0.6 50 0.9393032 0.9232044
## 0.3 2 0.6 100 0.9681996 0.9597723
## 0.3 2 0.6 150 0.9826220 0.9780190
## 0.3 2 0.8 50 0.9299252 0.9113662
## 0.3 2 0.8 100 0.9489346 0.9353973
## 0.3 2 0.8 150 0.9622872 0.9522965
## 0.3 3 0.6 50 0.9729890 0.9658206
## 0.3 3 0.6 100 0.9897053 0.9869772
## 0.3 3 0.6 150 0.9961777 0.9951652
## 0.3 3 0.8 50 0.9625926 0.9526644
## 0.3 3 0.8 100 0.9813980 0.9764666
## 0.3 3 0.8 150 0.9887369 0.9857530
## 0.4 1 0.6 50 0.8322808 0.7874994
## 0.4 1 0.6 100 0.8800325 0.8481384
## 0.4 1 0.6 150 0.9050558 0.8798415
## 0.4 1 0.8 50 0.8191830 0.7708108
## 0.4 1 0.8 100 0.8532785 0.8141549
## 0.4 1 0.8 150 0.8753448 0.8422368
## 0.4 2 0.6 50 0.9517380 0.9389608
## 0.4 2 0.6 100 0.9797676 0.9744055
## 0.4 2 0.6 150 0.9911831 0.9888476
## 0.4 2 0.8 50 0.9437360 0.9288364
## 0.4 2 0.8 100 0.9671277 0.9584268
## 0.4 2 0.8 150 0.9784420 0.9727308
## 0.4 3 0.6 50 0.9836404 0.9793028
## 0.4 3 0.6 100 0.9960755 0.9950360
## 0.4 3 0.6 150 0.9991335 0.9989040
## 0.4 3 0.8 50 0.9727852 0.9655635
## 0.4 3 0.8 100 0.9876670 0.9843982
## 0.4 3 0.8 150 0.9942413 0.9927157
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6 and min_child_weight = 1.
# prediction
predxgbTree <- predict(modxgbTree ,newdata=test3)
Random Forest yielded the best results, with the highest cross-validated accuracy (about 99.96%) of the four models.
The random forest model is therefore applied to the 20 test cases available in the test data.
The Submission data frame is generated and the Submission.csv file is used for submission.
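Before settling on the random forest model, the cross-validated performance of the four fits can be compared side by side with caret's resamples function, and the 20 test-case predictions can be cross-checked; a minimal sketch:
# collect the cross-validated accuracy and kappa of all four models
cvResults <- resamples(list(RF = modRF, GBM = modgbm, BAG = modbag, XGB = modxgbTree))
summary(cvResults)
# compare the 20 test-case predictions across models
data.frame(RF = predRF, GBM = predgbm, BAG = predbag, XGB = predxgbTree)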
Submission = data.frame(X= test$X, Predictions=predRF)
write.csv(Submission, "Submission.csv", row.names=FALSE)