Using devices such as Jawbone Up, Nike FuelBand, and Fitbit, it is now possible to collect a large amount of data about personal activity relatively inexpensively. These types of devices are part of the quantified self movement - a group of enthusiasts who take measurements about themselves regularly to improve their health, to find patterns in their behavior, or because they are tech geeks. People regularly quantify how much of a particular activity they do, but they rarely quantify how well they do it. In this project, the goal is to use data from accelerometers on the belt, forearm, arm, and dumbbell of 6 participants, who were asked to perform barbell lifts correctly and incorrectly in 5 different ways. More information is available from the website here: http://groupware.les.inf.puc-rio.br/har (see the section on the Weight Lifting Exercise Dataset).
The training data for this project are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv
The test data are available here: https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv
The data for this project come from this source: http://groupware.les.inf.puc-rio.br/har. If you use the document you create for this class for any purpose, please cite them, as they have been very generous in allowing their data to be used for this kind of assignment.
In order to reproduce these results, a certain set of packages needs to be installed, and the pseudo-random seed must be set to the same value used here (set.seed(431) before each model fit below). *Note: to install the caret package in R, run install.packages("caret").
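As a convenience, every package used in this report could be installed in one step; a minimal sketch (assuming none of them are installed yet; ranger and e1071 are pulled in by caret when the random forest model below is fitted):
# install the packages used in this report (run once)
install.packages(c("caret", "gbm", "ipred", "xgboost", "ranger", "e1071"))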
The following libraries were used for this project and should be installed and loaded in the working environment.
library(caret)
## Warning: package 'caret' was built under R version 3.2.5
## Loading required package: lattice
## Loading required package: ggplot2
## Warning: package 'ggplot2' was built under R version 3.2.3
library(gbm)
## Warning: package 'gbm' was built under R version 3.2.3
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.1
library(ipred)
library(xgboost)
The training data set can be found at the following URL:
trainUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
The testing data set can be found at the following URL:
testUrl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
Load the Data
The train and test data are loaded in the following steps. The data can be read directly from the URLs with the url function, or from a local folder; in the latter case the working directory must be set appropriately. Cells containing "NA", "#DIV/0!", or the empty string are treated as NA.
#train <- read.csv(url(trainUrl), na.strings=c("NA","#DIV/0!",""))
#test <- read.csv(url(testUrl), na.strings=c("NA","#DIV/0!",""))
setwd("E:/SS/Coursera Data Science Specialization/Practical Machine Learning/Course Project")
train <- read.csv("pml-training.csv", na.strings=c("NA","#DIV/0!",""))
test <- read.csv("pml-testing.csv", na.strings=c("NA","#DIV/0!",""))
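If the CSV files are not yet present in the working directory, they can be fetched once from the URLs defined above; a minimal sketch using base R's download.file():
# download the data files once if they are not already available locally
if (!file.exists("pml-training.csv")) download.file(trainUrl, destfile = "pml-training.csv")
if (!file.exists("pml-testing.csv"))  download.file(testUrl,  destfile = "pml-testing.csv")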
Snapshot of Data
A snapshot of the first 10 variables can be viewed with the summary function.
summary(train[,c(1:10)])
## X user_name raw_timestamp_part_1 raw_timestamp_part_2
## Min. : 1 adelmo :3892 Min. :1.322e+09 Min. : 294
## 1st Qu.: 4906 carlitos:3112 1st Qu.:1.323e+09 1st Qu.:252912
## Median : 9812 charles :3536 Median :1.323e+09 Median :496380
## Mean : 9812 eurico :3070 Mean :1.323e+09 Mean :500656
## 3rd Qu.:14717 jeremy :3402 3rd Qu.:1.323e+09 3rd Qu.:751891
## Max. :19622 pedro :2610 Max. :1.323e+09 Max. :998801
##
## cvtd_timestamp new_window num_window roll_belt
## 28/11/2011 14:14: 1498 no :19216 Min. : 1.0 Min. :-28.90
## 05/12/2011 11:24: 1497 yes: 406 1st Qu.:222.0 1st Qu.: 1.10
## 30/11/2011 17:11: 1440 Median :424.0 Median :113.00
## 05/12/2011 11:25: 1425 Mean :430.6 Mean : 64.41
## 02/12/2011 14:57: 1380 3rd Qu.:644.0 3rd Qu.:123.00
## 02/12/2011 13:34: 1375 Max. :864.0 Max. :162.00
## (Other) :11007
## pitch_belt yaw_belt
## Min. :-55.8000 Min. :-180.00
## 1st Qu.: 1.7600 1st Qu.: -88.30
## Median : 5.2800 Median : -13.00
## Mean : 0.3053 Mean : -11.21
## 3rd Qu.: 14.9000 3rd Qu.: 12.90
## Max. : 60.3000 Max. : 179.00
##
summary(test[,c(1,10)])
## X yaw_belt
## Min. : 1.00 Min. :-93.70
## 1st Qu.: 5.75 1st Qu.:-88.62
## Median :10.50 Median :-87.85
## Mean :10.50 Mean :-59.30
## 3rd Qu.:15.25 3rd Qu.:-63.50
## Max. :20.00 Max. :162.00
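The overall dimensions of both data sets and the distribution of the outcome variable classe can also be checked quickly; a short sketch:
dim(train)           # 19622 observations of 160 variables
dim(test)            # 20 test cases, 160 variables
table(train$classe)  # distribution of the five activity classes A-E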
Let's clean the data in the following steps. The same modifications/transformations should be applied to both the train and test data.
Remove the columns with NA values
Let's remove the columns containing NA values, as most of them have over 19,000 of the 19,622 observations in the train dataset recorded as NA.
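This claim can be verified by counting the missing values per column; a quick sketch (columns are either complete or almost entirely NA):
# number of columns grouped by how many NA values they contain
table(colSums(is.na(train)))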
options(warn=-1)
# flag columns that contain any NA values
ind <- character(160)
for(i in 1:160){
  ind[i] <- ifelse(any(is.na(train[,i])), "yes", "no")
}
sel1 <- names(train)[which(ind=="no")]
train1 <- train[,sel1]
test1 <- test[,sel1[1:59]]  ## sel1[60] is classe, which is not present in the test data
Remove NearZeroVariance Variables
Near-zero-variance variables carry almost no information, so they can be identified and removed with caret's nearZeroVar function.
# Selecting NearZeroVariance Variables
NZVar <- nearZeroVar(train1, saveMetrics=TRUE)
sel2 <- row.names(NZVar)[which(NZVar$nzv==FALSE)]
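The variables flagged by nearZeroVar can be listed directly; a quick check (here this should flag only new_window, since sel2 keeps 59 of train1's 60 columns):
# variables flagged as near zero variance and dropped in the next step
rownames(NZVar)[NZVar$nzv]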
Selecting the remaining variables
# Select the remaining variables
train2 <- train1[, sel2]
test2 <- test1[, sel2[-59]]  ## sel2[59] is classe, which is not present in the test data
Remove the ID variable
The ID variable X in the first column is just the row index and should be removed from the analysis.
# Remove the id variable in first column
train3 <- train2[,-1]
test3 <- test2[,-1]
Convert train3$cvtd_timestamp, a factor with 20 levels, to numeric.
# converting train3$cvtd_timestamp to numeric
train3$cvtd_timestamp <- as.numeric(train3$cvtd_timestamp)
test3$cvtd_timestamp <- as.numeric(test3$cvtd_timestamp)
Cross-validation is performed by randomly subsampling the training data set. This is achieved with the trainControl function in caret, using the arguments method="cv" and number=5, which gives 5-fold cross-validation: the model is fitted on 4 folds of the data and tested on the remaining fold, so each fold in turn acts as the test set. This gives a more reliable estimate of out-of-sample performance.
The summary of the fitted model object shows the cross-validated accuracy, which estimates out-of-sample performance. Accuracy is the proportion of held-out data correctly predicted by the model, and the out-of-sample error is 1 - accuracy. Since the model is refitted several times during cross-validation, once for each candidate tuning-parameter setting, an accuracy is reported for each candidate; the best one is selected as the final model and used for prediction.
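Since the same 5-fold scheme is used for every model below, the control object could be defined once and reused; a minimal sketch (the chunks below simply repeat the trainControl call inline):
# one resampling specification, reusable by every train() call
fitControl <- trainControl(method = "cv", number = 5)
# e.g. train(classe ~ ., data = train3, method = "ranger", trControl = fitControl)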
Random forest and boosting algorithms are known for their ability to detect the features that are important for classification; variable importance for the fitted models can be inspected with caret's varImp (see the sketch after the boosting model below).
Model Random Forest
The random forest algorithm is used here, via the caret package's "ranger" method. The fitted model is then used for prediction on the test3 data.
# model Random Forest method "ranger"
set.seed(431)
modRF <- train(classe ~ ., data=train3, method="ranger", trControl=trainControl(method="cv",number=5))
## Loading required package: e1071
## Loading required package: ranger
# to see summary of model object and final model
modRF
## Random Forest
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## mtry Accuracy Kappa
## 2 0.9974008 0.9967122
## 31 0.9995923 0.9994843
## 61 0.9991846 0.9989686
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 31.
modRF$finalModel
## Ranger result
##
## Call:
## ranger(.outcome ~ ., data = x, mtry = param$mtry, write.forest = TRUE, probability = classProbs, ...)
##
## Type: Classification
## Number of trees: 500
## Sample size: 19622
## Number of independent variables: 61
## Mtry: 31
## Target node size: 1
## Variable importance mode: none
## OOB prediction error: 0.03 %
# prediction
predRF <- predict(modRF, newdata=test3)
Model Boosting
A boosting algorithm is used here, via the caret package's "gbm" method (stochastic gradient boosting). The fitted model is then used for prediction on the test3 data.
# model gbm method "gbm"
set.seed(431)
modgbm <- train(classe ~ ., data=train3, method="gbm", trControl=trainControl(method="cv",number=5))
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1304
## 2 1.5232 nan 0.1000 0.0859
## 3 1.4660 nan 0.1000 0.0660
## 4 1.4224 nan 0.1000 0.0602
## 5 1.3824 nan 0.1000 0.0474
## 6 1.3516 nan 0.1000 0.0461
## 7 1.3228 nan 0.1000 0.0421
## 8 1.2964 nan 0.1000 0.0341
## 9 1.2743 nan 0.1000 0.0361
## 10 1.2494 nan 0.1000 0.0313
## 20 1.0719 nan 0.1000 0.0216
## 40 0.8683 nan 0.1000 0.0140
## 60 0.7357 nan 0.1000 0.0079
## 80 0.6347 nan 0.1000 0.0060
## 100 0.5540 nan 0.1000 0.0055
## 120 0.4880 nan 0.1000 0.0049
## 140 0.4367 nan 0.1000 0.0029
## 150 0.4132 nan 0.1000 0.0033
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2002
## 2 1.4818 nan 0.1000 0.1396
## 3 1.3928 nan 0.1000 0.1110
## 4 1.3210 nan 0.1000 0.0919
## 5 1.2629 nan 0.1000 0.0810
## 6 1.2114 nan 0.1000 0.0730
## 7 1.1661 nan 0.1000 0.0647
## 8 1.1252 nan 0.1000 0.0647
## 9 1.0857 nan 0.1000 0.0562
## 10 1.0500 nan 0.1000 0.0490
## 20 0.8026 nan 0.1000 0.0307
## 40 0.5166 nan 0.1000 0.0156
## 60 0.3594 nan 0.1000 0.0075
## 80 0.2602 nan 0.1000 0.0085
## 100 0.1920 nan 0.1000 0.0048
## 120 0.1431 nan 0.1000 0.0039
## 140 0.1076 nan 0.1000 0.0016
## 150 0.0954 nan 0.1000 0.0028
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2630
## 2 1.4451 nan 0.1000 0.1752
## 3 1.3340 nan 0.1000 0.1307
## 4 1.2530 nan 0.1000 0.1226
## 5 1.1767 nan 0.1000 0.1070
## 6 1.1113 nan 0.1000 0.0938
## 7 1.0492 nan 0.1000 0.0852
## 8 0.9961 nan 0.1000 0.0778
## 9 0.9476 nan 0.1000 0.0784
## 10 0.9017 nan 0.1000 0.0682
## 20 0.6050 nan 0.1000 0.0352
## 40 0.3201 nan 0.1000 0.0135
## 60 0.1874 nan 0.1000 0.0083
## 80 0.1202 nan 0.1000 0.0038
## 100 0.0812 nan 0.1000 0.0025
## 120 0.0599 nan 0.1000 0.0007
## 140 0.0452 nan 0.1000 0.0007
## 150 0.0394 nan 0.1000 0.0011
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1276
## 2 1.5230 nan 0.1000 0.0893
## 3 1.4634 nan 0.1000 0.0666
## 4 1.4201 nan 0.1000 0.0539
## 5 1.3850 nan 0.1000 0.0547
## 6 1.3502 nan 0.1000 0.0467
## 7 1.3206 nan 0.1000 0.0431
## 8 1.2935 nan 0.1000 0.0347
## 9 1.2715 nan 0.1000 0.0383
## 10 1.2459 nan 0.1000 0.0328
## 20 1.0739 nan 0.1000 0.0179
## 40 0.8661 nan 0.1000 0.0132
## 60 0.7340 nan 0.1000 0.0085
## 80 0.6329 nan 0.1000 0.0081
## 100 0.5524 nan 0.1000 0.0059
## 120 0.4902 nan 0.1000 0.0046
## 140 0.4382 nan 0.1000 0.0037
## 150 0.4149 nan 0.1000 0.0037
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1931
## 2 1.4840 nan 0.1000 0.1374
## 3 1.3964 nan 0.1000 0.1160
## 4 1.3231 nan 0.1000 0.0910
## 5 1.2644 nan 0.1000 0.0800
## 6 1.2141 nan 0.1000 0.0774
## 7 1.1668 nan 0.1000 0.0673
## 8 1.1248 nan 0.1000 0.0605
## 9 1.0872 nan 0.1000 0.0502
## 10 1.0554 nan 0.1000 0.0488
## 20 0.7965 nan 0.1000 0.0276
## 40 0.5108 nan 0.1000 0.0198
## 60 0.3502 nan 0.1000 0.0101
## 80 0.2545 nan 0.1000 0.0056
## 100 0.1862 nan 0.1000 0.0044
## 120 0.1387 nan 0.1000 0.0016
## 140 0.1025 nan 0.1000 0.0014
## 150 0.0912 nan 0.1000 0.0021
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2481
## 2 1.4510 nan 0.1000 0.1799
## 3 1.3386 nan 0.1000 0.1366
## 4 1.2518 nan 0.1000 0.1163
## 5 1.1788 nan 0.1000 0.1021
## 6 1.1146 nan 0.1000 0.0933
## 7 1.0556 nan 0.1000 0.0894
## 8 1.0005 nan 0.1000 0.0756
## 9 0.9542 nan 0.1000 0.0666
## 10 0.9122 nan 0.1000 0.0659
## 20 0.6092 nan 0.1000 0.0332
## 40 0.3142 nan 0.1000 0.0138
## 60 0.1942 nan 0.1000 0.0075
## 80 0.1241 nan 0.1000 0.0034
## 100 0.0849 nan 0.1000 0.0027
## 120 0.0600 nan 0.1000 0.0019
## 140 0.0435 nan 0.1000 0.0011
## 150 0.0374 nan 0.1000 0.0004
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1246
## 2 1.5245 nan 0.1000 0.0886
## 3 1.4667 nan 0.1000 0.0677
## 4 1.4223 nan 0.1000 0.0617
## 5 1.3829 nan 0.1000 0.0488
## 6 1.3518 nan 0.1000 0.0434
## 7 1.3237 nan 0.1000 0.0440
## 8 1.2973 nan 0.1000 0.0423
## 9 1.2698 nan 0.1000 0.0330
## 10 1.2488 nan 0.1000 0.0338
## 20 1.0744 nan 0.1000 0.0194
## 40 0.8715 nan 0.1000 0.0113
## 60 0.7365 nan 0.1000 0.0089
## 80 0.6331 nan 0.1000 0.0050
## 100 0.5538 nan 0.1000 0.0055
## 120 0.4903 nan 0.1000 0.0034
## 140 0.4369 nan 0.1000 0.0030
## 150 0.4145 nan 0.1000 0.0034
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1977
## 2 1.4840 nan 0.1000 0.1371
## 3 1.3967 nan 0.1000 0.1090
## 4 1.3263 nan 0.1000 0.1007
## 5 1.2630 nan 0.1000 0.0836
## 6 1.2098 nan 0.1000 0.0755
## 7 1.1635 nan 0.1000 0.0669
## 8 1.1215 nan 0.1000 0.0562
## 9 1.0860 nan 0.1000 0.0522
## 10 1.0529 nan 0.1000 0.0500
## 20 0.7916 nan 0.1000 0.0297
## 40 0.5117 nan 0.1000 0.0205
## 60 0.3524 nan 0.1000 0.0081
## 80 0.2508 nan 0.1000 0.0053
## 100 0.1849 nan 0.1000 0.0048
## 120 0.1387 nan 0.1000 0.0019
## 140 0.1051 nan 0.1000 0.0037
## 150 0.0910 nan 0.1000 0.0013
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2516
## 2 1.4488 nan 0.1000 0.1873
## 3 1.3303 nan 0.1000 0.1463
## 4 1.2398 nan 0.1000 0.1099
## 5 1.1695 nan 0.1000 0.1000
## 6 1.1070 nan 0.1000 0.0780
## 7 1.0573 nan 0.1000 0.0891
## 8 1.0034 nan 0.1000 0.0676
## 9 0.9618 nan 0.1000 0.0708
## 10 0.9184 nan 0.1000 0.0642
## 20 0.6097 nan 0.1000 0.0429
## 40 0.3178 nan 0.1000 0.0134
## 60 0.1925 nan 0.1000 0.0061
## 80 0.1252 nan 0.1000 0.0050
## 100 0.0848 nan 0.1000 0.0027
## 120 0.0591 nan 0.1000 0.0023
## 140 0.0440 nan 0.1000 0.0006
## 150 0.0385 nan 0.1000 0.0007
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1276
## 2 1.5236 nan 0.1000 0.0871
## 3 1.4665 nan 0.1000 0.0674
## 4 1.4223 nan 0.1000 0.0519
## 5 1.3874 nan 0.1000 0.0548
## 6 1.3532 nan 0.1000 0.0463
## 7 1.3243 nan 0.1000 0.0384
## 8 1.2997 nan 0.1000 0.0433
## 9 1.2716 nan 0.1000 0.0351
## 10 1.2495 nan 0.1000 0.0291
## 20 1.0754 nan 0.1000 0.0199
## 40 0.8698 nan 0.1000 0.0102
## 60 0.7374 nan 0.1000 0.0100
## 80 0.6351 nan 0.1000 0.0068
## 100 0.5557 nan 0.1000 0.0049
## 120 0.4909 nan 0.1000 0.0045
## 140 0.4378 nan 0.1000 0.0035
## 150 0.4147 nan 0.1000 0.0030
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1909
## 2 1.4853 nan 0.1000 0.1373
## 3 1.3962 nan 0.1000 0.1117
## 4 1.3255 nan 0.1000 0.0911
## 5 1.2685 nan 0.1000 0.0758
## 6 1.2200 nan 0.1000 0.0798
## 7 1.1709 nan 0.1000 0.0742
## 8 1.1262 nan 0.1000 0.0668
## 9 1.0860 nan 0.1000 0.0575
## 10 1.0507 nan 0.1000 0.0455
## 20 0.7961 nan 0.1000 0.0324
## 40 0.5169 nan 0.1000 0.0250
## 60 0.3548 nan 0.1000 0.0070
## 80 0.2517 nan 0.1000 0.0071
## 100 0.1853 nan 0.1000 0.0040
## 120 0.1394 nan 0.1000 0.0054
## 140 0.1067 nan 0.1000 0.0022
## 150 0.0939 nan 0.1000 0.0019
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2513
## 2 1.4497 nan 0.1000 0.1891
## 3 1.3318 nan 0.1000 0.1284
## 4 1.2496 nan 0.1000 0.1230
## 5 1.1737 nan 0.1000 0.0996
## 6 1.1115 nan 0.1000 0.0964
## 7 1.0485 nan 0.1000 0.0733
## 8 1.0006 nan 0.1000 0.0759
## 9 0.9545 nan 0.1000 0.0627
## 10 0.9149 nan 0.1000 0.0687
## 20 0.6069 nan 0.1000 0.0258
## 40 0.3197 nan 0.1000 0.0168
## 60 0.1935 nan 0.1000 0.0083
## 80 0.1224 nan 0.1000 0.0037
## 100 0.0815 nan 0.1000 0.0016
## 120 0.0595 nan 0.1000 0.0013
## 140 0.0438 nan 0.1000 0.0011
## 150 0.0382 nan 0.1000 0.0010
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1272
## 2 1.5230 nan 0.1000 0.0866
## 3 1.4649 nan 0.1000 0.0686
## 4 1.4194 nan 0.1000 0.0529
## 5 1.3845 nan 0.1000 0.0535
## 6 1.3492 nan 0.1000 0.0463
## 7 1.3198 nan 0.1000 0.0431
## 8 1.2932 nan 0.1000 0.0391
## 9 1.2688 nan 0.1000 0.0384
## 10 1.2437 nan 0.1000 0.0298
## 20 1.0715 nan 0.1000 0.0210
## 40 0.8644 nan 0.1000 0.0109
## 60 0.7311 nan 0.1000 0.0090
## 80 0.6298 nan 0.1000 0.0054
## 100 0.5493 nan 0.1000 0.0047
## 120 0.4869 nan 0.1000 0.0038
## 140 0.4347 nan 0.1000 0.0030
## 150 0.4112 nan 0.1000 0.0014
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2002
## 2 1.4811 nan 0.1000 0.1453
## 3 1.3899 nan 0.1000 0.1112
## 4 1.3175 nan 0.1000 0.0874
## 5 1.2614 nan 0.1000 0.0913
## 6 1.2042 nan 0.1000 0.0763
## 7 1.1563 nan 0.1000 0.0618
## 8 1.1173 nan 0.1000 0.0575
## 9 1.0808 nan 0.1000 0.0567
## 10 1.0452 nan 0.1000 0.0478
## 20 0.7848 nan 0.1000 0.0277
## 40 0.5030 nan 0.1000 0.0133
## 60 0.3519 nan 0.1000 0.0095
## 80 0.2543 nan 0.1000 0.0100
## 100 0.1838 nan 0.1000 0.0044
## 120 0.1389 nan 0.1000 0.0018
## 140 0.1060 nan 0.1000 0.0017
## 150 0.0932 nan 0.1000 0.0008
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2562
## 2 1.4459 nan 0.1000 0.1850
## 3 1.3311 nan 0.1000 0.1383
## 4 1.2446 nan 0.1000 0.1066
## 5 1.1767 nan 0.1000 0.1083
## 6 1.1101 nan 0.1000 0.0847
## 7 1.0559 nan 0.1000 0.0935
## 8 0.9983 nan 0.1000 0.0691
## 9 0.9563 nan 0.1000 0.0746
## 10 0.9103 nan 0.1000 0.0596
## 20 0.6069 nan 0.1000 0.0406
## 40 0.3183 nan 0.1000 0.0155
## 60 0.1968 nan 0.1000 0.0087
## 80 0.1210 nan 0.1000 0.0048
## 100 0.0816 nan 0.1000 0.0025
## 120 0.0592 nan 0.1000 0.0012
## 140 0.0436 nan 0.1000 0.0015
## 150 0.0377 nan 0.1000 0.0006
##
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.2549
## 2 1.4498 nan 0.1000 0.1748
## 3 1.3407 nan 0.1000 0.1435
## 4 1.2510 nan 0.1000 0.1192
## 5 1.1765 nan 0.1000 0.0964
## 6 1.1156 nan 0.1000 0.0999
## 7 1.0532 nan 0.1000 0.0777
## 8 1.0056 nan 0.1000 0.0806
## 9 0.9571 nan 0.1000 0.0743
## 10 0.9117 nan 0.1000 0.0592
## 20 0.6073 nan 0.1000 0.0247
## 40 0.3204 nan 0.1000 0.0114
## 60 0.1865 nan 0.1000 0.0037
## 80 0.1204 nan 0.1000 0.0054
## 100 0.0805 nan 0.1000 0.0030
## 120 0.0573 nan 0.1000 0.0009
## 140 0.0414 nan 0.1000 0.0010
## 150 0.0366 nan 0.1000 0.0010
# to see summary of model object and final model
modgbm
## Stochastic Gradient Boosting
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## interaction.depth n.trees Accuracy Kappa
## 1 50 0.8180115 0.7690522
## 1 100 0.8958828 0.8681486
## 1 150 0.9238616 0.9035631
## 2 50 0.9439408 0.9290420
## 2 100 0.9825704 0.9779489
## 2 150 0.9915909 0.9893633
## 3 50 0.9746716 0.9679485
## 3 100 0.9926612 0.9907172
## 3 150 0.9964834 0.9955522
##
## Tuning parameter 'shrinkage' was held constant at a value of 0.1
##
## Tuning parameter 'n.minobsinnode' was held constant at a value of 10
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were n.trees = 150,
## interaction.depth = 3, shrinkage = 0.1 and n.minobsinnode = 10.
modgbm$finalModel
## A gradient boosted model with multinomial loss function.
## 150 iterations were performed.
## There were 61 predictors of which 36 had non-zero influence.
# prediction
predgbm <- predict(modgbm,newdata=test3)
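As noted earlier, boosting models can rank predictors by their contribution to the fit; caret's varImp summarises this for the gbm model. A short sketch (output not shown):
# relative influence of predictors in the fitted boosting model
gbmImp <- varImp(modgbm)
gbmImp
plot(gbmImp, top = 20)  # plot the 20 most influential predictors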
Model Bagging
A bagging algorithm is used here, via the caret package's "treebag" method (bagged CART). The fitted model is then used for prediction on the test3 data.
# model bagging method "treebag"
set.seed(431)
modbag <- train(classe ~ ., data=train3, method="treebag", trControl=trainControl(method="cv",number=5))
# to see summary of model object and final model
modbag
## Bagged CART
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results:
##
## Accuracy Kappa
## 0.9990826 0.9988396
##
##
modbag$finalModel
##
## Bagging classification trees with 25 bootstrap replications
# prediction
predbag <- predict(modbag,newdata=test3)
Model Extreme Gradient Boosting
The extreme gradient boosting algorithm is used here, via the caret package's "xgbTree" method. The fitted model is then used for prediction on the test3 data.
# model xgboost method "xgbTree"
set.seed(431)
modxgbTree <- train(classe ~ ., data=train3, method="xgbTree", trControl=trainControl(method="cv",number=5))
# to see summary of model object
modxgbTree
## eXtreme Gradient Boosting
##
## 19622 samples
## 57 predictor
## 5 classes: 'A', 'B', 'C', 'D', 'E'
##
## No pre-processing
## Resampling: Cross-Validated (5 fold)
## Summary of sample sizes: 15699, 15697, 15696, 15699, 15697
## Resampling results across tuning parameters:
##
## eta max_depth colsample_bytree nrounds Accuracy Kappa
## 0.3 1 0.6 50 0.8084795 0.7570188
## 0.3 1 0.6 100 0.8537360 0.8147445
## 0.3 1 0.6 150 0.8814603 0.8499619
## 0.3 1 0.8 50 0.7903380 0.7338623
## 0.3 1 0.8 100 0.8319748 0.7871383
## 0.3 1 0.8 150 0.8562335 0.8179227
## 0.3 2 0.6 50 0.9393032 0.9232044
## 0.3 2 0.6 100 0.9681996 0.9597723
## 0.3 2 0.6 150 0.9826220 0.9780190
## 0.3 2 0.8 50 0.9299252 0.9113662
## 0.3 2 0.8 100 0.9489346 0.9353973
## 0.3 2 0.8 150 0.9622872 0.9522965
## 0.3 3 0.6 50 0.9729890 0.9658206
## 0.3 3 0.6 100 0.9897053 0.9869772
## 0.3 3 0.6 150 0.9961777 0.9951652
## 0.3 3 0.8 50 0.9625926 0.9526644
## 0.3 3 0.8 100 0.9813980 0.9764666
## 0.3 3 0.8 150 0.9887369 0.9857530
## 0.4 1 0.6 50 0.8322808 0.7874994
## 0.4 1 0.6 100 0.8800325 0.8481384
## 0.4 1 0.6 150 0.9050558 0.8798415
## 0.4 1 0.8 50 0.8191830 0.7708108
## 0.4 1 0.8 100 0.8532785 0.8141549
## 0.4 1 0.8 150 0.8753448 0.8422368
## 0.4 2 0.6 50 0.9517380 0.9389608
## 0.4 2 0.6 100 0.9797676 0.9744055
## 0.4 2 0.6 150 0.9911831 0.9888476
## 0.4 2 0.8 50 0.9437360 0.9288364
## 0.4 2 0.8 100 0.9671277 0.9584268
## 0.4 2 0.8 150 0.9784420 0.9727308
## 0.4 3 0.6 50 0.9836404 0.9793028
## 0.4 3 0.6 100 0.9960755 0.9950360
## 0.4 3 0.6 150 0.9991335 0.9989040
## 0.4 3 0.8 50 0.9727852 0.9655635
## 0.4 3 0.8 100 0.9876670 0.9843982
## 0.4 3 0.8 150 0.9942413 0.9927157
##
## Tuning parameter 'gamma' was held constant at a value of 0
##
## Tuning parameter 'min_child_weight' was held constant at a value of 1
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were nrounds = 150, max_depth = 3,
## eta = 0.4, gamma = 0, colsample_bytree = 0.6 and min_child_weight = 1.
# prediction
predxgbTree <- predict(modxgbTree ,newdata=test3)
Random Forest yielded the best results, with the highest cross-validated accuracy (about 99.96%) of the four models.
The random forest model is therefore applied to the 20 test cases available in the test data.
The Submission data frame is generated and the Submission.csv file is used for submission.
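Before settling on the random forest model, the cross-validated performance of the four fits can be compared side by side with caret's resamples function, and the 20 test-case predictions can be cross-checked; a minimal sketch:
# collect the cross-validated accuracy and kappa of all four models
cvResults <- resamples(list(RF = modRF, GBM = modgbm, BAG = modbag, XGB = modxgbTree))
summary(cvResults)
# compare the 20 test-case predictions across models
data.frame(RF = predRF, GBM = predgbm, BAG = predbag, XGB = predxgbTree)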
Submission = data.frame(X= test$X, Predictions=predRF)
write.csv(Submission, "Submission.csv", row.names=FALSE)