The goal of this project is to predict the manner in which 6 participants who regularly use fitness-tracking devices performed barbell lifts: correctly or incorrectly. The outcome variable of interest is ‘classe’. We use data from accelerometers on the belt, forearm, arm, and dumbbell. To determine the best model for predicting this variable, we try four methods: decision trees, random forests, boosting, and support vector machines. We also test a combination of the models.
This analysis determined that random forests was the best prediction model, with about 99% accuracy on the validation set and an estimated out-of-sample error of around 1%.
read.csv("/Users/difrankaj/Desktop/pml-testing.csv") ->testing
read.csv("/Users/difrankaj/Desktop/pml-training.csv") ->training
library("caret")
## Loading required package: ggplot2
## Loading required package: lattice
library(rattle)
## Loading required package: tibble
## Loading required package: bitops
## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
dim(training)
## [1] 19622 160
# There are 160 variables. There are 20 observations in the test set and 19,622 in the training set.
There are many NA values in the data set; we first want to remove these. Then we will look for near-zero variance variables and remove those as well.
#Showing NA values
sum(is.na(training))
## [1] 1287472
#Over one million NA values!
#Removing every column that contains NA values (in this data set, such columns are mostly NA)
condition <- (colSums(is.na(training)) == 0)
training <- training[, condition]
testing <- testing[, condition]
sum(is.na(training))
## [1] 0
#This removed all NAs. We now have 93 variables to work with.
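The filter above keeps only columns with no missing values at all, which is stricter than dropping columns that are mostly NA. A minimal sketch on a toy data frame (the data and the names `toy`, `keep`, and `na_share` are made up for illustration) shows both the strict rule and the per-column NA share that a threshold-based rule would use:

```r
# Toy data frame (hypothetical values) with one column that contains NAs
toy <- data.frame(
  a = c(1, 2, 3),
  b = c(NA, NA, 5),
  c = c("x", "y", "z")
)

# TRUE for columns that have no missing values at all
keep <- colSums(is.na(toy)) == 0

# Fraction of missing values per column, useful for a threshold-based rule
na_share <- colMeans(is.na(toy))

# drop = FALSE keeps a data frame even if only one column survives
toy_clean <- toy[, keep, drop = FALSE]
names(toy_clean)
```

Computing the keep/drop decision on the training data and applying the same logical vector to the test data, as done above, keeps the two sets aligned column for column.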
#Removing the first 7 columns, which are not relevant predictors of the outcome. Now, 86 variables.
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]
Removing near-zero variance variables
near <- nearZeroVar(training)
training <- training[, -near]
testing <- testing[, -near]
#We now have 53 variables.
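One thing to watch with the `[, -near]` pattern: if no column is flagged, negative indexing with an empty vector selects zero columns rather than all of them. The sketch below uses base R only and assumes that an empty integer vector stands in for a result with nothing flagged; `toy` and `near_toy` are hypothetical names.

```r
# Toy data frame (hypothetical) in which no column would be flagged
toy <- data.frame(x = c(1, 2, 3), y = c(4, 5, 6))

# Stand-in for a result with nothing flagged: an empty index vector
near_toy <- integer(0)

# Negative indexing with an empty vector selects zero columns
ncol(toy[, -near_toy])

# Guarded version: only drop columns when something was flagged
if (length(near_toy) > 0) {
  toy <- toy[, -near_toy, drop = FALSE]
}
ncol(toy)
```

In this report columns were flagged, so the unguarded code works as intended; the guard only matters when the same script is reused on other data.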
Cross validation is handled by caret during training. In addition, we split the training dataset into a training subset (70%) and a separate validation subset (30%), which is used to estimate the out-of-sample error.
#Create data partition using the caret package
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
train <- training[inTrain, ]
validation <- training[-inTrain, ]
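The partition above is random, so the exact rows (and the accuracies further down) change from run to run unless a seed is set first. As a rough base R sketch of the idea behind a class-stratified 70/30 split, on a made-up outcome vector (the seed value and all names here are assumptions, and this is not caret's internal implementation):

```r
set.seed(123)  # hypothetical seed; any fixed value makes the split repeatable

# Toy outcome with unbalanced classes (hypothetical data)
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 70% of the row indices within each class
idx_by_class <- split(seq_along(y), y)
in_train <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
}))

train_y <- y[in_train]
valid_y <- y[-in_train]

table(train_y)
table(valid_y)
```

Sampling within each class keeps the class proportions roughly the same in both subsets, which matters here because class A is more common than the others.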
Control for cross validation
control <- trainControl(method="cv", number=5, verboseIter=FALSE)
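To make the 5-fold setting concrete, here is a small base R sketch of how rows can be assigned to folds and rotated between fitting and assessment. It is only an illustration on 20 made-up rows; the seed and variable names are assumptions, and caret's own fold creation may differ (for example by balancing classes across folds).

```r
set.seed(1)  # hypothetical seed for a repeatable fold assignment
n <- 20      # toy number of rows
k <- 5       # number of folds, matching number = 5 above

# Shuffle fold labels 1..k so each fold gets n / k rows
fold <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  held_out <- which(fold == i)   # rows used for assessment in this round
  fit_rows <- which(fold != i)   # rows used for fitting in this round
  cat("fold", i, ": fit on", length(fit_rows),
      "rows, assess on", length(held_out), "rows\n")
}
```

Each row is held out exactly once, so the resampled accuracy is averaged over five fits per candidate model.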
Basic Idea (decision trees): Split the observations into groups based on a decision rule on one variable, evaluate the homogeneity within each group, and split again if necessary.
dectrees<- train(classe~., method= "rpart", trControl = control, data=train)
#Visual:
fancyRpartPlot(dectrees$finalModel)
Testing prediction
DTpred<- predict(dectrees, validation)
confusionMatrix(DTpred, factor(validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1521 474 477 426 153
## B 31 383 32 188 137
## C 119 282 517 350 267
## D 0 0 0 0 0
## E 3 0 0 0 525
##
## Overall Statistics
##
## Accuracy : 0.5006
## 95% CI : (0.4877, 0.5135)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3474
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9086 0.33626 0.50390 0.0000 0.48521
## Specificity 0.6367 0.91825 0.79049 1.0000 0.99938
## Pos Pred Value 0.4985 0.49676 0.33681 NaN 0.99432
## Neg Pred Value 0.9460 0.85217 0.88299 0.8362 0.89602
## Prevalence 0.2845 0.19354 0.17434 0.1638 0.18386
## Detection Rate 0.2585 0.06508 0.08785 0.0000 0.08921
## Detection Prevalence 0.5184 0.13101 0.26083 0.0000 0.08972
## Balanced Accuracy 0.7726 0.62725 0.64720 0.5000 0.74229
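This model is approximately 50% accurate. That is better than the no-information rate of about 28% (always guessing the most common class, A), but still poor, and the model never predicts class D. The out-of-sample error is then around 0.50.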
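The accuracy and no-information rate reported above can be recomputed by hand from the confusion matrix. A minimal base R sketch, with the counts copied from the decision tree output (the variable names are ours):

```r
# Confusion matrix from the decision tree output
# (rows = prediction, columns = reference)
cm <- matrix(
  c(1521, 474, 477, 426, 153,
      31, 383,  32, 188, 137,
     119, 282, 517, 350, 267,
       0,   0,   0,   0,   0,
       3,   0,   0,   0, 525),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

accuracy <- sum(diag(cm)) / sum(cm)     # share of correct predictions
nir      <- max(colSums(cm)) / sum(cm)  # always guessing the largest class
oos_err  <- 1 - accuracy                # estimated out-of-sample error

round(c(accuracy = accuracy, no_information_rate = nir, error = oos_err), 4)
```

The same three lines work for the other confusion matrices in this report once their counts are substituted.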
Basic Idea (random forests): Draw bootstrapped samples of the observations, and at each tree split consider only a random subset of the variables. Grow many trees this way, then vote/average across these trees to create the final prediction model.
rf<- train(classe~., method="rf", trControl= control, data=train)
Testing prediction
RFpred<- predict(rf, validation)
confusionMatrix(RFpred, factor(validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 9 0 0 0
## B 2 1129 3 0 0
## C 0 1 1022 17 3
## D 0 0 1 947 3
## E 1 0 0 0 1076
##
## Overall Statistics
##
## Accuracy : 0.9932
## 95% CI : (0.9908, 0.9951)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9914
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9912 0.9961 0.9824 0.9945
## Specificity 0.9979 0.9989 0.9957 0.9992 0.9998
## Pos Pred Value 0.9946 0.9956 0.9799 0.9958 0.9991
## Neg Pred Value 0.9993 0.9979 0.9992 0.9966 0.9988
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1918 0.1737 0.1609 0.1828
## Detection Prevalence 0.2855 0.1927 0.1772 0.1616 0.1830
## Balanced Accuracy 0.9980 0.9951 0.9959 0.9908 0.9971
This model is approximately 99% accurate (0.9932 on the validation set). The out-of-sample error is then around 0.01.
Basic Idea (boosting): Fit many weak learners in sequence, each one concentrating on the observations that the previous ones predicted poorly, then combine them with a weighted vote/average into a final prediction model.
#Fit with caret's gbm method (call inferred from the log output below and the pattern of the other models)
boost<- train(classe~., method="gbm", trControl=control, data=train)
## Iter TrainDeviance ValidDeviance StepSize Improve
## 1 1.6094 nan 0.1000 0.1285
## 2 1.5233 nan 0.1000 0.0856
## 3 1.4663 nan 0.1000 0.0702
## 4 1.4204 nan 0.1000 0.0528
## 5 1.3860 nan 0.1000 0.0431
## 6 1.3582 nan 0.1000 0.0454
## 7 1.3286 nan 0.1000 0.0377
## 8 1.3046 nan 0.1000 0.0329
## 9 1.2838 nan 0.1000 0.0364
## 10 1.2601 nan 0.1000 0.0280
## 20 1.1034 nan 0.1000 0.0162
## 40 0.9318 nan 0.1000 0.0096
## 60 0.8245 nan 0.1000 0.0060
## 80 0.7435 nan 0.1000 0.0054
## 100 0.6796 nan 0.1000 0.0040
## 120 0.6288 nan 0.1000 0.0040
## 140 0.5836 nan 0.1000 0.0024
## 150 0.5641 nan 0.1000 0.0019
##
## (similar per-iteration output for the remaining resampling fits omitted)
Testing prediction
boostpred<- predict(boost, validation)
confusionMatrix(boostpred, factor(validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1644 50 0 0 0
## B 20 1063 23 1 6
## C 4 22 986 22 7
## D 5 3 16 929 13
## E 1 1 1 12 1056
##
## Overall Statistics
##
## Accuracy : 0.9648
## 95% CI : (0.9598, 0.9694)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9555
##
## Mcnemar's Test P-Value : 0.000279
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9821 0.9333 0.9610 0.9637 0.9760
## Specificity 0.9881 0.9895 0.9887 0.9925 0.9969
## Pos Pred Value 0.9705 0.9551 0.9472 0.9617 0.9860
## Neg Pred Value 0.9928 0.9841 0.9917 0.9929 0.9946
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2794 0.1806 0.1675 0.1579 0.1794
## Detection Prevalence 0.2879 0.1891 0.1769 0.1641 0.1820
## Balanced Accuracy 0.9851 0.9614 0.9748 0.9781 0.9864
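This model is approximately 96% accurate. The out-of-sample error is then around 0.04.
Basic Idea (support vector machines): Find the separating boundary that maximizes the margin between classes; the boundary is defined by the support vectors. Kernel versions first map the data to a high-dimensional feature space, while the svmLinear method used here fits a linear boundary.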
svm<- train(classe~., method="svmLinear", trControl=control, data=train)
Testing prediction
svmpred<- predict(svm, validation)
confusionMatrix(svmpred, factor(validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1525 145 79 60 62
## B 40 821 100 35 132
## C 52 68 784 107 59
## D 47 19 27 717 50
## E 10 86 36 45 779
##
## Overall Statistics
##
## Accuracy : 0.7861
## 95% CI : (0.7754, 0.7965)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.7282
##
## Mcnemar's Test P-Value : < 2.2e-16
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9110 0.7208 0.7641 0.7438 0.7200
## Specificity 0.9178 0.9353 0.9411 0.9709 0.9631
## Pos Pred Value 0.8151 0.7278 0.7327 0.8337 0.8149
## Neg Pred Value 0.9629 0.9332 0.9497 0.9508 0.9385
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2591 0.1395 0.1332 0.1218 0.1324
## Detection Prevalence 0.3179 0.1917 0.1818 0.1461 0.1624
## Balanced Accuracy 0.9144 0.8281 0.8526 0.8574 0.8416
This model is approximately 79% accurate. The out of sample error is then around 0.21.
Because the first model tested, decision trees, performed poorly (about 50% accuracy), we will only try to combine models 2-4.
Method: Model stacking
#First, combine the predictions from models 2-4 into one dataframe
combdf<- data.frame(RFpred, boostpred, svmpred, classe=validation$classe)
combFit<- train(classe~., method= "gam", data=combdf)
## Loading required package: mgcv
## Loading required package: nlme
## This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.
#Checking accuracy (predicting from combdf, which holds the stacked predictors):
combpred<- predict(combFit, combdf)
confusionMatrix(combpred, factor(validation$classe))
## Confusion Matrix and Statistics
##
## Reference
## Prediction A B C D E
## A 1671 9 0 0 0
## B 3 1130 1026 964 1082
## C 0 0 0 0 0
## D 0 0 0 0 0
## E 0 0 0 0 0
##
## Overall Statistics
##
## Accuracy : 0.476
## 95% CI : (0.4631, 0.4888)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.3286
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: A Class: B Class: C Class: D Class: E
## Sensitivity 0.9982 0.9921 0.0000 0.0000 0.0000
## Specificity 0.9979 0.3521 1.0000 1.0000 1.0000
## Pos Pred Value 0.9946 0.2687 NaN NaN NaN
## Neg Pred Value 0.9993 0.9946 0.8257 0.8362 0.8161
## Prevalence 0.2845 0.1935 0.1743 0.1638 0.1839
## Detection Rate 0.2839 0.1920 0.0000 0.0000 0.0000
## Detection Prevalence 0.2855 0.7145 0.0000 0.0000 0.0000
## Balanced Accuracy 0.9980 0.6721 0.5000 0.5000 0.5000
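This model is extremely inaccurate at approximately 48%. The confusion matrix shows that it only ever predicts classes A and B, which suggests that the "gam" method, as used here, treats the five-level outcome as a two-class problem and is not a suitable combiner for these predictions.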
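One simpler way to combine several multi-class predictions is a majority vote. This is a different technique from the stacking attempted above, shown only as a sketch on made-up predictions; the helper name `majority_vote` and the tie rule (fall back to the first model) are assumptions for illustration.

```r
# Toy predictions from three models (hypothetical values, not the report's results)
lv <- c("A", "B", "C", "D", "E")
p1 <- factor(c("A", "B", "C", "D", "E"), levels = lv)
p2 <- factor(c("A", "B", "B", "D", "A"), levels = lv)
p3 <- factor(c("B", "B", "C", "E", "C"), levels = lv)

votes <- data.frame(p1, p2, p3)

# Most frequent label per row; if all three disagree, use the first model
majority_vote <- function(row) {
  counts <- table(factor(row, levels = lv))
  if (max(counts) == 1) row[[1]] else names(counts)[which.max(counts)]
}

# as.matrix() turns the factor columns into a character matrix for apply()
combined <- factor(apply(as.matrix(votes), 1, majority_vote), levels = lv)
combined
```

With the report's objects, the three columns would be RFpred, boostpred, and svmpred; putting the strongest model first makes it the tie-breaker.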
The best model is random forests, with 99% accuracy. Now we will predict the classe (5 levels) on the test set.
plot(rf)
testpred<- predict(rf, testing)
testpred
## [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E