Machine Learning Final Project

Executive Summary

The goal of this project is to predict the manner in which six participants who regularly use fitness-tracking devices performed barbell lifts: correctly or incorrectly. The outcome variable of interest is ‘classe’. The predictors are data from accelerometers on the belt, forearm, arm, and dumbbell. To determine the best model for predicting this variable, we try four methods: decision trees, random forests, boosting, and support vector machines. We also test a stacked combination of the three strongest models.

This analysis determined that random forests was the best prediction model, with about 99% accuracy on the validation set and an out-of-sample error of around 1%.

Exploratory Data Analysis

read.csv("/Users/difrankaj/Desktop/pml-testing.csv") ->testing
read.csv("/Users/difrankaj/Desktop/pml-training.csv") ->training
library("caret")

## Loading required package: ggplot2

## Loading required package: lattice

library(rattle)

## Loading required package: tibble

## Loading required package: bitops

## Rattle: A free graphical interface for data science with R.
## Version 5.5.1 Copyright (c) 2006-2021 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.

dim(training)

## [1] 19622   160

# There are 160 variables. There are 20 observations in the test set and 19,622 in the training set.
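
The `read.csv` calls above treat only the literal string `NA` as missing. If the files also encode missing values as empty cells or spreadsheet error tokens such as `#DIV/0!` (an assumption about the raw data, not something shown above), passing `na.strings` marks them as `NA` at load time. A minimal sketch on an inline toy CSV with made-up column values:

```r
# Toy CSV with three styles of missing value: NA, an empty cell, and "#DIV/0!"
csv_text <- "id,roll_belt,kurtosis_roll_belt
1,1.41,NA
2,1.42,
3,1.48,#DIV/0!"

# Default behaviour: only "NA" counts as missing in a text column
# (read.csv sets comment.char = "", so '#' is read as ordinary data)
default_read <- read.csv(text = csv_text)

# Treat empty cells and "#DIV/0!" as missing too
strict_read <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

sum(is.na(default_read$kurtosis_roll_belt))
sum(is.na(strict_read$kurtosis_roll_belt))
```

With the stricter setting, columns that are almost entirely empty would be caught by the NA filter rather than being left for the near-zero variance step.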

Data Munging

There are many NA values in the data set, so we first remove the columns that contain them. Then we look for near-zero variance variables and remove those as well.

#Showing NA values
sum(is.na(training))

## [1] 1287472

#Over one million NA values!

#Removing columns that contain NA values (the condition keeps only columns with zero NAs)
condition <- (colSums(is.na(training)) == 0)
training <- training[, condition]
testing <- testing[, condition]
sum(is.na(training))

## [1] 0

#This removed all NAs. We now have 93 variables to work with.
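
The filter above keeps only columns with no missing values at all. A more tolerant variant, sketched here on a made-up data frame, drops a column only when its share of NAs exceeds a threshold. The 0.5 cutoff and the column names are illustrative assumptions, not values from this analysis.

```r
# Toy data frame: one complete column, one with a single NA, one mostly NA
toy <- data.frame(
  complete  = c(1, 2, 3, 4),
  one_gap   = c(1, NA, 3, 4),
  mostly_na = c(NA, NA, NA, 4)
)

# Share of missing values per column
na_share <- colMeans(is.na(toy))

# Strict rule used above: keep columns with zero NAs
strict_keep <- na_share == 0

# Tolerant rule: keep columns with at most 50% NAs (hypothetical cutoff)
tolerant_keep <- na_share <= 0.5

names(toy)[strict_keep]
names(toy)[tolerant_keep]
```

Whichever rule is used, the same logical vector should be applied to both the training and testing sets, as the code above does, so that the two keep identical columns.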

#Removing the first 7 columns, which are not relevant predictors of the outcome. Now, 86 variables.
training <- training[, -c(1:7)]
testing <- testing[, -c(1:7)]

Removing near-zero variance variables

near <- nearZeroVar(training)
training <- training[, -near]
testing <- testing[, -near]
#We now have 53 variables.
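
For intuition, `nearZeroVar` flags a column when its most common value is far more frequent than its second most common value and the share of distinct values is low. The base-R sketch below approximates that rule using caret's default cutoffs (`freqCut = 95/5`, `uniqueCut = 10`). It is a simplified illustration rather than caret's implementation, and the toy columns and the helper name `is_near_zero_var` are invented.

```r
# Simplified version of the near-zero-variance rule (illustration only)
is_near_zero_var <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)
  if (length(counts) <= 1) return(TRUE)            # constant column: zero variance
  freq_ratio <- counts[[1]] / counts[[2]]          # most common vs. second most common
  pct_unique <- 100 * length(counts) / length(x)   # distinct values as % of rows
  freq_ratio > freq_cut && pct_unique <= unique_cut
}

set.seed(42)  # arbitrary seed
toy <- data.frame(
  informative = rnorm(1000),                 # continuous, many distinct values
  almost_flat = c(rep(0, 990), rep(1, 10)),  # 99:1 split between two values
  constant    = rep(5, 1000)
)

flagged <- vapply(toy, is_near_zero_var, logical(1))
flagged
```

One caveat about the indexing used above: if `nearZeroVar` returns an empty vector, `training[, -near]` selects zero columns, so a guard such as `if (length(near) > 0)` is a sensible precaution.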

In order to validate the models, we subset the training dataset into a training set (70%) and an additional validation set (30%).

#Create data partition using the caret package
inTrain <- createDataPartition(y = training$classe, p = 0.7, list = FALSE)
train <- training[inTrain, ]
validation <- training[-inTrain, ]
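
`createDataPartition` samples within each level of the outcome so that class proportions are preserved, and it draws at random, so the split changes between runs unless a seed is set first. The base-R sketch below mimics the idea on the built-in `iris` data. The seed value is arbitrary and the helper name `stratified_split` is hypothetical.

```r
# Sample a fixed share of row indices within each class (hypothetical helper)
stratified_split <- function(y, p = 0.7) {
  idx_by_class <- split(seq_along(y), y)
  picked <- lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(p * length(idx)))]
  })
  sort(unlist(picked, use.names = FALSE))
}

set.seed(1234)  # arbitrary seed so the split can be reproduced
in_train <- stratified_split(iris$Species, p = 0.7)

train_part <- iris[in_train, ]
valid_part <- iris[-in_train, ]

table(train_part$Species)
table(valid_part$Species)
```

In the analysis above, the equivalent step would be a `set.seed()` call placed immediately before `createDataPartition`.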

Creating and Testing Models

Control for cross validation

control <- trainControl(method="cv", number=5, verboseIter=FALSE)
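
`trainControl(method="cv", number=5)` tells `train` to estimate performance by 5-fold cross-validation: the training rows are divided into five folds, and each fold is held out once while the model is fit on the other four. Below is a base-R sketch of the fold bookkeeping, with `iris` as stand-in data. caret's own folds are stratified by class, which this simple version is not.

```r
set.seed(99)  # arbitrary seed
n <- nrow(iris)
k <- 5

# Assign each row to one of k folds at random, with near-equal fold sizes
fold_id <- sample(rep(seq_len(k), length.out = n))

for (fold in seq_len(k)) {
  held_out <- which(fold_id == fold)
  fit_rows <- which(fold_id != fold)
  # A model would be fit on iris[fit_rows, ] and scored on iris[held_out, ]
  cat("fold", fold, ":", length(fit_rows), "fit rows,",
      length(held_out), "held-out rows\n")
}
```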

Model 1: Decision Trees

Basic Idea: Split the observations into groups using a decision rule on one variable, evaluate the homogeneity within each group, and split again if necessary.
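
For classification, `rpart` measures the homogeneity of a group with Gini impurity by default: 0 for a pure group, rising as the classes mix. A minimal sketch with made-up labels (the helper names are hypothetical):

```r
# Gini impurity: 1 minus the sum of squared class proportions
gini <- function(labels) {
  p <- prop.table(table(labels))
  1 - sum(p^2)
}

pure_group  <- c("A", "A", "A", "A")
mixed_group <- c("A", "A", "B", "B")

gini(pure_group)   # 0: every observation has the same class
gini(mixed_group)  # 0.5: two classes in equal shares

# A candidate split is scored by the size-weighted impurity of its two children
split_impurity <- function(left, right) {
  n <- length(left) + length(right)
  length(left) / n * gini(left) + length(right) / n * gini(right)
}

split_impurity(c("A", "A", "A"), c("B", "B", "A"))
```

The tree-growing algorithm prefers the split with the lowest weighted impurity, then repeats the search inside each child group.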

dectrees<- train(classe~., method= "rpart", trControl = control, data=train)
#Visual: 
fancyRpartPlot(dectrees$finalModel)

Testing prediction

DTpred<- predict(dectrees, validation)
confusionMatrix(DTpred, factor(validation$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1521  474  477  426  153
##          B   31  383   32  188  137
##          C  119  282  517  350  267
##          D    0    0    0    0    0
##          E    3    0    0    0  525
## 
## Overall Statistics
##                                           
##                Accuracy : 0.5006          
##                  95% CI : (0.4877, 0.5135)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3474          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9086  0.33626  0.50390   0.0000  0.48521
## Specificity            0.6367  0.91825  0.79049   1.0000  0.99938
## Pos Pred Value         0.4985  0.49676  0.33681      NaN  0.99432
## Neg Pred Value         0.9460  0.85217  0.88299   0.8362  0.89602
## Prevalence             0.2845  0.19354  0.17434   0.1638  0.18386
## Detection Rate         0.2585  0.06508  0.08785   0.0000  0.08921
## Detection Prevalence   0.5184  0.13101  0.26083   0.0000  0.08972
## Balanced Accuracy      0.7726  0.62725  0.64720   0.5000  0.74229

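This model is approximately 50% accurate. That is better than the no-information rate of about 28% (always guessing the most common class), but far too low to be useful, and the tree never predicts class D. The out-of-sample error is then around 0.50.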
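
The headline figures can be recomputed directly from the confusion matrix printed above: accuracy is the diagonal total divided by all validation cases, and the no-information rate is the share of the most common true class. The counts below are copied from the decision-tree output.

```r
classes <- c("A", "B", "C", "D", "E")

# Rows are predictions, columns are the true (reference) classes
cm <- matrix(
  c(1521, 474, 477, 426, 153,
      31, 383,  32, 188, 137,
     119, 282, 517, 350, 267,
       0,   0,   0,   0,   0,
       3,   0,   0,   0, 525),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = classes, Reference = classes)
)

accuracy   <- sum(diag(cm)) / sum(cm)
error_rate <- 1 - accuracy
no_info    <- max(colSums(cm)) / sum(cm)   # always guessing the largest class

round(c(accuracy = accuracy, error = error_rate, no_information = no_info), 4)
```

The same three lines apply unchanged to the confusion matrices of the other models.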

Model 2: Random Forests

Basic Idea: Draw bootstrapped samples of the training data and grow a tree on each one, considering only a random subset of the variables at each split. Then vote or average across the trees to produce the final prediction.
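
Two ingredients of that description are sketched below in base R: a bootstrap sample of the rows (the rows left out are the "out-of-bag" cases) and a random subset of predictors tried at a split. The row count is invented. The 52 predictors correspond to the 53 remaining columns minus the outcome. `floor(sqrt(p))` is the `randomForest` default for classification, although caret tunes this value (`mtry`) itself.

```r
set.seed(2023)  # arbitrary seed

n_rows <- 1000  # invented number of training rows
n_pred <- 52    # predictors remaining after cleaning

# Bootstrap sample: draw n_rows row indices with replacement
boot_rows <- sample.int(n_rows, size = n_rows, replace = TRUE)

# Rows never drawn are "out-of-bag" and act as a built-in test set for that tree
oob_rows <- setdiff(seq_len(n_rows), boot_rows)
oob_share <- length(oob_rows) / n_rows   # tends towards exp(-1) for large samples
oob_share

# At each split only a random subset of predictors is considered
mtry <- floor(sqrt(n_pred))
candidate_predictors <- sample.int(n_pred, size = mtry)
candidate_predictors
```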

rf<- train(classe~., method="rf", trControl= control, data=train)

Testing prediction

RFpred<- predict(rf, validation)
confusionMatrix(RFpred, factor(validation$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    9    0    0    0
##          B    2 1129    3    0    0
##          C    0    1 1022   17    3
##          D    0    0    1  947    3
##          E    1    0    0    0 1076
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9932          
##                  95% CI : (0.9908, 0.9951)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9914          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9912   0.9961   0.9824   0.9945
## Specificity            0.9979   0.9989   0.9957   0.9992   0.9998
## Pos Pred Value         0.9946   0.9956   0.9799   0.9958   0.9991
## Neg Pred Value         0.9993   0.9979   0.9992   0.9966   0.9988
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1918   0.1737   0.1609   0.1828
## Detection Prevalence   0.2855   0.1927   0.1772   0.1616   0.1830
## Balanced Accuracy      0.9980   0.9951   0.9959   0.9908   0.9971

This model is approximately 99% accurate! The out of sample error is then around 0.01.

Model 3: Gradient Boosted Trees

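Basic Idea: Fit a sequence of small trees, each one trained to correct the errors left by the trees before it, then combine them in a weighted sum to form the final prediction model.

#Assumed training call (not shown in the source), following the pattern of the other models
boost<- train(classe~., method="gbm", trControl= control, data=train)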
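
To make the sequential idea concrete, here is a toy gradient-boosting loop in base R for a one-dimensional regression problem, using single-split trees (stumps). It is a teaching sketch, not what `gbm` does internally for a five-class outcome. The shrinkage of 0.1 and the 150 rounds echo caret's default `gbm` tuning grid, the data are simulated, and the helper names are hypothetical.

```r
set.seed(1)  # arbitrary seed
x <- seq(0, 1, length.out = 200)
y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

# Fit a one-split tree ("stump") to the current residuals by least squares
fit_stump <- function(x, r) {
  thresholds <- head(sort(unique(x)), -1)  # drop the max so both sides are non-empty
  sse <- vapply(thresholds, function(thr) {
    left <- r[x <= thr]
    right <- r[x > thr]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  }, numeric(1))
  best <- thresholds[which.min(sse)]
  list(threshold = best, left = mean(r[x <= best]), right = mean(r[x > best]))
}

predict_stump <- function(stump, x) {
  ifelse(x <= stump$threshold, stump$left, stump$right)
}

shrinkage <- 0.1
n_rounds <- 150
pred <- rep(mean(y), length(y))   # start from a constant prediction
mse <- numeric(n_rounds)

for (m in seq_len(n_rounds)) {
  residual <- y - pred                                # what the model still gets wrong
  stump <- fit_stump(x, residual)                     # new learner targets those errors
  pred <- pred + shrinkage * predict_stump(stump, x)  # add a damped correction
  mse[m] <- mean((y - pred)^2)
}

round(mse[c(1, 50, 150)], 4)
```

The training error falls with every round because each stump is fit to whatever the previous rounds left unexplained. The shrinkage factor slows that descent, which is why many rounds are needed.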

Testing prediction

boostpred<- predict(boost, validation)
confusionMatrix(boostpred, factor(validation$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1644   50    0    0    0
##          B   20 1063   23    1    6
##          C    4   22  986   22    7
##          D    5    3   16  929   13
##          E    1    1    1   12 1056
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9648          
##                  95% CI : (0.9598, 0.9694)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9555          
##                                           
##  Mcnemar's Test P-Value : 0.000279        
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9821   0.9333   0.9610   0.9637   0.9760
## Specificity            0.9881   0.9895   0.9887   0.9925   0.9969
## Pos Pred Value         0.9705   0.9551   0.9472   0.9617   0.9860
## Neg Pred Value         0.9928   0.9841   0.9917   0.9929   0.9946
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2794   0.1806   0.1675   0.1579   0.1794
## Detection Prevalence   0.2879   0.1891   0.1769   0.1641   0.1820
## Balanced Accuracy      0.9851   0.9614   0.9748   0.9781   0.9864

This model is approximately 96% accurate! The out of sample error is then around 0.04.

Model 4: Support Vector Machine

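Basic Idea: Find the decision boundary that maximizes the margin between classes; the boundary is determined by the support vectors, the data points closest to it. Kernel SVMs first map the data to a high-dimensional feature space so that the classes become separable, but the svmLinear method used here fits a linear boundary in the original feature space.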

svm<- train(classe~., method="svmLinear", trControl=control, data=train)

Testing prediction

svmpred<- predict(svm, validation)
confusionMatrix(svmpred, factor(validation$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1525  145   79   60   62
##          B   40  821  100   35  132
##          C   52   68  784  107   59
##          D   47   19   27  717   50
##          E   10   86   36   45  779
## 
## Overall Statistics
##                                           
##                Accuracy : 0.7861          
##                  95% CI : (0.7754, 0.7965)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.7282          
##                                           
##  Mcnemar's Test P-Value : < 2.2e-16       
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9110   0.7208   0.7641   0.7438   0.7200
## Specificity            0.9178   0.9353   0.9411   0.9709   0.9631
## Pos Pred Value         0.8151   0.7278   0.7327   0.8337   0.8149
## Neg Pred Value         0.9629   0.9332   0.9497   0.9508   0.9385
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2591   0.1395   0.1332   0.1218   0.1324
## Detection Prevalence   0.3179   0.1917   0.1818   0.1461   0.1624
## Balanced Accuracy      0.9144   0.8281   0.8526   0.8574   0.8416

This model is approximately 79% accurate. The out of sample error is then around 0.21.

Model 5: Combining predictors by averaging

Because the first model tested, decision trees, performed far worse than the others, we will only try to combine models 2-4.

Method: Model stacking

#First, combine the predictions from models 2-4 into one dataframe
combdf<- data.frame(RFpred, boostpred, svmpred, classe=validation$classe)
combFit<- train(classe~., method= "gam", data=combdf)

## Loading required package: mgcv

## Loading required package: nlme

## This is mgcv 1.8-40. For overview type 'help("mgcv-package")'.

#Checking accuracy:
combpred<- predict(combFit, validation)
confusionMatrix(combpred, factor(validation$classe))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1671    9    0    0    0
##          B    3 1130 1026  964 1082
##          C    0    0    0    0    0
##          D    0    0    0    0    0
##          E    0    0    0    0    0
## 
## Overall Statistics
##                                           
##                Accuracy : 0.476           
##                  95% CI : (0.4631, 0.4888)
##     No Information Rate : 0.2845          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.3286          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9982   0.9921   0.0000   0.0000   0.0000
## Specificity            0.9979   0.3521   1.0000   1.0000   1.0000
## Pos Pred Value         0.9946   0.2687      NaN      NaN      NaN
## Neg Pred Value         0.9993   0.9946   0.8257   0.8362   0.8161
## Prevalence             0.2845   0.1935   0.1743   0.1638   0.1839
## Detection Rate         0.2839   0.1920   0.0000   0.0000   0.0000
## Detection Prevalence   0.2855   0.7145   0.0000   0.0000   0.0000
## Balanced Accuracy      0.9980   0.6721   0.5000   0.5000   0.5000

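This model is extremely inaccurate at approximately 48%. It only ever predicts classes A and B, which suggests that the gam method is fitting a two-class model and is not suited to this five-class outcome.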
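
A combination rule that copes with all five classes is a plain majority vote across the three models. The sketch below uses made-up prediction vectors. With the objects from this analysis, the inputs would be `RFpred`, `boostpred`, and `svmpred`. Ties are broken in favour of the first model listed, which is an arbitrary choice, and the helper name `majority_vote` is hypothetical.

```r
classes <- c("A", "B", "C", "D", "E")

# Made-up predictions from three models for six cases
pred_rf    <- factor(c("A", "B", "C", "D", "E", "A"), levels = classes)
pred_boost <- factor(c("A", "B", "C", "C", "E", "B"), levels = classes)
pred_svm   <- factor(c("B", "B", "D", "C", "E", "C"), levels = classes)

majority_vote <- function(...) {
  # One column of integer class codes per model
  votes <- do.call(cbind, lapply(list(...), as.integer))
  winner <- apply(votes, 1, function(row) {
    counts <- tabulate(row, nbins = length(classes))
    top <- which(counts == max(counts))
    # On a tie, keep the first model's prediction if it is among the leaders
    if (row[1] %in% top) row[1] else top[1]
  })
  factor(classes[winner], levels = classes)
}

combined <- majority_vote(pred_rf, pred_boost, pred_svm)
combined
```

Because no meta-model is fit, a vote like this can be scored on the validation set without reusing that set for training.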

The best model is random forests, with 99% accuracy. Now we will predict the classe (5 levels) on the test set.

plot(rf)

testpred<- predict(rf, testing)
testpred

##  [1] B A B A A E D B A A B C B A E E A B B B
## Levels: A B C D E

AJD

2023-02-02
