Synopsis

One thing that people regularly do is quantify how much of a particular activity they do, but they rarely quantify how well they do it.The goal of this project is to predict the manner in which they did the exercise. The main strategy that will be followed is: after cleaning the data, three subsets will be created: one for training, one for testing the different models, and a last one for validation of the chosen model.

Data preparation

if (file.exists('preddata.csv') == FALSE){
        download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv',
                      destfile = './preddata.csv',method = 'curl')     
}
if (file.exists('quiz.csv') == FALSE) {
        download.file('https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv',
                      destfile = './quiz.csv',method = 'curl')
}

preddata <- read.csv('preddata.csv')
quiz <- read.csv('quiz.csv')
library(caret)
library(rattle)
library(ElemStatLearn)
library(randomForest)
library(e1071)
library(h2o)
set.seed(68490)

We then proceed to modify the class of certain variables into factor and POSIXct.

preddata$classe <- factor(preddata$classe)
preddata$user_name <- factor(preddata$user_name)
preddata$cvtd_timestamp <- as.POSIXct(preddata$cvtd_timestamp,format='%d/%m/%Y %H:%M')

We drop the variable X, which merely indicates the number of the observation. Likewise, certain variables that happen to have a considerable number of NA are removed.

preddata <- preddata[,2:dim(preddata)[2]]
navles <- apply(preddata,2,function(x) any(is.na(x)))
preddata <- preddata[,!navles]

Now we create the data subsets.

inTrain <- createDataPartition(y=preddata$classe,p=.6,list=FALSE)
training <- preddata[inTrain,]
testing <- preddata[-inTrain,]
inTrain <- createDataPartition(y=testing$classe,list=FALSE)
validation <- preddata[inTrain,]
testing <- preddata[-inTrain,]

Finally, we remove those variables that have a strikingly low variance, and that therefore have barely no prediction power. For this purpose we use the training subset.

novar <- nearZeroVar(training)
training <- training[,-novar]
testing <- testing[,-novar]
validation <- validation[,-novar]

Model prediction

Decision Tree

modtree <- train(classe~.,method='rpart',data=training)
fancyRpartPlot(modtree$finalModel)

confusionMatrix(testing$classe,predict(modtree,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2512   42  206    0    5
##          B 1081 1005  603    0    0
##          C 1586  108 1728    0    0
##          D 1436  568 1212    0    0
##          E  524  486  966    0 1631
## 
## Overall Statistics
##                                           
##                Accuracy : 0.438           
##                  95% CI : (0.4302, 0.4458)
##     No Information Rate : 0.4547          
##     P-Value [Acc > NIR] : 1               
##                                           
##                   Kappa : 0.3031          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.3519  0.45496   0.3665       NA   0.9969
## Specificity            0.9704  0.87517   0.8458   0.7951   0.8595
## Pos Pred Value         0.9085  0.37374   0.5050       NA   0.4522
## Neg Pred Value         0.6423  0.90746   0.7567       NA   0.9996
## Prevalence             0.4547  0.14071   0.3003   0.0000   0.1042
## Detection Rate         0.1600  0.06402   0.1101   0.0000   0.1039
## Detection Prevalence   0.1761  0.17128   0.2180   0.2049   0.2298
## Balanced Accuracy      0.6612  0.66506   0.6061       NA   0.9282

Support Vector Machine

modsvm <- svm(classe~.,data=training)
confusionMatrix(testing$classe,predict(modsvm,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2727    8   29    0    1
##          B  218 2385   71    5   10
##          C    7  106 3270   34    5
##          D    6    2  334 2872    2
##          E    1   15   81   98 3412
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9342         
##                  95% CI : (0.9302, 0.938)
##     No Information Rate : 0.2411         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9175         
##                                          
##  Mcnemar's Test P-Value : < 2.2e-16      
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9216   0.9479   0.8639   0.9545   0.9948
## Specificity            0.9970   0.9769   0.9872   0.9729   0.9841
## Pos Pred Value         0.9863   0.8869   0.9556   0.8930   0.9459
## Neg Pred Value         0.9821   0.9899   0.9581   0.9890   0.9985
## Prevalence             0.1885   0.1603   0.2411   0.1917   0.2185
## Detection Rate         0.1737   0.1519   0.2083   0.1829   0.2173
## Detection Prevalence   0.1761   0.1713   0.2180   0.2049   0.2298
## Balanced Accuracy      0.9593   0.9624   0.9256   0.9637   0.9894

Gradient Boosted Trees

h2o.init()
predictors <- colnames(training)[1:57]
response <- colnames(training)[58]
trainsp <- as.h2o(training)
testsp <- as.h2o(testing)
modgbm <- h2o.gbm(x=predictors,y=response,training_frame = trainsp,
                  validation_frame = testsp,ntrees=100,distribution='multinomial')
predgbm <- predict(modgbm,testsp)
perf <- h2o.performance(modgbm,testsp)
h2o.confusionMatrix(modgbm,testsp)
## Confusion Matrix: Row labels: Actual class; Column labels: Predicted class
##           A    B    C    D    E  Error         Rate
## A      2765    0    0    0    0 0.0000 =  0 / 2.765
## B         2 2687    0    0    0 0.0007 =  2 / 2.689
## C         0    2 3420    0    0 0.0006 =  2 / 3.422
## D         0    0    0 3216    0 0.0000 =  0 / 3.216
## E         0    0    0    0 3607 0.0000 =  0 / 3.607
## Totals 2767 2689 3420 3216 3607 0.0003 = 4 / 15.699
1-0.0008 #Accuracy
## [1] 0.9992

Random Forest

modrf <- randomForest(classe~.,data=training)
varImpPlot(modrf)

confusionMatrix(testing$classe,predict(modrf,testing))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2765    0    0    0    0
##          B    0 2689    0    0    0
##          C    0    5 3417    0    0
##          D    0    0    1 3214    1
##          E    0    0    0    1 3606
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9995         
##                  95% CI : (0.999, 0.9998)
##     No Information Rate : 0.2298         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.9994         
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9981   0.9997   0.9997   0.9997
## Specificity            1.0000   1.0000   0.9996   0.9998   0.9999
## Pos Pred Value         1.0000   1.0000   0.9985   0.9994   0.9997
## Neg Pred Value         1.0000   0.9996   0.9999   0.9999   0.9999
## Prevalence             0.1761   0.1716   0.2177   0.2048   0.2298
## Detection Rate         0.1761   0.1713   0.2177   0.2047   0.2297
## Detection Prevalence   0.1761   0.1713   0.2180   0.2049   0.2298
## Balanced Accuracy      1.0000   0.9991   0.9997   0.9998   0.9998

Chosen Model And Validation

The model that performs the best, in terms of accuracy, is the random forest. We proceed then to apply it to the validation subset.

conf<-confusionMatrix(validation$classe,predict(modrf,validation))
ac <- conf$overall[1]
conf
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 2815    0    0    0    0
##          B    1 1107    0    0    0
##          C    0    0    0    0    0
##          D    0    0    0    0    0
##          E    0    0    0    0    0
## 
## Overall Statistics
##                                      
##                Accuracy : 0.9997     
##                  95% CI : (0.9986, 1)
##     No Information Rate : 0.7178     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 0.9994     
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            0.9996   1.0000       NA       NA       NA
## Specificity            1.0000   0.9996        1        1        1
## Pos Pred Value         1.0000   0.9991       NA       NA       NA
## Neg Pred Value         0.9991   1.0000       NA       NA       NA
## Prevalence             0.7178   0.2822        0        0        0
## Detection Rate         0.7176   0.2822        0        0        0
## Detection Prevalence   0.7176   0.2824        0        0        0
## Balanced Accuracy      0.9998   0.9998       NA       NA       NA

Using validation data, the resulting accuracy of the random forest model is 0.9997451, which indicates a really precise prediction. The associated expected out of sample error is, hence, 2.5490696^{-4}.