library(caret)

## Warning: package 'caret' was built under R version 3.3.3

## Loading required package: lattice

## Loading required package: ggplot2

## Warning: package 'ggplot2' was built under R version 3.3.3

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.3.3

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ggplot2)
library(plotly)

## Warning: package 'plotly' was built under R version 3.3.3

## 
## Attaching package: 'plotly'

## The following object is masked from 'package:ggplot2':
## 
##     last_plot

## The following object is masked from 'package:stats':
## 
##     filter

## The following object is masked from 'package:graphics':
## 
##     layout

library(tree)

## Warning: package 'tree' was built under R version 3.3.3

library(rpart)
library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:dplyr':
## 
##     combine

## The following object is masked from 'package:ggplot2':
## 
##     margin

Getting and Loading the data

Cleaning and tidying the sets before training the model. Not all of the variables (columns) are useful, so some of them are excluded from the analysis. Both sets start with 160 variables; the testing set has only 20 observations. Some of the columns only carry identification information and will have very little impact on the prediction.

As we can see, the 19622 observations are unchanged and we are left with 53 variables in the training set, and 20 observations with 53 variables in the testing set.
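The column filtering described above can be sketched on a toy data frame. This is only an illustration: the column names, the list of identification columns, and the rule of dropping columns that are more than half missing are all assumptions, not the exact rules used in this report.

```r
# Toy stand-in for the raw data; every column name here is hypothetical.
raw <- data.frame(
  X             = 1:6,
  user_name     = c("a", "b", "c", "a", "b", "c"),
  roll_belt     = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.5),
  max_roll_belt = c(NA, NA, NA, NA, NA, 2.1),
  classe        = factor(c("A", "B", "C", "A", "B", "C"))
)

# Assumption: drop columns in which more than half of the values are missing.
na_share <- colMeans(is.na(raw))
kept <- raw[, na_share <= 0.5, drop = FALSE]

# Drop the identification-only columns (hypothetical list).
id_cols <- c("X", "user_name")
kept <- kept[, !(names(kept) %in% id_cols), drop = FALSE]

# Columns are removed, rows are untouched.
names(kept)
nrow(kept) == nrow(raw)
```

Applying the same column selection to the testing set keeps both sets aligned on the same variables.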

Creating Subset - training data

For the cross validation part we split the training data into a real training set and a test set.
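A minimal sketch of such a split, in base R on the built-in iris data. The 75/25 proportion is an assumption, although it is consistent with the row counts in the confusion matrices reported later (14718 and 4904 of 19622). The seed and the name `testing` are arbitrary; `inTraining` follows the name that appears in the model call.

```r
set.seed(123)  # arbitrary seed, only to make the sketch reproducible

# iris stands in for the cleaned training set; Species plays the role of classe.
by_class <- split(seq_len(nrow(iris)), iris$Species)

# Sample 75% of the rows within each class so class proportions are preserved.
train_idx <- unlist(
  lapply(by_class, function(i) sample(i, size = floor(0.75 * length(i)))),
  use.names = FALSE
)

inTraining <- iris[train_idx, ]   # used to fit the model
testing    <- iris[-train_idx, ]  # held out for the out-of-sample check

table(inTraining$Species)
```

With caret loaded, `createDataPartition()` offers a similar stratified split in one call.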

Creating the model

These are the three methods that I tried: gradient boosting, random forests, and random forests using the randomForest() function. The first two proved to be quite slow, so they were set aside and I went with randomForest, chosen for its speed and for making very few classification errors in training, tuning and testing. The error estimate descends to near 0, as shown in the plot below.
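As a rough sketch of the chosen approach, the following fits a forest directly with randomForest() on the built-in iris data. The data set, the seed and the `mtry` value are assumptions made for illustration; they are not the settings of the model reported here.

```r
library(randomForest)

set.seed(2017)  # arbitrary seed, only to make the sketch reproducible

# iris stands in for the training subset; Species plays the role of classe.
fit <- randomForest(Species ~ ., data = iris, ntree = 500, mtry = 2)

print(fit)  # OOB error estimate and confusion matrix
plot(fit)   # error rate against the number of trees
```

A note on the call printed in the output that follows: `tuneGrid` is an argument of caret's `train()`, not a documented argument of `randomForest()`, so it appears to have had no effect there. The reported 7 variables tried at each split matches the classification default of `floor(sqrt(52))` for 52 predictors. To control that value directly, pass `mtry`.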

## 
## Call:
##  randomForest(formula = classe ~ ., data = inTraining, tuneGrid = rfGrid) 
##                Type of random forest: classification
##                      Number of trees: 500
## No. of variables tried at each split: 7
## 
##         OOB estimate of  error rate: 0.43%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4181    3    0    1    0 0.0009557945
## B   14 2832    2    0    0 0.0056179775
## C    0   11 2552    4    0 0.0058433970
## D    0    0   22 2388    2 0.0099502488
## E    0    0    1    4 2701 0.0018477458
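The OOB error rate and the class.error column can be recomputed from the confusion matrix itself, which is a useful sanity check. The sketch below copies the counts from the output above; the variable names are my own.

```r
classes <- c("A", "B", "C", "D", "E")

# OOB confusion matrix copied from the output: rows are actual classes,
# columns are predicted classes.
oob <- matrix(
  c(4181,    3,    0,    1,    0,
      14, 2832,    2,    0,    0,
       0,   11, 2552,    4,    0,
       0,    0,   22, 2388,    2,
       0,    0,    1,    4, 2701),
  nrow = 5, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each row that falls off the diagonal.
class_error <- 1 - diag(oob) / rowSums(oob)

# Overall OOB error: all off-diagonal cases over all cases (64 of 14718).
oob_error <- 1 - sum(diag(oob)) / sum(oob)

round(class_error, 10)
round(100 * oob_error, 2)  # 0.43
```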

Cross validation

The “out of sample” test data

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    2    0    0    0
##          B    0  940    4    0    0
##          C    0    7  851    6    0
##          D    0    0    0  798    4
##          E    0    0    0    0  897
## 
## Overall Statistics
##                                         
##                Accuracy : 0.9953        
##                  95% CI : (0.993, 0.997)
##     No Information Rate : 0.2845        
##     P-Value [Acc > NIR] : < 2.2e-16     
##                                         
##                   Kappa : 0.9941        
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9905   0.9953   0.9925   0.9956
## Specificity            0.9994   0.9990   0.9968   0.9990   1.0000
## Pos Pred Value         0.9986   0.9958   0.9850   0.9950   1.0000
## Neg Pred Value         1.0000   0.9977   0.9990   0.9985   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1917   0.1735   0.1627   0.1829
## Detection Prevalence   0.2849   0.1925   0.1762   0.1635   0.1829
## Balanced Accuracy      0.9997   0.9948   0.9961   0.9958   0.9978
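To make the headline statistics concrete, accuracy, kappa and the per-class sensitivity can be derived by hand from the table above. The counts are copied from the output; the variable names are my own, and the kappa formula is the standard Cohen's kappa.

```r
classes <- c("A", "B", "C", "D", "E")

# Out-of-sample confusion matrix copied from the output: rows are predictions,
# columns are reference (true) classes.
cm <- matrix(
  c(1395,   2,   0,   0,   0,
       0, 940,   4,   0,   0,
       0,   7, 851,   6,   0,
       0,   0,   0, 798,   4,
       0,   0,   0,   0, 897),
  nrow = 5, byrow = TRUE,
  dimnames = list(prediction = classes, reference = classes)
)

n <- sum(cm)

# Accuracy: correctly classified cases over all cases (4881 of 4904).
accuracy <- sum(diag(cm)) / n

# Agreement expected by chance, from the row and column totals.
expected <- sum(rowSums(cm) * colSums(cm)) / n^2

# Cohen's kappa: accuracy corrected for chance agreement.
kappa_hat <- (accuracy - expected) / (1 - expected)

# Sensitivity per class: correct predictions over the true class totals.
sensitivity <- diag(cm) / colSums(cm)

round(c(accuracy = accuracy, kappa = kappa_hat), 4)
round(sensitivity, 4)
```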

The model passes this test well, with an accuracy of 0.9953, a kappa of 0.9941, and excellent sensitivity and specificity across the classes.

The last validation for the submission results

Test validation sample

Conclusion

All of the answers were validated as correct at the project submission page.

Prediction Assignment

Damjan Stefanovski

August 20, 2017