library(caret)
library(dplyr)
library(ggplot2)
library(plotly)
library(tree)
library(rpart)
library(randomForest)
## randomForest 4.6-12
Cleaning and tidying the data sets before training. Variables (columns) that carry no usable information are excluded from the analysis. Both sets start with 160 variables; the testing set has only 20 observations. Some of the columns only hold identification data and would have very little impact on the prediction, so they are excluded as well.
As we can see, the 19622 observations are unchanged and we are left with 53 variables in the training set, and 20 observations with 53 variables in the testing set.
For the cross-validation part, we split the training data into an actual training set and a test set.
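The cleaning code itself is not shown in this report, so here is only a minimal sketch of how such a step could look. It assumes that identification columns are dropped by name and that columns containing missing values are dropped too; the toy data frame and its column names are hypothetical stand-ins for the real data.

```r
# Toy stand-in for the raw data; column names here are hypothetical.
raw <- data.frame(
  X         = 1:6,                               # row id (identification only)
  user_name = c("a", "a", "b", "b", "c", "c"),   # identification only
  roll_belt = c(1.4, 1.5, 1.3, 1.6, 1.2, 1.1),
  max_roll  = c(NA, NA, NA, 2.1, NA, NA),        # mostly missing
  classe    = factor(c("A", "A", "B", "B", "C", "C"))
)

# Assumption: identification columns are dropped by name.
id_cols <- c("X", "user_name")
cleaned <- raw[, !(names(raw) %in% id_cols), drop = FALSE]

# Assumption: columns containing any missing value are dropped.
has_na  <- colSums(is.na(cleaned)) > 0
cleaned <- cleaned[, !has_na, drop = FALSE]

dim(raw)        # original shape
dim(cleaned)    # same number of rows, fewer columns
names(cleaned)
```

Only columns are removed here, which is why the number of observations stays the same before and after cleaning. The same column selection would have to be applied to the testing set so both sets keep matching variables.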
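A split like this could be sketched as follows. This is an illustration in base R, not necessarily the code used for the report: the 0.75 proportion is an assumption (it is consistent with the row counts in the outputs further down), the seed and the `inTesting` name are hypothetical, and the toy data frame stands in for the cleaned training data. caret's `createDataPartition()` offers a comparable stratified split.

```r
set.seed(123)  # hypothetical seed, only to make the sketch reproducible

# Toy stand-in for the cleaned training data.
training <- data.frame(
  x      = rnorm(100),
  classe = factor(rep(c("A", "B", "C", "D", "E"), each = 20))
)

# Draw 75% of the row indices within each class (stratified split).
idx_by_class <- split(seq_len(nrow(training)), training$classe)
train_idx <- unlist(lapply(idx_by_class, function(i) {
  i[sample.int(length(i), size = floor(0.75 * length(i)))]
}))

inTraining <- training[train_idx, ]   # used to fit the model
inTesting  <- training[-train_idx, ]  # held out to estimate out-of-sample error

table(inTraining$classe)
table(inTesting$classe)
```

Sampling within each class keeps the class proportions roughly equal in both subsets, which matters when the held-out subset is used to estimate accuracy per class.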
I tried three methods: gradient boosting, random forests, and random forests using the randomForest() function. The first two proved to be quite slow, so I discarded them and chose randomForest() for its speed; it also made very few classification errors in training, tuning and testing. The error estimate descends to near 0, as shown in the plot below.
##
## Call:
## randomForest(formula = classe ~ ., data = inTraining, tuneGrid = rfGrid)
## Type of random forest: classification
## Number of trees: 500
## No. of variables tried at each split: 7
##
## OOB estimate of error rate: 0.43%
## Confusion matrix:
##      A    B    C    D    E  class.error
## A 4181    3    0    1    0 0.0009557945
## B   14 2832    2    0    0 0.0056179775
## C    0   11 2552    4    0 0.0058433970
## D    0    0   22 2388    2 0.0099502488
## E    0    0    1    4 2701 0.0018477458
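A fit of this kind can be sketched as below. This is a minimal illustration, not the exact call from the report: the built-in `iris` data stands in for the training subset (with `Species` playing the role of `classe`), and the seed is hypothetical. As far as I can tell, `tuneGrid` in the call printed above is an argument of caret's `train()` rather than of `randomForest()`, so it likely had no effect; the 7 variables tried per split match the default of `floor(sqrt(52))` for 52 predictors.

```r
library(randomForest)

set.seed(42)  # hypothetical seed

# iris stands in for the training subset; Species plays the role of classe.
fit <- randomForest(Species ~ ., data = iris, ntree = 500)

print(fit)   # OOB error estimate and confusion matrix
plot(fit)    # error rate against the number of trees

# OOB error after all trees have been grown.
oob_error <- fit$err.rate[fit$ntree, "OOB"]
oob_error
```

The out-of-bag (OOB) error is computed from the observations each tree did not see during training, so it already gives a rough estimate of out-of-sample error before any held-out data is used.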
Evaluating the model on the held-out (“out of sample”) test data:
## Confusion Matrix and Statistics
##
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    2    0    0    0
##          B    0  940    4    0    0
##          C    0    7  851    6    0
##          D    0    0    0  798    4
##          E    0    0    0    0  897
##
## Overall Statistics
##
## Accuracy : 0.9953
## 95% CI : (0.993, 0.997)
## No Information Rate : 0.2845
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9941
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity            1.0000   0.9905   0.9953   0.9925   0.9956
## Specificity            0.9994   0.9990   0.9968   0.9990   1.0000
## Pos Pred Value         0.9986   0.9958   0.9850   0.9950   1.0000
## Neg Pred Value         1.0000   0.9977   0.9990   0.9985   0.9990
## Prevalence             0.2845   0.1935   0.1743   0.1639   0.1837
## Detection Rate         0.2845   0.1917   0.1735   0.1627   0.1829
## Detection Prevalence   0.2849   0.1925   0.1762   0.1635   0.1829
## Balanced Accuracy      0.9997   0.9948   0.9961   0.9958   0.9978
The model does well on the held-out data: accuracy is 0.9953 (an out-of-sample error of about 0.47%) and kappa is 0.9941, with sensitivity and specificity above 0.99 for every class.
All 20 test-set predictions were validated as correct on the project submission page.
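To see where these figures come from, the accuracy, kappa and sensitivity can be recomputed by hand from the confusion matrix printed above. The sketch below uses base R only; the variable names are mine and the matrix values are copied from the output.

```r
# Confusion matrix copied from the output above
# (rows = prediction, columns = reference).
cm <- matrix(
  c(1395,   2,   0,   0,   0,
       0, 940,   4,   0,   0,
       0,   7, 851,   6,   0,
       0,   0,   0, 798,   4,
       0,   0,   0,   0, 897),
  nrow = 5, byrow = TRUE,
  dimnames = list(Prediction = LETTERS[1:5], Reference = LETTERS[1:5])
)

n        <- sum(cm)              # number of held-out observations
accuracy <- sum(diag(cm)) / n    # share of correct predictions

# Cohen's kappa: agreement corrected for the agreement expected by chance.
expected    <- sum(rowSums(cm) * colSums(cm)) / n^2
cohen_kappa <- (accuracy - expected) / (1 - expected)

# Per-class sensitivity: correct predictions / reference totals.
sensitivity <- diag(cm) / colSums(cm)

round(accuracy, 4)      # 0.9953
round(cohen_kappa, 4)   # 0.9941
round(sensitivity, 4)
```

Kappa is close to the accuracy here because the five classes are fairly balanced, so the agreement expected by chance is low (about 0.21).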