library(AppliedPredictiveModeling)
library(ElemStatLearn)
library(pgmm)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(gbm)
## Loading required package: survival
## 
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
## 
##     cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
library(lubridate)
## 
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
## 
##     date
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## 
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
## 
##     margin
library(rpart)
library(rpart.plot)
library(tree)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
## 
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
## 
##     importance

Getting and cleaning the data

Loading datasets

trainurl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(url(trainurl))
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(url(testurl))
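
One thing worth checking at this stage is how missing values are encoded. A plain `read.csv()` call only treats the string `NA` as missing, so empty cells and spreadsheet error strings in text columns are kept as ordinary values. The sketch below uses a tiny inline CSV (invented for illustration, not the real data) to show the effect of the `na.strings` argument. Whether the real files need the same markers is an assumption to verify.

```r
# Toy CSV, invented for illustration: one column mixes numbers with
# an empty cell and a spreadsheet error string.
csv_text <- "id,roll,kurtosis\n1,1.5,\n2,2.5,#DIV/0!\n3,NA,0.7\n"

# Default behaviour: only the literal string "NA" becomes missing.
raw <- read.csv(text = csv_text)

# Declaring the extra markers turns them into NA and keeps the column numeric.
clean <- read.csv(text = csv_text, na.strings = c("NA", "", "#DIV/0!"))

colSums(is.na(raw))    # missing values per column with the default settings
colSums(is.na(clean))  # missing values per column with the extra markers
```

With the extra markers declared, a later NA-based column filter also catches columns that are mostly empty strings.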

set.seed(5463)

Partition into training and validation sets

intrainset  <- createDataPartition(y=train$classe, p=0.8, list=FALSE)
trainset <- train[intrainset, ]
valset  <- train[-intrainset, ]
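
For a factor outcome, `createDataPartition()` samples within each class, so the class proportions are roughly preserved in both parts. A minimal base-R sketch of that idea on toy labels (invented for illustration, not caret's actual implementation) might look like this:

```r
set.seed(3)  # arbitrary seed, only to make the toy split reproducible

# Toy outcome with unbalanced classes.
y <- factor(rep(c("A", "B", "C"), times = c(50, 30, 20)))

# Sample 80% of the row indices within each class, then pool them.
in_train <- unlist(lapply(split(seq_along(y), y),
                          function(idx) sample(idx, round(0.8 * length(idx)))))

table(y[in_train])   # class counts in the training part
table(y[-in_train])  # class counts in the validation part
```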

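Remove variables that contain NAs. The columns to keep are chosen on the training set and applied to both sets, so the two stay aligned.

keepcols <- colSums(is.na(trainset)) == 0
trainset <- trainset[, keepcols]
valset <- valset[, keepcols]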
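
The filter above drops every column that has at least one missing value. A softer variant keeps columns whose share of missing values stays below a cut-off. The sketch below uses toy data and a hypothetical 50% threshold, both invented for illustration, and derives the column list from the training rows only.

```r
# Toy data, invented for illustration.
toy_train <- data.frame(a = c(1, 2, 3, 4),
                        b = c(NA, NA, NA, 1),
                        c = c(1, NA, 3, 4))
toy_val   <- data.frame(a = c(5, 6), b = c(2, 3), c = c(7, 8))

# Fraction of missing values per column, measured on the training rows only.
na_share <- colMeans(is.na(toy_train))

# Hypothetical cut-off: keep columns with at most 50% missing values.
keep <- names(na_share)[na_share <= 0.5]

# Apply the same column list to both sets so they stay aligned.
toy_train <- toy_train[, keep, drop = FALSE]
toy_val   <- toy_val[, keep, drop = FALSE]
```

Note that a column kept this way can still contain some NAs, which many models do not accept without imputation.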

Remove the irrelevant first seven columns

trainset <- trainset[, -(1:7)]
valset  <- valset[, -(1:7)]
dim(trainset)
## [1] 15699    86
dim(valset)
## [1] 3923   86

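Remove variables with very low variance. The near-zero-variance columns are identified on the training set and the same columns are dropped from both sets.

nzvcols <- nearZeroVar(trainset)
trainset <- trainset[, -nzvcols]
valset  <- valset[, -nzvcols]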
dim(trainset)
## [1] 15699    53
dim(valset)
## [1] 3923   53
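
A caveat of negative indexing with the result of `nearZeroVar()`: when no column qualifies, the function returns an empty integer vector, and indexing with `-integer(0)` selects zero columns instead of all of them. The sketch below shows the pitfall on a toy data frame with a small guard. The helper name `drop_columns` is hypothetical.

```r
# Hypothetical helper: drop the columns at positions `idx`,
# leaving the data frame unchanged when `idx` is empty.
drop_columns <- function(df, idx) {
  if (length(idx) == 0) return(df)
  df[, -idx, drop = FALSE]
}

# Toy data, invented for illustration.
toy <- data.frame(x = 1:3, y = c(2, 2, 2), z = c("a", "b", "c"))

ncol(toy[, -integer(0), drop = FALSE])  # unguarded: every column is dropped
ncol(drop_columns(toy, integer(0)))     # guarded: all columns are kept
ncol(drop_columns(toy, 2))              # normal case: column y is removed
```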

Model building:

Three classification methods are fitted on the training set, and the best one (the one with the highest accuracy on the validation set) is used for the predictions on the test set. The methods are: A: Classification Tree, B: Random Forest and C: Generalized Boosted Model. The validation accuracy is reported for each model.

Method A: Classification Tree

modfit <- train(classe~., method="rpart", data=trainset)
fancyRpartPlot(modfit$finalModel, main="Classification Tree", type=1, palettes=c("Greys", "Oranges"))

Predicting on the validation set.

predict <- predict(modfit, newdata=valset)
confusionMatrix(predict, valset$classe)$overall[1]
##  Accuracy 
## 0.4866174
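
The accuracy reported by `confusionMatrix()` is the share of correct predictions, that is, the diagonal of the confusion table divided by its total. A minimal base-R sketch on toy labels (invented for illustration) makes the calculation explicit and also shows per-class recall, which can reveal classes a model rarely predicts correctly.

```r
# Toy labels, invented for illustration.
truth <- factor(c("A", "A", "B", "B", "C", "C"), levels = c("A", "B", "C"))
pred  <- factor(c("A", "B", "B", "B", "C", "A"), levels = c("A", "B", "C"))

# Rows are predictions, columns are the reference classes.
cm <- table(Prediction = pred, Reference = truth)

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy <- sum(diag(cm)) / sum(cm)

# Per-class recall: correct predictions per class over the class size.
recall <- diag(cm) / colSums(cm)
```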

To achieve higher accuracy, I build a large number of decision trees through bagging: the data are resampled over and over, and a new classifier is trained on each sample. Voting then averages out the differences between the classifiers. In other words, the random forest method is deployed.

Method B: Random Forest

The model is fitted with 3-fold cross-validation on the training set; its performance is then checked on the validation set.

modfitRF <- train(classe ~ ., data=trainset, method="rf", trControl=trainControl(method="cv", number=3, verboseIter=FALSE))
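
The `trainControl(method = "cv", number = 3)` setting means every candidate model is fitted three times, each time leaving out a different third of the training rows for scoring. The base-R sketch below illustrates the fold logic on ten toy rows. It is a simplification: caret builds the folds itself and, as far as I know, also balances the classes across folds, which this sketch does not.

```r
set.seed(1)  # arbitrary seed, only to make the toy split reproducible

n <- 10  # toy number of rows
k <- 3   # number of folds, as in trainControl(method = "cv", number = 3)

# Assign each row to one of k folds of near-equal size, in random order.
fold <- sample(rep(seq_len(k), length.out = n))

for (i in seq_len(k)) {
  held_out <- which(fold == i)  # rows used to score the model in this round
  fit_rows <- which(fold != i)  # rows used to fit the model in this round
  cat("Fold", i, ": fit on", length(fit_rows),
      "rows, score on", length(held_out), "rows\n")
}
```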

Predicting on the validation set.

predictRF <- predict(modfitRF, newdata=valset)
confusionMatrix(predictRF, valset$classe)$overall[1]
##  Accuracy 
## 0.9920979
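
The voting step behind bagging can be illustrated in a few lines. The sketch below takes the class predictions of five hypothetical trees for four rows (toy values, invented for illustration) and returns the majority class per row. It only shows the vote, not how `randomForest` grows its trees.

```r
# Toy predictions from five hypothetical trees (columns) for four rows.
votes <- matrix(c("A", "A", "B", "A", "A",
                  "B", "B", "B", "C", "B",
                  "C", "A", "C", "C", "C",
                  "A", "B", "B", "B", "A"),
                nrow = 4, byrow = TRUE)

# For each row, pick the class predicted by the most trees.
majority_vote <- function(row_votes) {
  counts <- table(row_votes)
  names(counts)[which.max(counts)]
}

ensemble <- apply(votes, 1, majority_vote)
```

Individual trees disagree on every row here, yet the vote settles on a single class each time, which is how the averaging reduces the variance of single trees.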

Method C: GBM

Boosting builds on weak classifiers: it adds one classifier at a time, and each new classifier is trained to improve the ensemble built so far.

modfitGBM  <- train(classe ~ ., data=trainset, method = "gbm", trControl = trainControl(method = "cv", number = 3), verbose = FALSE)


predictGBM <- predict(modfitGBM, newdata=valset)
confusionMatrix(predictGBM, valset$classe)$overall[1]
##  Accuracy 
## 0.9607443

Random Forest has the highest accuracy (0.992), about three percentage points above the Generalized Boosted Model (0.961). Both far outperform the single classification tree (0.487). Given the high accuracy of the random forest model, an ensemble is unlikely to add much, but for the fun of it we combine the RF and GBM predictions in a stacked model.

predDF <- data.frame(predictRF, predictGBM, classe=valset$classe)
combRFGBM <- train(classe~., data=predDF)
combpred <- predict(combRFGBM, predDF)
confusionMatrix(combpred, valset$classe)$overall[1]
##  Accuracy 
## 0.9920979
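
Note that the combined model above is fitted and scored on the same validation rows, so its accuracy is likely to be optimistic. One possible remedy is to fit the combiner on one half of the validation rows and score it on the other half. The sketch below shows that idea on simulated labels and predictions. All data are invented, and a simple lookup table stands in for the `train()` call as the combiner.

```r
set.seed(2)  # arbitrary seed for the toy simulation

# Simulated stand-ins for the validation labels and two base-model predictions.
classes <- c("A", "B", "C")
n       <- 600
truth   <- sample(classes, n, replace = TRUE)
noisy   <- function(y, err) ifelse(runif(length(y)) < err,
                                   sample(classes, length(y), replace = TRUE), y)
pred1 <- noisy(truth, 0.10)
pred2 <- noisy(truth, 0.30)

# Split the validation rows: one half fits the combiner, the other scores it.
fit_rows   <- sample(n, n / 2)
score_rows <- setdiff(seq_len(n), fit_rows)

# Lookup-table combiner: for each pair of base predictions, remember the
# most frequent true class seen in the fitting half.
key    <- paste(pred1, pred2)
lookup <- tapply(truth[fit_rows], key[fit_rows],
                 function(y) names(which.max(table(y))))

stacked <- lookup[key[score_rows]]
# Fall back to the first model for pairs never seen in the fitting half.
unseen  <- is.na(stacked)
stacked[unseen] <- pred1[score_rows][unseen]

held_out_accuracy <- mean(stacked == truth[score_rows])
```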

Creating an overview of the accuracy on the validation set

Results <- data.frame(
  Model = c('DTree', 'RF', 'GBM', 'Ensemble'),
  Accuracy = rbind(
    confusionMatrix(predict, valset$classe)$overall[1],
    confusionMatrix(predictRF, valset$classe)$overall[1],
    confusionMatrix(predictGBM, valset$classe)$overall[1],
    confusionMatrix(combpred, valset$classe)$overall[1]
  )
)
print(Results)
##      Model  Accuracy
## 1    DTree 0.4866174
## 2       RF 0.9920979
## 3      GBM 0.9607443
## 4 Ensemble 0.9920979

Predicting on the test set.

The ensemble reaches the same validation accuracy as the RF model (0.9921) and needs extra computation, so I will stick with the RF model.

predictiontest <- predict(modfitRF, newdata=test)
resultstest <- data.frame(problem_id=test$problem_id, predicted=predictiontest)
print(resultstest)
##    problem_id predicted
## 1           1         B
## 2           2         A
## 3           3         B
## 4           4         A
## 5           5         A
## 6           6         E
## 7           7         D
## 8           8         B
## 9           9         A
## 10         10         A
## 11         11         B
## 12         12         C
## 13         13         B
## 14         14         A
## 15         15         E
## 16         16         E
## 17         17         A
## 18         18         B
## 19         19         B
## 20         20         B