library(AppliedPredictiveModeling)
library(ElemStatLearn)
library(pgmm)
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(e1071)
library(gbm)
## Loading required package: survival
##
## Attaching package: 'survival'
## The following object is masked from 'package:caret':
##
## cluster
## Loading required package: splines
## Loading required package: parallel
## Loaded gbm 2.1.3
library(lubridate)
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
library(elasticnet)
## Loading required package: lars
## Loaded lars 1.2
library(randomForest)
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
##
## Attaching package: 'randomForest'
## The following object is masked from 'package:ggplot2':
##
## margin
library(rpart)
library(rpart.plot)
library(tree)
library(rattle)
## Rattle: A free graphical interface for data science with R.
## Version 5.1.0 Copyright (c) 2006-2017 Togaware Pty Ltd.
## Type 'rattle()' to shake, rattle, and roll your data.
##
## Attaching package: 'rattle'
## The following object is masked from 'package:randomForest':
##
## importance
trainurl <- "http://d396qusza40orc.cloudfront.net/predmachlearn/pml-training.csv"
train <- read.csv(url(trainurl))
testurl <- "https://d396qusza40orc.cloudfront.net/predmachlearn/pml-testing.csv"
test <- read.csv(url(testurl))
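On R 4.0 and later, `read.csv` no longer converts text columns to factors by default, while recent versions of `confusionMatrix` expect the outcome to be a factor. A minimal sketch of a more explicit import follows. It uses a small inline CSV (made up for illustration) instead of the real download, and the `na.strings` values are an assumption about how missing entries are encoded in the file.

```r
# Toy CSV standing in for the real file (contents are made up for illustration)
csvtext <- 'user_name,roll_belt,kurtosis_roll_belt,classe
carlitos,1.41,,A
pedro,1.48,#DIV/0!,B
adelmo,NA,0.52,A'

# Treat empty fields and "#DIV/0!" as missing (an assumption about the encoding)
# and convert text columns to factors regardless of the R version's default
toy <- read.csv(text = csvtext,
                na.strings = c("NA", "", "#DIV/0!"),
                stringsAsFactors = TRUE)

str(toy)
is.factor(toy$classe)
```

The same two arguments could be passed to the `read.csv(url(...))` calls above. Note that doing so would change which columns count as missing and therefore the column counts reported later.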
set.seed(5463)
intrainset <- createDataPartition(y=train$classe, p=0.8, list=FALSE)
trainset <- train[intrainset, ]
valset <- train[-intrainset, ]
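# Keep only columns without missing values; decide on the training split
# and apply the same selection to the validation split so the columns match
completecols <- colSums(is.na(trainset))==0
trainset <- trainset[, completecols]
valset <- valset[, completecols]
# Drop the first seven columns (row index, user name, timestamps, window info)
trainset <- trainset[, -(1:7)]
valset <- valset[, -(1:7)]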
dim(trainset)
## [1] 15699 86
dim(valset)
## [1] 3923 86
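# Find near-zero-variance columns on the training split only and drop the same columns from both splits
nzv <- nearZeroVar(trainset)
if (length(nzv) > 0) trainset <- trainset[, -nzv]
if (length(nzv) > 0) valset <- valset[, -nzv]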
dim(trainset)
## [1] 15699 53
dim(valset)
## [1] 3923 53
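`nearZeroVar` flags predictors that are almost constant. A rough base-R sketch of the two statistics it relies on (frequency ratio and percentage of unique values) is shown below. The cutoffs 95/5 and 10 are assumed to match caret's defaults, and the helper name `nzvstats` and the toy columns are hypothetical.

```r
# Hypothetical helper: frequency ratio and percent unique for one column
nzvstats <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)
  # Count of the most common value divided by the count of the runner-up
  freqratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  pctunique <- 100 * length(counts) / length(x)
  c(freqratio = freqratio, pctunique = pctunique)
}

set.seed(1)
toy <- data.frame(
  almostconstant = c(rep(0, 98), 1, 2),  # dominated by a single value
  informative    = rnorm(100)            # every value is different
)

stats <- sapply(toy, nzvstats)
print(stats)

# Flag a column when both conditions hold (cutoffs assumed to match caret's defaults)
flagged <- stats["freqratio", ] > 95 / 5 & stats["pctunique", ] < 10
print(flagged)
```

Only `almostconstant` meets both conditions, which is the kind of column the preprocessing step above removes.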
Three methods are applied to model the classification problem on the training set, and the best one (the one with the highest accuracy on the validation set) is used for the predictions on the test set. The methods are: A: Classification Tree, B: Random Forest and C: Generalized Boosted Model. The validation accuracy is reported for each model.

### Method A: Classification Tree
modfit <- train(classe~., method="rpart", data=trainset)
fancyRpartPlot(modfit$finalModel, main="Classification Tree", type=1, palettes=c("Greys", "Oranges"))
Predicting on the validation set.
predict <- predict(modfit, newdata=valset)
confusionMatrix(predict, valset$classe)$overall[1]
## Accuracy
## 0.4866174
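The accuracy reported by `confusionMatrix` is the share of validation rows whose predicted class equals the true class. A small base-R sketch with made-up labels shows the same calculation from a plain cross-table; the vectors `truth` and `pred` are hypothetical.

```r
classes <- c("A", "B", "C")
truth <- factor(c("A", "A", "B", "B", "C", "C", "C", "A"), levels = classes)
pred  <- factor(c("A", "B", "B", "B", "C", "A", "C", "A"), levels = classes)

# Rows are predictions, columns are the true classes
confusion <- table(Predicted = pred, Actual = truth)
print(confusion)

# Correct predictions sit on the diagonal
accuracy <- sum(diag(confusion)) / sum(confusion)
print(accuracy)
```

The off-diagonal cells show which classes get confused with each other, which is often more informative than the single accuracy figure.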
To achieve higher accuracy, I build a large number of decision trees with bagging: the data are resampled over and over, a new classifier is trained on each sample, and the classifiers vote so that their individual differences average out. In other words, the random forest method is used.

### Method B: Random Forest

The forest is tuned with 3-fold cross-validation on the training set, and its performance is then checked on the validation set.
modfitRF <- train(classe ~ ., data=trainset, method="rf", trControl=trainControl(method="cv", number=3, verboseIter=FALSE))
Predicting on the validation set.
predictRF <- predict(modfitRF, newdata=valset)
confusionMatrix(predictRF, valset$classe)$overall[1]
## Accuracy
## 0.9920979
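The bagging-and-voting idea behind random forests can be sketched by hand. The example below is an illustration only, not the algorithm that `method="rf"` runs: it uses `rpart` trees on the built-in `iris` data, omits the random feature selection that random forests add at each split, and all variable names are hypothetical.

```r
library(rpart)

set.seed(42)
# Hold out 30 rows of iris for checking the vote
holdout <- sample(nrow(iris), 30)
traindata <- iris[-holdout, ]
testdata  <- iris[holdout, ]

ntrees <- 25
# One column of predicted labels per bootstrap tree
votes <- sapply(seq_len(ntrees), function(i) {
  # Resample the training rows with replacement
  bootrows <- sample(nrow(traindata), replace = TRUE)
  fit <- rpart(Species ~ ., data = traindata[bootrows, ], method = "class")
  as.character(predict(fit, newdata = testdata, type = "class"))
})

# Majority vote across trees for each held-out row
bagged <- apply(votes, 1, function(rowvotes) names(which.max(table(rowvotes))))
baggedaccuracy <- mean(bagged == testdata$Species)
print(baggedaccuracy)
```

Each tree sees a slightly different sample, so individual trees disagree on some rows; the vote smooths out those disagreements.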
### Method C: GBM

Boosting builds on weak classifiers by adding one classifier at a time, so that each new classifier is trained to improve the ensemble trained so far.
modfitGBM <- train(classe ~ ., data=trainset, method = "gbm", trControl = trainControl(method = "cv", number = 3), verbose = FALSE)
predictGBM <- predict(modfitGBM, newdata=valset)
confusionMatrix(predictGBM, valset$classe)$overall[1]
## Accuracy
## 0.9607443
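The stagewise idea behind boosting, where each new weak learner is fitted to what the current ensemble still gets wrong, can be illustrated with a simplified sketch. It is not what `gbm` does for this multi-class problem: the example uses squared-error loss on simulated regression data with `rpart` stumps as weak learners, and the names and settings (`nrounds`, `shrinkage`, `ensemblefit`) are hypothetical.

```r
library(rpart)

set.seed(7)
simdata <- data.frame(x = seq(0, 2 * pi, length.out = 200))
simdata$y <- sin(simdata$x) + rnorm(200, sd = 0.2)

nrounds <- 50
shrinkage <- 0.1
# Start from a constant prediction
ensemblefit <- rep(mean(simdata$y), nrow(simdata))
mse <- numeric(nrounds + 1)
mse[1] <- mean((simdata$y - ensemblefit)^2)

for (m in seq_len(nrounds)) {
  # Fit a one-split tree (a weak learner) to the current residuals
  simdata$residual <- simdata$y - ensemblefit
  stump <- rpart(residual ~ x, data = simdata,
                 control = rpart.control(maxdepth = 1, cp = 0))
  # Add a shrunken version of the stump to the ensemble
  ensemblefit <- ensemblefit + shrinkage * predict(stump, newdata = simdata)
  mse[m + 1] <- mean((simdata$y - ensemblefit)^2)
}

# Training error at the start, after 10 rounds and after 50 rounds
print(round(mse[c(1, 11, 51)], 4))
```

The training error shrinks as stumps are added. In practice the number of rounds and the shrinkage are tuned on held-out data, which is what the cross-validation in the `train` call above is for.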
Random Forest has the highest accuracy, about three percentage points above the Generalized Boosted Model (0.992 versus 0.961). Both clearly outperform the single decision tree. Because the random forest is already this accurate, an ensemble is unlikely to add much, but for the fun of it we will combine the RF and GBM predictions and see whether the result is any better.
predDF <- data.frame(predictRF, predictGBM, classe=valset$classe)
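# With no method given, train() falls back to its default, a random forest; it is spelled out here
combRFGBM <- train(classe~., data=predDF, method="rf")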
combpred <- predict(combRFGBM, predDF)
confusionMatrix(combpred, valset$classe)$overall[1]
## Accuracy
## 0.9920979
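The combined model above is trained and scored on the same validation predictions, so its accuracy is likely optimistic. One way to get a fairer estimate is to fit the combiner on one part of the held-out data and score it on another. The sketch below illustrates this with simulated labels and two simulated base models; the combiner is a simple lookup table rather than a random forest, and all names (`noisypred`, `stackfit`, `stackeval`) are hypothetical.

```r
set.seed(123)
classes <- c("A", "B", "C")
n <- 600
truth <- factor(sample(classes, n, replace = TRUE), levels = classes)

# Hypothetical base-model predictions: the true class with the given
# probability, otherwise a randomly drawn class
noisypred <- function(truth, pcorrect) {
  keep <- runif(length(truth)) < pcorrect
  guess <- sample(levels(truth), length(truth), replace = TRUE)
  factor(ifelse(keep, as.character(truth), guess), levels = levels(truth))
}
preds <- data.frame(modelA = noisypred(truth, 0.9),
                    modelB = noisypred(truth, 0.8),
                    truth = truth)

# Fit the combiner on one half and evaluate it on the other half
fitrows <- sample(n, n / 2)
stackfit  <- preds[fitrows, ]
stackeval <- preds[-fitrows, ]

# Simple combiner: for every (modelA, modelB) pair, predict the true class
# seen most often for that pair in the fitting half
lookup <- table(stackfit$modelA, stackfit$modelB, stackfit$truth)
combined <- mapply(function(a, b) classes[which.max(lookup[a, b, ])],
                   as.character(stackeval$modelA),
                   as.character(stackeval$modelB))

evalaccuracy <- c(modelA = mean(stackeval$modelA == stackeval$truth),
                  modelB = mean(stackeval$modelB == stackeval$truth),
                  combined = mean(combined == stackeval$truth))
print(round(evalaccuracy, 3))
```

When one base model is already much stronger than the other, the combiner tends to follow it, which is consistent with the ensemble matching the RF accuracy here.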
Results <- data.frame(
  Model = c('DTree', 'RF', 'GBM', 'Ensemble'),
  Accuracy = rbind(
    confusionMatrix(predict, valset$classe)$overall[1],
    confusionMatrix(predictRF, valset$classe)$overall[1],
    confusionMatrix(predictGBM, valset$classe)$overall[1],
    confusionMatrix(combpred, valset$classe)$overall[1]
  )
)
print(Results)
## Model Accuracy
## 1 DTree 0.4866174
## 2 RF 0.9920979
## 3 GBM 0.9607443
## 4 Ensemble 0.9920979
Because the ensemble reaches the same accuracy as RF and needs more computation, I will stick with the RF model.
predictiontest <- predict(modfitRF, newdata=test)
resultstest <- data.frame(problem_id=test$problem_id, predicted=predictiontest)
print(resultstest)
## problem_id predicted
## 1 1 B
## 2 2 A
## 3 3 B
## 4 4 A
## 5 5 A
## 6 6 E
## 7 7 D
## 8 8 B
## 9 9 A
## 10 10 A
## 11 11 B
## 12 12 C
## 13 13 B
## 14 14 A
## 15 15 E
## 16 16 E
## 17 17 A
## 18 18 B
## 19 19 B
## 20 20 B