Performance estimation evaluates a model using metrics of predictive performance on data drawn from an unseen distribution. We compare an SVM (Support Vector Machine) and an RF (Random Forest) model to determine which estimation approach and which model works better. The redWine data is treated as a classification task.
The goal of performance estimation is to obtain a reliable estimate of the prediction error of a model on the unknown data distribution. To obtain such an estimate, the model must be evaluated on unseen data held out as a test set.
Another goal of performance estimation is to repeat the testing several times, collecting a set of scores so that we can report the average error rate along with its standard error as an estimate of the true population error.
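As a minimal sketch of this idea (the error scores below are hypothetical, not taken from the experiments that follow), the average error and its standard error over repeated runs can be computed as:
# Hypothetical error rates collected from 10 repeated test runs
errs <- c(0.37, 0.35, 0.38, 0.36, 0.34, 0.37, 0.39, 0.36, 0.35, 0.38)
mean(errs)                     # average error rate
sd(errs) / sqrt(length(errs))  # standard error of the estimate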
Loading Data
load("redWine.Rdata")
# Load the required libraries
library(performanceEstimation)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(e1071)
library(lattice)
library(DMwR)
## Loading required package: grid
Hold Out Method
70% Training and 30% Test Set
trPrec <- 0.7  # use 70% of the rows for training
sp <- sample(1:nrow(redWine), as.integer(trPrec * nrow(redWine)))
redWine$quality <- as.factor(redWine$quality)  # treat quality as a class label
train <- redWine[sp, ]   # training set
test  <- redWine[-sp, ]  # test set
modelSVM <- svm(quality ~ ., train)  # default radial kernel
modelRF <- randomForest(quality ~ ., train, ntree = 750, importance = TRUE)
pred_svm <- predict(modelSVM, test)  # factor response, so predictions are class labels
pred_RF <- predict(modelRF, test)
classificationMetrics(test$quality,pred_svm)
## acc err microF macroF macroRec macroPrec
## 0.5958333 0.4041667 0.5958333 NaN 0.2701593 NaN
classificationMetrics(test$quality,pred_RF)
## acc err microF macroF macroRec macroPrec
## 0.6604167 0.3395833 0.6604167 NaN 0.3059113 NaN
On this single hold-out split, the radial-kernel Support Vector Machine has an error rate of 40.4%, higher than the 34.0% error rate of the Random Forest.
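Note that a single hold-out estimate depends on which rows happen to land in the test set. A minimal sketch (repeating the same 70/30 split for a Random Forest with default settings) shows this spread, which motivates the iterated hold-out below:
# Repeat the 70/30 split a few times to see how the estimate varies
errs <- sapply(1:5, function(i) {
  sp <- sample(1:nrow(redWine), as.integer(0.7 * nrow(redWine)))
  m <- randomForest(quality ~ ., redWine[sp, ])
  mean(predict(m, redWine[-sp, ]) != redWine$quality[-sp])
})
errs  # one error rate per random split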
Hold-Out set with Iterations
res <- performanceEstimation(
PredTask(quality ~ .,redWine),
workflowVariants(learner=c("svm","randomForest")),
EstimationTask(metrics="err",method=Holdout(nReps=10,hldSz=0.3)) )
##
##
## ##### PERFORMANCE ESTIMATION USING HOLD OUT #####
##
## ** PREDICTIVE TASK :: redWine.quality
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating err using
## 10 x 70 % / 30 % Holdout
## Run with seed = 1234
## Iteration : 1 2 3 4 5 6 7 8 9 10
##
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating err using
## 10 x 70 % / 30 % Holdout
## Run with seed = 1234
## Iteration : 1 2 3 4 5 6 7 8 9 10
Selection of Model
summary(res)
##
## == Summary of a Hold Out Performance Estimation Experiment ==
##
## Task for estimating err using
## 10 x 70 % / 30 % Holdout
## Run with seed = 1234
##
## * Predictive Tasks :: redWine.quality
## * Workflows :: svm, randomForest
##
## -> Task: redWine.quality
## *Workflow: svm
## err
## avg 0.36534447
## std 0.01530967
## med 0.36638831
## iqr 0.01774530
## min 0.34237996
## max 0.38622129
## invalid 0.00000000
##
## *Workflow: randomForest
## err
## avg 0.318162839
## std 0.011186378
## med 0.315240084
## iqr 0.007306889
## min 0.304801670
## max 0.340292276
## invalid 0.000000000
rankWorkflows(res,2)
## $redWine.quality
## $redWine.quality$err
## Workflow Estimate
## 1 randomForest 0.3181628
## 2 svm 0.3653445
topPerformers(res) # Choose Top Performing model
## $redWine.quality
## Workflow Estimate
## err randomForest 0.318
Plot the Hold-Out (Error Rate) Model Comparison
plot(res)
Averaging over the 10 hold-out iterations with a 30% test set, the estimated error rate drops to 36.5% for SVM and 31.8% for Random Forest; Random Forest remains the better model.
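Beyond ranking by average error, we can ask whether the gap between the two workflows is statistically significant. A sketch using the package's pairedComparisons() with its defaults (it pairs the per-iteration error scores of the two workflows):
# Paired statistical comparison of the per-iteration error scores
pairedComparisons(res)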
K-Fold Cross Validation
res <- performanceEstimation(
PredTask(quality ~ .,redWine),
workflowVariants(learner=c("svm","randomForest")),
EstimationTask(metrics="err",method=CV(nReps=1,nFolds=10)) )
##
##
## ##### PERFORMANCE ESTIMATION USING CROSS VALIDATION #####
##
## ** PREDICTIVE TASK :: redWine.quality
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating err using
## 1 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**********
##
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating err using
## 1 x 10 - Fold Cross Validation
## Run with seed = 1234
## Iteration :**********
10-Fold Cross Validation randomly splits the n observations into 10 non-overlapping folds. Each fold in turn is held out as the validation set while the model is trained on the remaining 9 folds, and the 10 test error rates are averaged.
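For illustration, the same procedure can be written by hand. A minimal sketch (random fold assignment, default SVM settings, simplified relative to the package's implementation):
# Manual 10-fold CV sketch: assign each row to one of 10 folds at random
folds <- sample(rep(1:10, length.out = nrow(redWine)))
cvErr <- sapply(1:10, function(k) {
  tr <- redWine[folds != k, ]  # train on the other 9 folds
  te <- redWine[folds == k, ]  # hold out fold k for testing
  m <- svm(quality ~ ., tr)
  mean(predict(m, te) != te$quality)
})
mean(cvErr)  # test error averaged over the 10 folds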
Selection of Model
summary(res)
##
## == Summary of a Cross Validation Performance Estimation Experiment ==
##
## Task for estimating err using
## 1 x 10 - Fold Cross Validation
## Run with seed = 1234
##
## * Predictive Tasks :: redWine.quality
## * Workflows :: svm, randomForest
##
## -> Task: redWine.quality
## *Workflow: svm
## err
## avg 0.37232704
## std 0.03325347
## med 0.36163522
## iqr 0.03144654
## min 0.33333333
## max 0.44654088
## invalid 0.00000000
##
## *Workflow: randomForest
## err
## avg 0.27987421
## std 0.03857101
## med 0.26729560
## iqr 0.01886792
## min 0.24528302
## max 0.36477987
## invalid 0.00000000
rankWorkflows(res,2)
## $redWine.quality
## $redWine.quality$err
## Workflow Estimate
## 1 randomForest 0.2798742
## 2 svm 0.3723270
topPerformers(res) # Choose Top Performing model
## $redWine.quality
## Workflow Estimate
## err randomForest 0.28
Plot the Cross Validation (Error Rate) Model Comparison
plot(res)
Bootstrapping
res <- performanceEstimation(
PredTask(quality ~ .,redWine),
workflowVariants(learner=c("svm","randomForest")),
EstimationTask(metrics="err",method=Bootstrap(nReps=100)) )
##
##
## ##### PERFORMANCE ESTIMATION USING BOOTSTRAP #####
##
## ** PREDICTIVE TASK :: redWine.quality
##
## ++ MODEL/WORKFLOW :: svm
## Task for estimating err using
## 100 repetitions of e0 Bootstrap experiment
## Run with seed = 1234
## Iteration : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
##
##
## ++ MODEL/WORKFLOW :: randomForest
## Task for estimating err using
## 100 repetitions of e0 Bootstrap experiment
## Run with seed = 1234
## Iteration : 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100
The bootstrap method trains a model on a random sample of size n drawn with replacement from the data. In the e0 variant used here, the observations left out of a given resample (the out-of-bag cases, about 36.8% of the rows on average) form the test set, and the procedure is repeated for the requested number of iterations.
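A minimal sketch of a single e0 bootstrap iteration (train on an n-sized resample drawn with replacement, test on the out-of-bag rows the resample never drew):
# One e0 bootstrap iteration
idx <- sample(1:nrow(redWine), nrow(redWine), replace = TRUE)
oob <- setdiff(1:nrow(redWine), idx)  # rows never drawn: the test set
m <- randomForest(quality ~ ., redWine[idx, ])
mean(predict(m, redWine[oob, ]) != redWine$quality[oob])  # e0 error estimate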