R Markdown

Performance estimation evaluates a model using metrics of predictive performance on data it has not seen. The goal is to obtain a reliable estimate of the prediction error the model would make on the unknown data distribution, which means the estimate must be computed on data held out from training. A further goal is to repeat the testing several times, collect the resulting scores, and report their average error rate together with the standard error as an estimate of the true population value.

We would like to compare the performance of the SVM (Support Vector Machine) with the RF (Random Forest) to see which model has superior predictive capability. The redWine data is treated as a classification problem.

Loading Data

load("redWine.Rdata")

# Set the Library functions
library(performanceEstimation)
library(randomForest)
## randomForest 4.6-12
## Type rfNews() to see new features/changes/bug fixes.
library(e1071)
library(lattice)
library(DMwR)
## Loading required package: grid
redWine$quality<-as.factor(redWine$quality) # Convert the target to a factor so the models treat this as classification

The target variable is converted to a qualitative (factor) variable before it is passed to the models.
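
As a quick sanity check (illustrative only, not part of the original assignment code), the class distribution of the now-qualitative target can be inspected:

table(redWine$quality) # counts per quality level after the conversion to a factor
str(redWine$quality)   # confirms the target is now a factor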

Hold Out Method

70% Training and 30% Test Set

set.seed(1234)
trPrec<-0.7 # 70% training set
sp<-sample(1:nrow(redWine),as.integer(trPrec*nrow(redWine)))

train<-redWine[sp,] # Training Set
test<-redWine[-sp,] # Test Set

Assignment 3 Model Code

set.seed(1234)
modelSVM <- svm(quality ~ ., train,cost=1,gamma=0.5) # Parameters based on tune.svm() in Assignment 3
modelRF <- randomForest(quality ~ ., train, ntree=750,importance=T)

print("Classification Error Rate Metrics: SVM")
## [1] "Classification Error Rate Metrics: SVM"
predictSVM <- predict(modelSVM, test)
(cmSVM <- table(predictSVM, test$quality)) # confusion matrix
##           
## predictSVM   3   4   5   6   7   8
##          3   0   0   0   0   0   0
##          4   0   0   0   0   0   0
##          5   1  11 153  44   6   0
##          6   0   7  52 126  33   6
##          7   0   0   1  10  27   3
##          8   0   0   0   0   0   0
100*(1-sum(diag(cmSVM))/sum(cmSVM)) # the error rate
## [1] 36.25
print("Classification Error Rate Metrics: RandomForest")
## [1] "Classification Error Rate Metrics: RandomForest"
predictRF <- predict(modelRF, test)
(cmRF <- table(predictRF, test$quality))
##          
## predictRF   3   4   5   6   7   8
##         3   0   0   0   0   0   0
##         4   0   0   0   0   0   0
##         5   1   9 162  38   3   0
##         6   0   9  44 135  30   5
##         7   0   0   0   7  33   3
##         8   0   0   0   0   0   1
100*(1-sum(diag(cmRF))/sum(cmRF)) 
## [1] 31.04167

Based on the tune.svm() grid search output from Assignment 3, the best SVM model used cost = 1 and gamma = 0.5. Using the holdout method, which consists of randomly dividing the available data into two subsets (a training set and a test/validation set), the error rate of the radial-kernel Support Vector Machine is 36.25%, compared with 31.04% for the Random Forest. Although the radial kernel was the best SVM model found by tuning, it does not perform better than the Random Forest.
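
As an optional illustration (not in the original assignment code), the overall error rate can be broken down by true quality class from the two confusion matrices above. Rows are predictions and columns are the true classes, and the calculation assumes every class appears at least once in the test set:

round(100*(1-diag(cmSVM)/colSums(cmSVM)),1) # SVM error rate (%) per true quality class
round(100*(1-diag(cmRF)/colSums(cmRF)),1)   # Random Forest error rate (%) per true quality class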

The holdout method yields a single estimate from a single split of the data, and is more commonly applied to larger datasets (more than a few thousand cases). To identify the model of choice more clearly, we also use other performance estimation methods. The steps below use performanceEstimation to determine the model we would likely choose.

For the performance estimation below, the group has chosen the best tuned parameters for the SVM radial kernel (cost = 1, gamma = 0.5) together with an additional setting of cost = 10, gamma = 0.9 for comparison; the cross product of these values generates the four SVM variants. For the Random Forest, the parameter tested is the number of trees, chosen from 750, 1000 and 2000. The mapping from each automatically generated variant name (svm.v1 to svm.v4, randomForest.v1 to randomForest.v3) to its parameter values can be inspected, as sketched after the estimation run below.

Sub-sampling set with Iterations

res_sub <- performanceEstimation(
  PredTask(quality ~ .,redWine),
    c(workflowVariants(learner=c("svm"),learner.pars=list(cost=c(1,10),gamma=c(0.5,0.9)) ),
    workflowVariants(learner=c("randomForest"),learner.pars=list(ntree=c(750,1000,2000))) ),
  EstimationTask(metrics="err",method=Holdout(nReps=10,hldSz=0.3))  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  HOLD OUT  #####
## 
## ** PREDICTIVE TASK :: redWine.quality
## 
## ++ MODEL/WORKFLOW :: svm.v1 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: svm.v2 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: svm.v3 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: svm.v4 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10

Selection of Model

summary(res_sub) # Summary of the Workflow output with Error Statistics
## 
## == Summary of a  Hold Out Performance Estimation Experiment ==
## 
## Task for estimating  err  using
##  10 x 70 % / 30 % Holdout
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  redWine.quality
## * Workflows  ::  svm.v1, svm.v2, svm.v3, svm.v4, randomForest.v1, randomForest.v2, randomForest.v3 
## 
## -> Task:  redWine.quality
##   *Workflow: svm.v1 
##                err
## avg     0.35386221
## std     0.01357440
## med     0.35177453
## iqr     0.02035491
## min     0.33402923
## max     0.37787056
## invalid 0.00000000
## 
##   *Workflow: svm.v2 
##                err
## avg     0.36200418
## std     0.01993224
## med     0.35908142
## iqr     0.01931106
## min     0.32776618
## max     0.39874739
## invalid 0.00000000
## 
##   *Workflow: svm.v3 
##                err
## avg     0.35240084
## std     0.01915416
## med     0.34655532
## iqr     0.01983299
## min     0.32985386
## max     0.39039666
## invalid 0.00000000
## 
##   *Workflow: svm.v4 
##                 err
## avg     0.367432150
## std     0.017687216
## med     0.361169102
## iqr     0.005741127
## min     0.350730689
## max     0.409185804
## invalid 0.000000000
## 
##   *Workflow: randomForest.v1 
##                 err
## avg     0.316701461
## std     0.011563157
## med     0.314196242
## iqr     0.007306889
## min     0.304801670
## max     0.340292276
## invalid 0.000000000
## 
##   *Workflow: randomForest.v2 
##                 err
## avg     0.317118998
## std     0.012961205
## med     0.314196242
## iqr     0.007306889
## min     0.304801670
## max     0.342379958
## invalid 0.000000000
## 
##   *Workflow: randomForest.v3 
##                 err
## avg     0.317327766
## std     0.013093172
## med     0.313152401
## iqr     0.006784969
## min     0.304801670
## max     0.342379958
## invalid 0.000000000
rankWorkflows(res_sub,1) # Top-ranked workflow estimate
## $redWine.quality
## $redWine.quality$err
##          Workflow  Estimate
## 1 randomForest.v1 0.3167015
topPerformers(res_sub) # Choose Top Performing model
## $redWine.quality
##            Workflow Estimate
## err randomForest.v1    0.317

Plot the Random Sub-Sampling (Error Rate) Comparison

plot(res_sub)

The number of iterations for random sub-sampling is set by nReps, which specifies how many random train/test splits are performed (here 10, each with a 70%/30% split) before the scores are averaged. The method is commonly used for very large datasets. The risk is that, for small datasets, the held-out portion may be too small to validate on.
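
Conceptually, Holdout(nReps=10,hldSz=0.3) repeats the manual split from earlier ten times and averages the scores. A minimal hand-rolled sketch of that idea (illustrative only; the actual splits drawn by performanceEstimation will differ):

set.seed(1234)
errs <- replicate(10, {
  idx <- sample(1:nrow(redWine), as.integer(0.7*nrow(redWine))) # random 70% training split
  fit <- randomForest(quality ~ ., redWine[idx,])
  mean(predict(fit, redWine[-idx,]) != redWine$quality[-idx])   # error on the 30% held out
})
mean(errs)                    # average error over the 10 repetitions
sd(errs)/sqrt(length(errs))   # standard error of that average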

From the results on the 30% test sets, the average error rate is approximately 35.2%-36.7% for the SVM variants and about 31.7% for the Random Forest variants. The output shows that the Random Forest has clearly outperformed the SVM.
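
Whether that gap is statistically significant could be checked with the package's paired comparison facilities. A hedged sketch, assuming pairedComparisons() is available in this version of performanceEstimation:

# Compare every workflow against the top performer over the 10 repetitions;
# the returned object holds the paired test results for the "err" metric.
pc <- pairedComparisons(res_sub, baseline="randomForest.v1")
pc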

K-Fold Cross Validation

res_cv <- performanceEstimation(
  PredTask(quality ~ .,redWine),
    c(workflowVariants(learner=c("svm"),learner.pars=list(cost=c(1,10),gamma=c(0.5,0.9)) ),
    workflowVariants(learner=c("randomForest"),learner.pars=list(ntree=c(750,1000,2000))) ),
  EstimationTask(metrics="err",method=CV(nReps=1,nFolds=10))  )
## 
## 
## ##### PERFORMANCE ESTIMATION USING  CROSS VALIDATION  #####
## 
## ** PREDICTIVE TASK :: redWine.quality
## 
## ++ MODEL/WORKFLOW :: svm.v1 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: svm.v2 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: svm.v3 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: svm.v4 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## Iteration :**********

The 10-fold cross validation randomly splits the n observations into 10 non-overlapping groups (folds); each fold is used once as the validation set while the remaining 9 folds are used for training. This is repeated 10 times, each time with a different test fold, and the average of the k individual scores is reported. It is commonly used with average-sized datasets containing a few hundred to a few thousand instances.
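
A minimal hand-rolled sketch of the fold assignment (illustrative only; CV(nReps=1,nFolds=10) handles this internally and its folds will differ):

set.seed(1234)
folds <- sample(rep(1:10, length.out=nrow(redWine))) # assign each row to one of 10 folds
cv_err <- sapply(1:10, function(k) {
  fit <- randomForest(quality ~ ., redWine[folds!=k,])                # train on the other 9 folds
  mean(predict(fit, redWine[folds==k,]) != redWine$quality[folds==k]) # test on the held-out fold
})
mean(cv_err) # average error over the 10 folds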

Selection of Model

summary(res_cv) # Summary of the Workflow output with Error Statistics
## 
## == Summary of a  Cross Validation Performance Estimation Experiment ==
## 
## Task for estimating  err  using
##  1 x 10 - Fold Cross Validation
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  redWine.quality
## * Workflows  ::  svm.v1, svm.v2, svm.v3, svm.v4, randomForest.v1, randomForest.v2, randomForest.v3 
## 
## -> Task:  redWine.quality
##   *Workflow: svm.v1 
##                err
## avg     0.34088050
## std     0.04201249
## med     0.33333333
## iqr     0.06289308
## min     0.29559748
## max     0.42767296
## invalid 0.00000000
## 
##   *Workflow: svm.v2 
##                err
## avg     0.33584906
## std     0.05119770
## med     0.32389937
## iqr     0.04874214
## min     0.27672956
## max     0.45283019
## invalid 0.00000000
## 
##   *Workflow: svm.v3 
##                err
## avg     0.32201258
## std     0.04923725
## med     0.31761006
## iqr     0.06446541
## min     0.25786164
## max     0.42138365
## invalid 0.00000000
## 
##   *Workflow: svm.v4 
##                err
## avg     0.32641509
## std     0.04112970
## med     0.31446541
## iqr     0.03773585
## min     0.28301887
## max     0.42138365
## invalid 0.00000000
## 
##   *Workflow: randomForest.v1 
##                err
## avg     0.27987421
## std     0.04078632
## med     0.26729560
## iqr     0.02358491
## min     0.23899371
## max     0.37735849
## invalid 0.00000000
## 
##   *Workflow: randomForest.v2 
##                err
## avg     0.28050314
## std     0.03979917
## med     0.27358491
## iqr     0.02987421
## min     0.24528302
## max     0.37106918
## invalid 0.00000000
## 
##   *Workflow: randomForest.v3 
##                err
## avg     0.28050314
## std     0.03845117
## med     0.27358491
## iqr     0.03616352
## min     0.23899371
## max     0.35849057
## invalid 0.00000000
rankWorkflows(res_cv,2) # Top 2 ranked workflow estimates
## $redWine.quality
## $redWine.quality$err
##          Workflow  Estimate
## 1 randomForest.v1 0.2798742
## 2 randomForest.v2 0.2805031
topPerformers(res_cv) # Choose Top Performing model
## $redWine.quality
##            Workflow Estimate
## err randomForest.v1     0.28

Plot the Cross Validation (Error Rate) Model Comparison

plot(res_cv)

The results show that the estimated error rates for both the SVM and the Random Forest are lower than under the holdout or sub-sampling methods. The SVM error rate ranges from about 32.2% to 34.1% and the Random Forest error rate from about 27.9% to 28.1%.

Bootstrapping

res_boot <- performanceEstimation(
  PredTask(quality ~ .,redWine),
    c(workflowVariants(learner=c("svm"),learner.pars=list(cost=c(1,10),gamma=c(0.5,0.9)) ),
    workflowVariants(learner=c("randomForest"),learner.pars=list(ntree=c(750,1000,2000))) ),
  EstimationTask(metrics="err",method=Bootstrap(nReps=100,type="0.632")) ) # .632 bootstrap: each training sample is drawn with replacement and contains ~63.2% of the distinct original rows
## 
## 
## ##### PERFORMANCE ESTIMATION USING  BOOTSTRAP  #####
## 
## ** PREDICTIVE TASK :: redWine.quality
## 
## ++ MODEL/WORKFLOW :: svm.v1 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: svm.v2 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: svm.v3 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: svm.v4 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v1 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v2 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100
## 
## 
## ++ MODEL/WORKFLOW :: randomForest.v3 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## Iteration :  1  2  3  4  5  6  7  8  9  10  11  12  13  14  15  16  17  18  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97  98  99  100

The bootstrap method draws, for each repetition, a random sample of size N (the size of the dataset) with replacement to use as the training set. The sample therefore has the same number of rows as redWine but may contain repeated rows; on average about 63.2% of the distinct original rows appear in it. The rows that were never drawn (roughly 36.8%) form the test set, and with the ".632" variant this out-of-bag error is combined with the resubstitution estimate (the error rate measured on the training set itself). Bootstrap estimation typically uses a large number of repetitions of the random sampling, here set to 100, and is considered the best approach for estimating performance on very small datasets.
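
A minimal sketch of a single bootstrap repetition (illustrative only; combining the two error components in the ".632" estimate is handled by the package):

set.seed(1234)
boot_idx <- sample(1:nrow(redWine), nrow(redWine), replace=TRUE) # draw N rows with replacement
oob <- setdiff(1:nrow(redWine), boot_idx)                        # rows never drawn (~36.8% on average)
fit <- randomForest(quality ~ ., redWine[boot_idx,])
length(unique(boot_idx))/nrow(redWine)                           # fraction of distinct rows drawn, ~0.632
mean(predict(fit, redWine[oob,]) != redWine$quality[oob])        # out-of-bag error for this repetition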

Selection of Model

summary(res_boot) # Summary of the Workflow output with Error Statistics
## 
## == Summary of a  Bootstrap Performance Estimation Experiment ==
## 
## Task for estimating  err  using
## 100  repetitions of  .632  Bootstrap experiment
##   Run with seed =  1234 
## 
## * Predictive Tasks ::  redWine.quality
## * Workflows  ::  svm.v1, svm.v2, svm.v3, svm.v4, randomForest.v1, randomForest.v2, randomForest.v3 
## 
## -> Task:  redWine.quality
##   *Workflow: svm.v1 
##                err
## avg     0.36663848
## std     0.01788081
## med     0.36486332
## iqr     0.02445405
## min     0.31588133
## max     0.40847458
## invalid 0.00000000
## 
##   *Workflow: svm.v2 
##                err
## avg     0.37709808
## std     0.01833263
## med     0.37628897
## iqr     0.02386107
## min     0.33500000
## max     0.42470389
## invalid 0.00000000
## 
##   *Workflow: svm.v3 
##                err
## avg     0.36559053
## std     0.01739826
## med     0.36684992
## iqr     0.02403445
## min     0.31228070
## max     0.40677966
## invalid 0.00000000
## 
##   *Workflow: svm.v4 
##                err
## avg     0.37225575
## std     0.01680280
## med     0.37165021
## iqr     0.02028048
## min     0.33676976
## max     0.41029900
## invalid 0.00000000
## 
##   *Workflow: randomForest.v1 
##                err
## avg     0.32602125
## std     0.01693073
## med     0.32588136
## iqr     0.02284020
## min     0.28272251
## max     0.36236934
## invalid 0.00000000
## 
##   *Workflow: randomForest.v2 
##                err
## avg     0.32675389
## std     0.01687302
## med     0.32664958
## iqr     0.02131221
## min     0.28272251
## max     0.36236934
## invalid 0.00000000
## 
##   *Workflow: randomForest.v3 
##                err
## avg     0.32712732
## std     0.01730283
## med     0.32711848
## iqr     0.02170427
## min     0.27923211
## max     0.36062718
## invalid 0.00000000
rankWorkflows(res_boot,2) # Top 2 ranked workflow estimates
## $redWine.quality
## $redWine.quality$err
##          Workflow  Estimate
## 1 randomForest.v1 0.3260213
## 2 randomForest.v2 0.3267539
topPerformers(res_boot) # Choose Top Performing model
## $redWine.quality
##            Workflow Estimate
## err randomForest.v1    0.326

Plot the Bootstrapping (Error Rate) Model Comparison

plot(res_boot)

The bootstrap output, using 100 repetitions in which each training sample contains about 63% of the distinct observations, shows that the Random Forest again performs best, with an error rate of roughly 32.6%-32.7%. This compares favourably with the SVM, whose error rates across the four variants range from about 36.6% to 37.7%.

Summary

From the performance estimation on the redWine dataset, using error estimates averaged over multiple scores, we can see that on average the Random Forest classification model outperforms the radial-kernel Support Vector Machine. This was already suggested in Assignment 3, but with performance estimation we can now validate it both visually and numerically.

The Random Forest provides a better solution for the redWine dataset because at each split in a tree it considers only a random subset of about sqrt(p) of the predictors (here p = 11; see the small check below). This prevents a single very strong predictor from dominating every tree, decorrelates the trees and therefore reduces variance compared with other models. Decision trees are also easier to understand and can be presented visually to a non-technical audience, unlike the SVM, which requires a more technical understanding and more hyperparameter tuning to obtain a model of best fit.
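
For reference, a small check of the number of candidate predictors considered at each split, assuming the randomForest default of floor(sqrt(p)) for classification:

floor(sqrt(ncol(redWine)-1)) # p = 11 predictors -> 3 candidates per split
modelRF$mtry                 # mtry actually used by the earlier fitted forest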

When using the performance estimation methods (sub-sampling, cross validation and bootstrapping), the output error rates can be plotted for each model for comparison. The suitability of each method depends largely on the size of the dataset being analysed.

The K-fold cross validation gives the lowest error estimates. CV is considered the most common method for estimating the predictive performance of a model and is generally used for average-sized datasets (less than a few thousand cases). As such, CV is the most appropriate method for this redWine dataset.