This data set is Obtained from and is also available from the UCI machine learning repository,
Of the two data sets, of which one is related to red and and the other to white vinho
verde wine samples. I have chosen to analyse the " Red variety “” which is a variant
of the Portuguese “Vinho Verde” wine.
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Definition of physiochemical: (of or pertaining to both physical and chemical properties, changes, and reactions. of or according to physical chemistry.)
For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests):
Content: (Column Names)
1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol 12 - quality (score between 0 and 10)
Objective: Use machine learning to determine which physiochemical
properties make a wine 'good'
For a complete look at each ensembles accuracy, please see the conclusion
at the end of this report.
Overview
## Loading required package: mlr
## Loading required package: ParamHelpers
## Loading required package: parallelMap
## Loading required package: parallel
## Loading required package: tictoc
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
## ############################# ## Section 1. Data Preperation ## #############################
## Calculate the boundaries for each column
## then apply using the 1.5 * IQR rule, remove outliers
remove_outliers <- function(x, na.rm = TRUE, ...)
{
qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
H <- 1.5 * IQR(x, na.rm = na.rm)
y <- x
y[x < (qnt[1] - H)] <- NA
y[x > (qnt[2] + H)] <- NA
y
}
## use the above formula to remove outliers from each column
fixed.acidity <- remove_outliers(Wine$fixed.acidity)
volatile.acidity <- remove_outliers(Wine$volatile.acidity)
citric.acid <- remove_outliers(Wine$citric.acid)
residual.sugar <- remove_outliers(Wine$residual.sugar)
chlorides <- remove_outliers(Wine$chlorides)
free.sulfur.dioxide <- remove_outliers(Wine$free.sulfur.dioxide)
total.sulfur.dioxide <- remove_outliers(Wine$total.sulfur.dioxide)
density <- remove_outliers(Wine$density)
pH <- remove_outliers(Wine$pH)
sulphates <- remove_outliers(Wine$sulphates)
alcohol <- remove_outliers(Wine$alcohol)
quality <- (Wine$quality) ## remove_outliers
## Combine all columns back into a Data Frame
Wine1 <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
residual.sugar,chlorides,free.sulfur.dioxide,
total.sulfur.dioxide,density,pH,
sulphates,alcohol,quality)
## Check for anomalies and missing values
sum(is.na(Wine1))
## [1] 573
#sum(is.na(Wine1r))
## [1] 0
## [1] 141
## [1] 0
#normalize the variables
Wine1d <- normalizeFeatures(Wine1c,method = "standardize")
#head(Wine1d)
## [1] 81
## [1] 0
##
## min -1.4738585
## lower quartile -0.7698727
## median -0.1978841
## upper quartile 0.5501009
## max 2.4860621
##
## min -1.6553630
## lower quartile -0.8063988
## median -0.1999958
## upper quartile 0.5276878
## max 2.4681775
##
## min -2.1642912
## lower quartile -0.7034142
## median -0.2420846
## upper quartile 0.6036863
## max 2.5258929
## ################################# ## Section 2. Algorithm Application ## #################################
# Shuffle up the newest Data Frame
Wine4 <- Wine2[ sample(nrow(Wine2)),]
head(Wine4) ## Check the rows are shuffled
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 799 1.3725689 -0.9332074 0.8985385 0.9992855 0.1362148
## 382 0.5267980 -0.2617874 0.3766523 1.8009430 -0.8877200
## 983 -0.8571907 1.3252055 -1.1890062 0.4648472 -1.8328907
## 900 0.1423567 2.2102592 -0.2032212 -1.9401253 0.6875644
## 948 -0.9340790 0.8979382 -1.3629683 -0.6040295 0.2149790
## 72 -1.3954086 0.6537854 0.4346397 -1.1384678 0.8450928
## free.sulfur.dioxide total.sulfur.dioxide density pH
## 799 1.61921325 1.12208941 0.58845798 -1.40647741
## 382 -0.56383759 -0.50587796 -0.01881794 -0.03151286
## 983 -0.92767940 -1.03386737 -0.91949684 0.93905035
## 900 -0.92767940 -0.37388060 -1.24701643 -1.81087875
## 948 0.89152964 -0.02188766 0.39740488 0.53464901
## 72 0.04256542 1.07809029 0.32234831 1.01993062
## sulphates alcohol quality
## 799 -0.3181873 -0.5911795 -0.8610351
## 382 0.4400627 0.8689884 1.7964138
## 983 0.0609377 1.1818815 0.4676894
## 900 -1.5503437 -0.3825841 -0.8610351
## 948 1.1035316 -0.6954772 -0.8610351
## 72 1.5774379 -1.1126680 -0.8610351
#str(Wine4) ## Check the structure hasn't changed
# A visual check that the row numbers are all randomly selected,
# Row Numbers are the first column below
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
Wine5 <- as.data.frame(Wine4,row.names = NULL, optional = FALSE)
#str(Wine5)
#head(Wine5) ## Check the rows are shuffled
Wine5$quality <- as.factor(Wine5$quality)
#str(Wine5) #check that this has been changed to factor
## 1) Define the task
## Specify the type of analysis (e.g. classification) and provide data and response variable
task = makeClassifTask(data = Wine5, target = "quality")
## 2) Define the learner
## Choose a specific algorithm (e.g. lda : linear discriminant analysis)
#K Nearest Neighbour
#classif.featureless
#classif.knn
#classif.lda
#classif.ada
#classif.AdaBag
#classif.randomForest
#classif.rda
#classif.ada
#classif.bst
lrn = makeLearner("classif.lda") ############### Model 1
set.seed(1234)
n = nrow(Wine5)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.3299492 0.6700508
acc_lda <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#cat(paste0(" Model accuracy for mlr classif.lda is : ", acc_lda ," %")) # 1 lda
#####----------cross validation for classif.lda--------everything below is for testing
## use 5-fold cross-validation type:
## 5-fold cross-validation Resample description: cross-validation with 3 iterations.
rdesc = makeResampleDesc("CV", iters = 5)
rdesc
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Calculate the performance
r = resample("classif.lda",task , rdesc) #bh.task regr.lm
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.4489796
## [Resample] iter 2: 0.3826531
## [Resample] iter 3: 0.3654822
## [Resample] iter 4: 0.4263959
## [Resample] iter 5: 0.4213198
##
## Aggregated Result: mmce.test.mean=0.4089661
##
r
## Resample Result
## Task: Wine5
## Learner: classif.lda
## Aggr perf: mmce.test.mean=0.4089661
## Runtime: 0.060045
#mse.test.mean
r$aggr
## mmce.test.mean
## 0.4089661
############### Model 2
pred = getRRPredictions(r)
#pred
list(pred)
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.01
## id truth response iter set
## 1 9 0.467689382506315 -0.861035077041974 1 test
## 2 11 0.467689382506315 -0.861035077041974 1 test
## 3 16 -0.861035077041974 0.467689382506315 1 test
## 4 17 1.7964138420546 1.7964138420546 1 test
## 5 18 -0.861035077041974 -0.861035077041974 1 test
## 6 27 0.467689382506315 0.467689382506315 1 test
## ... (#rows: 983, #cols: 5)
#performance(pred, measures = list(mmce))
acc_lda_cv <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %")) # 2 lda cv
## predicted
## true -3.51848399613855 -2.18975953659026
## -3.51848399613855 0 0
## -2.18975953659026 0 0
## -0.861035077041974 1 0
## 0.467689382506315 0 2
## 1.7964138420546 0 0
## 3.12513830160289 0 0
## -err.- 1 2
## predicted
## true -0.861035077041974 0.467689382506315 1.7964138420546
## -3.51848399613855 2 0 0
## -2.18975953659026 22 6 0
## -0.861035077041974 291 111 6
## 0.467689382506315 139 243 42
## 1.7964138420546 4 60 47
## 3.12513830160289 0 6 1
## -err.- 167 183 49
## predicted
## true 3.12513830160289 -err.-
## -3.51848399613855 0 2
## -2.18975953659026 0 28
## -0.861035077041974 0 118
## 0.467689382506315 0 183
## 1.7964138420546 0 64
## 3.12513830160289 0 7
## -err.- 0 402
## mmce acc
## 0.4365482 0.5634518
## mmce acc
## 0.3451777 0.6548223
## mmce acc
## 0.2436548 0.7563452
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.2959184
## [Resample] iter 2: 0.2893401
## [Resample] iter 3: 0.3045685
## [Resample] iter 4: 0.2842640
## [Resample] iter 5: 0.2908163
##
## Aggregated Result: mmce.test.mean=0.2929815
##
## Resample Result
## Task: Wine5
## Learner: classif.randomForest
## Aggr perf: mmce.test.mean=0.2929815
## Runtime: 2.68957
## mmce.test.mean
## 0.2929815
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.01
## id truth response iter set
## 1 1 -0.861035077041974 -0.861035077041974 1 test
## 2 4 -0.861035077041974 -0.861035077041974 1 test
## 3 11 0.467689382506315 0.467689382506315 1 test
## 4 13 0.467689382506315 0.467689382506315 1 test
## 5 17 1.7964138420546 1.7964138420546 1 test
## 6 20 0.467689382506315 0.467689382506315 1 test
## ... (#rows: 983, #cols: 5)
###-------------- Resample
par(mfrow = c(1,1))
## Make predictions on both training and test sets
rdesc = makeResampleDesc("Holdout", predict = "both")
# resample to see if our errors remain in the same ballpark
r = resample("classif.randomForest", task, rdesc, show.info = FALSE)
#r
predList = getRRPredictionList(r) ############### Model 7
predList
## $train
## $train$`1`
## Prediction: 655 observations
## predict.type: response
## threshold:
## time: 0.02
## id truth response
## 301 301 -0.861035077041974 -0.861035077041974
## 155 155 0.467689382506315 0.467689382506315
## 675 675 0.467689382506315 0.467689382506315
## 463 463 1.7964138420546 1.7964138420546
## 368 368 -0.861035077041974 -0.861035077041974
## 764 764 0.467689382506315 0.467689382506315
## ... (#rows: 655, #cols: 3)
##
##
## $test
## $test$`1`
## Prediction: 328 observations
## predict.type: response
## threshold:
## time: 0.02
## id truth response
## 789 789 -0.861035077041974 0.467689382506315
## 423 423 0.467689382506315 0.467689382506315
## 887 887 1.7964138420546 0.467689382506315
## 206 206 1.7964138420546 0.467689382506315
## 794 794 0.467689382506315 1.7964138420546
## 724 724 0.467689382506315 0.467689382506315
## ... (#rows: 328, #cols: 3)
#Below we calculate the mean misclassification error (mmce) on the training and the test data sets.
mmceTrainMean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 5, predict = "both")
r = resample("classif.randomForest", task, rdesc, measures = list(mmce, mmceTrainMean))#classif.rpart
## Resampling: cross-validation
## Measures: mmce.train mmce.test
## [Resample] iter 1: 0.0000000 0.2908163
## [Resample] iter 2: 0.0000000 0.3654822
## [Resample] iter 3: 0.0000000 0.2653061
## [Resample] iter 4: 0.0000000 0.2944162
## [Resample] iter 5: 0.0000000 0.2791878
##
## Aggregated Result: mmce.test.mean=0.2990417,mmce.train.mean=0.0000000
##
#r
M <- unlist(r$aggr)
#M[1]
acc_ranFor_resample <- round( (1-M[1])*100,2)
#acc_ranFor_resample <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %")) # 7 random #Forest resample
#############-------------- using spFSR
#head(Wine5) ## a before check use wine4
data <- Wine5
#head(data) ## an after check
## last column is the target variable Y
Y <- data %>% pull(quality)
## other columns make up the feature matrix
X <- data %>% select(-quality)
## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")
# View(my.task) ## have a wee peek to see whats what
# str(my.task)
## set the performance measure
my.measure <- mmce ## mean misclassification error
## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.knn", k = 1)
## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")
################################################ ############### Model 8
### compute performance with full set of features
my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3) ###folds = 5 <- changed the folds 3
repcv.full <- resample(my.wrapper,
my.task,
my.rdesc,
measures = my.measure)
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3963415
## [Resample] iter 2: 0.4146341
## [Resample] iter 3: 0.3394495
## [Resample] iter 4: 0.3597561
## [Resample] iter 5: 0.3932927
## [Resample] iter 6: 0.4495413
## [Resample] iter 7: 0.3993902
## [Resample] iter 8: 0.3700306
## [Resample] iter 9: 0.3689024
##
## Aggregated Result: mmce.test.mean=0.3879265
##
result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))
## Repeated CV error with full set of features = 38.8
acc_spFSR <- round( 100 - (round(result.full.mean, 3))*100)
#acc_rf_small
#cat(paste0(" Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %")) # 8 spFSR knn
## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.rpart") #, k = 1)
## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")
################################################
### compute performance with full set of features
my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3) ###folds = 5 <- changed the folds 3
repcv.full <- resample(my.wrapper, ############### Model 9
my.task,
my.rdesc,
measures = my.measure)
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3993902
## [Resample] iter 2: 0.4237805
## [Resample] iter 3: 0.4250765
## [Resample] iter 4: 0.4525994
## [Resample] iter 5: 0.4146341
## [Resample] iter 6: 0.4207317
## [Resample] iter 7: 0.4268293
## [Resample] iter 8: 0.4176829
## [Resample] iter 9: 0.4373089
##
## Aggregated Result: mmce.test.mean=0.4242259
##
result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))
## Repeated CV error with full set of features = 42.4
acc_rf_small <- round(( 1 - (result.full.mean))*100,2)
# acc_rf_small <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_rf_small
#cat(paste0(" Model accuracy for spFSR rpart is : ", acc_rf_small ," %")) # 9 spFSR rpart
## # A tibble: 11 x 2
## rowname quality
## <chr> <dbl>
## 1 fixed.acidity 0.146
## 2 volatile.acidity -0.359
## 3 citric.acid 0.246
## 4 residual.sugar 0.0445
## 5 chlorides -0.147
## 6 free.sulfur.dioxide 0.00573
## 7 total.sulfur.dioxide -0.184
## 8 density -0.212
## 9 pH -0.109
## 10 sulphates 0.447
## 11 alcohol 0.486
## [1] " "
## [1] "volatile.acidity"
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] "sulphates"
## [1] "alcohol"
#set.seed(1234)
Wine8 <- as.data.frame(Wine8 %>% ## omit missing values
na.omit() %>%
select_if(is.numeric))
#head(Wine8)
## Check for anomalies and missing values
sum(is.na(Wine8))
## [1] 0
#normalize the variables
Wine8 <- normalizeFeatures(Wine8,method = "standardize")
## 'data.frame': 983 obs. of 4 variables:
## $ volatile.acidity: num -0.933 -0.262 1.325 2.21 0.898 ...
## $ sulphates : num -0.3182 0.4401 0.0609 -1.5503 1.1035 ...
## $ alcohol : num -0.591 0.869 1.182 -0.383 -0.695 ...
## $ quality : Factor w/ 6 levels "-3.51848399613855",..: 3 5 4 3 3 3 4 3 4 4 ...
## mmce acc
## 0.3451777 0.6548223
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.4603659
## [Resample] iter 2: 0.4207317
## [Resample] iter 3: 0.3914373
## [Resample] iter 4: 0.4115854
## [Resample] iter 5: 0.4097859
## [Resample] iter 6: 0.4054878
## [Resample] iter 7: 0.4176829
## [Resample] iter 8: 0.4054878
## [Resample] iter 9: 0.3883792
##
## Aggregated Result: mmce.test.mean=0.4123271
##
## Repeated CV error with full set of features = 41.2
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3520408
## [Resample] iter 2: 0.4438776
## [Resample] iter 3: 0.3553299
## [Resample] iter 4: 0.5126904
## [Resample] iter 5: 0.4213198
##
## Aggregated Result: mmce.test.mean=0.4170517
##
## Resample Result
## Task: Wine5
## Learner: classif.rpart
## Aggr perf: mmce.test.mean=0.4170517
## Runtime: 0.196188
## mmce.test.mean
## 0.4170517
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.00
## id truth response iter set
## 1 13 0.467689382506315 0.467689382506315 1 test
## 2 17 1.7964138420546 1.7964138420546 1 test
## 3 18 -0.861035077041974 -0.861035077041974 1 test
## 4 21 0.467689382506315 0.467689382506315 1 test
## 5 24 0.467689382506315 0.467689382506315 1 test
## 6 25 0.467689382506315 0.467689382506315 1 test
## ... (#rows: 983, #cols: 5)
## predicted
## true -3.51848399613855 -2.18975953659026
## -3.51848399613855 0 0
## -2.18975953659026 0 0
## -0.861035077041974 0 0
## 0.467689382506315 0 0
## 1.7964138420546 0 0
## 3.12513830160289 0 0
## -err.- 0 0
## predicted
## true -0.861035077041974 0.467689382506315 1.7964138420546
## -3.51848399613855 2 0 0
## -2.18975953659026 24 4 0
## -0.861035077041974 281 125 3
## 0.467689382506315 142 253 31
## 1.7964138420546 8 64 39
## 3.12513830160289 0 5 2
## -err.- 176 198 36
## predicted
## true 3.12513830160289 -err.-
## -3.51848399613855 0 2
## -2.18975953659026 0 28
## -0.861035077041974 0 128
## 0.467689382506315 0 173
## 1.7964138420546 0 72
## 3.12513830160289 0 7
## -err.- 0 410
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
###############################################
##### I couldn't load caret library with the rest up top as this has a 'train' call similar to
##### mlr that is why its been loaded here as the caret 'train, over rides the mlr
##### 'train' call
#---------Try a different approach-----------with the algorithm random forest
#-- and switch back to wine 4 (the cleaned version of data set)
library(caret) # hyperparameter tuning
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:mlr':
##
## train
set.seed(1234)
# train/test split P is proportion I'll need to increase for my data set
training_indexs <- createDataPartition(Wine4$quality, p = .8, list = F)
training <- Wine4[training_indexs, ]
testing <- Wine4[-training_indexs, ]
# get predictors
predictors <- training %>% select(-quality) %>% as.matrix()
output <- training$quality
#library(randomForest) # for our model
# train a random forest model
model <- randomForest(x = predictors, y = output,
ntree = 20) # number of trees
# as the number of trees are increased
# the % variance does decrease slightly
# check out the details
#model
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
plot(model, col='red')
print(model) ## summerize the model
##
## Call:
## randomForest(x = predictors, y = output, ntree = 20)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.6181932
## % Var explained: 39.06
# check out our model's root mean squared error on the held out test data
rmse(predict(model, testing), testing$quality) #getting around 76.37035 % here usually
## [1] 0.7350491
acc_caret <- ( 1 - (rmse(predict(model, testing), testing$quality))) ############### Model 13
acc_caret <- round(( 1 - acc_caret)*100, 2)
#cat(paste0(" Model accuracy for caret randomForest is : ", acc_caret ," %")) # 13 caret
## [1] 0.7484778
## Model 1. Model accuracy for mlr classif.lda is : 67.01 %
## Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.1 %
## Model 3. Model accuracy for mlr naive bayes is : 56.35 %
## Model 4. Model accuracy for mlr k nearest neighbour is : 65.48 %
## Model 5. Model accuracy for mlr randomForest is : 75.63 %
## Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : 70.7 %
## Model 7. Model accuracy for mlr randomForest Resample is : 70.1 %
## Model 8. Model accuracy for spFSR k nearest neighbour is : 61 %
## Model 9. Model accuracy for spFSR rpart is : 57.58 %
## Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : 65.48 %
## Model 11. Model accuracy for spFSR rpart 4 columns is : 58.8 %
## Model 12. Model accuracy for spFSR randomForest, 4 columns, 5-fold cross-validation is : 58.3 %
## Model 13. Model accuracy for caret randomForest is : 73.5 %
## Model 14. Model accuracy for caret randomForest 10-fold cross validation is : 74.85 %
## The section below is the above section, grouped together for easier comparison of Like vs Like
## mlr lda (linear discriminant analysis)
## Model 1. Model accuracy for mlr classif.lda is : 67.01 %
## Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.1 %
## mlr naive Bayes
## Model 3. Model accuracy for mlr naive bayes is : 56.35 %
## mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn
## Model 4. Model accuracy for mlr k nearest neighbour is : 65.48 %
## Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : 65.48 %
## Model 8. Model accuracy for spFSR k nearest neighbour is : 61 %
## mlr randomForest (rf) also includes caret (rf)
## Model 5. Model accuracy for mlr randomForest is : 75.63 %
## Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : 70.7 %
## Model 7. Model accuracy for mlr randomForest Resample is : 70.1 %
## Model 13. Model accuracy for caret randomForest is : 73.5 %
## Model 14. Model accuracy for caret randomForest 10-fold cross validation is : 74.85 %
## spFSR rpart and reduced columns spFSR rpart
## Model 9. Model accuracy for spFSR rpart is : 57.58 %
## Model 11. Model accuracy for spFSR rpart 4 columns is : 58.8 %
## Model 12. Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : 58.3 %
The best performing model for my particular data set seemed to be caret followed closely by mlr. Numerous trailing was undertaken and it should be noted that at each rerun a random subset would be chosen which would then be tested for accuracy in each ensemble.
There were cases of occasionally over fitting, but not so many cases on under fitting the model. It should be noted that I couldn’t find a way to assess whether caret was using the same subset as all the others, simply because at the partitioning section, while I could see the structure of the caret subset I couldn’t see the structure of the mlr or spFSR subset.
At all stages of outlier removal for my data set there was a distinct lack of any form of pattern emergence, I can only assume from this that my data set was seriously noisy. Algorithm performances are stated below :
Lowest accuracy -----> naive bayes mlr (from less than or equal to 55% acc)
-----> lda mlr (from 55% - 61% acc)
-----> rpart spFSR (from 58% - 62% acc)
-----> knn spFSR (from 60% - 66% acc)
-----> knn mlr (from 60% - 66% acc)
-----> random Forest mlr (from 68% - 71% acc)
Highest accuracy -----> random Forest caret (from 75% - 82% acc)
(of note: In my last run, mlr actually scored highest, generally though caret scored highest)
NOTE: This was only for a small test of no more than 40 runs (at least 40 new subsets for testing) And should not be considered absolute, this is only what appeared to happen for my data set. Of interest, spFSR and mlr were scoring about the same for accuracy for knn.
It would have been nice had I have been able to get spFSR working correctly to then trail it against mlr and caret using the same algorithms. personally I expected the algorithms to all be similar in their results for similar algorithms, so why caret out performed I’m not sure. I would have liked to have trialed a multiple regression algorithm and also have trialed a logarithmic algorithm, but my report was beginning to get big, at almost double the size of the expected reports I therefore decided to stop.
From all of the above, my goto choices would be caret then either mlr or spFSR.
So, what was all this about ? Could I use any of these model’s as a way of predicting a semi decent wine ? I believe I could, as most models fell within a 60% plus accuracy rating which for me mean’s for each three bottles I chose hopefully at least 2 would be good or better if they contained the required physio-chemical’s. From a wine producers standpoint, if I knew the coefficients for each of the physio-chemical’s from the best model and I were able to alter one of more of the physio-chemical’s, I could produce better wines more often, and thus increase my profit margin.
ERRORS (The only error I could not overcome was in the spFSR feature selection part, an example is given below)
iter value st.dev num.ft best.value
spFSR could not get past this error, and as I could not alter any spFSR parameters because of this, the section was dropped from my assessment. I was therefore unable to use the Feature Selection part of spFSR.