This data set was obtained from, and is also available at, the UCI Machine Learning Repository.
It comprises two data sets, one for red and the other for white vinho
verde wine samples. I have chosen to analyse the red variety, which is a variant
of the Portuguese "Vinho Verde" wine.
Please include this citation if you plan to use this database:
P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.
Definition of physicochemical: of or pertaining to both physical and chemical properties, changes, and reactions; of or according to physical chemistry.
For more information, read [Cortez et al., 2009].
Content (column names):
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Objective: Use machine learning to determine which physicochemical
properties make a wine 'good'.
For a complete look at each model's accuracy, please see the conclusion
at the end of this report.
Overview
# rfNews()
# libraries
library(spFSR)
## Loading required package: mlr
## Loading required package: ParamHelpers
## Loading required package: parallelMap
## Loading required package: parallel
## Loading required package: tictoc
library(randomForest) # for our model
## randomForest 4.6-14
## Type rfNews() to see new features/changes/bug fixes.
library(tidyverse) # general utility functions
## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --
## v ggplot2 2.2.1 v purrr 0.2.4
## v tibble 1.4.2 v dplyr 0.7.5
## v tidyr 0.8.1 v stringr 1.3.1
## v readr 1.1.1 v forcats 0.3.0
## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::combine() masks randomForest::combine()
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()
library(Metrics) # handy evaluation functions
library(rpart)
library(readr)
library(dplyr)
library(mlr)
library(corrr)
knitr::opts_chunk$set(echo = TRUE)
Wine <- read.csv("Wine.csv", colClasses = "numeric")
summary(Wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide
## Min. :0.01200 Min. : 1.00 Min. : 6.00
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00
## Median :0.07900 Median :14.00 Median : 38.00
## Mean :0.08747 Mean :15.87 Mean : 46.47
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00
## Max. :0.61100 Max. :72.00 Max. :289.00
## density pH sulphates alcohol
## Min. :0.9901 Min. :2.740 Min. :0.3300 Min. : 8.40
## 1st Qu.:0.9956 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50
## Median :0.9968 Median :3.310 Median :0.6200 Median :10.20
## Mean :0.9967 Mean :3.311 Mean :0.6581 Mean :10.42
## 3rd Qu.:0.9978 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10
## Max. :1.0037 Max. :4.010 Max. :2.0000 Max. :14.90
## quality
## Min. :3.000
## 1st Qu.:5.000
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
par(mar=c(5,10,4,2)+.1)
## outline = TRUE draws the points beyond the whiskers, which flags potential outliers
boxplot(Wine, horizontal = TRUE, las = 1, outline = TRUE,
        col = heat.colors(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
## Flag outliers in a numeric vector with the 1.5 * IQR rule:
## values below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR are set to NA
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)  # Q1 and Q3
  H <- 1.5 * IQR(x, na.rm = na.rm)                             # fence width
  y <- x
  y[x < (qnt[1] - H)] <- NA   # below the lower fence
  y[x > (qnt[2] + H)] <- NA   # above the upper fence
  y
}
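As a quick check of the 1.5 * IQR rule, the sketch below works through the arithmetic on a small vector. The vector `x` is made up for illustration and is not taken from the wine data.

```r
# Illustrative vector (made up): eight ordinary values and one extreme value
x <- c(2, 3, 3, 4, 4, 5, 5, 6, 30)

q <- quantile(x, probs = c(0.25, 0.75))  # first and third quartiles
h <- 1.5 * IQR(x)                        # 1.5 times the interquartile range

lower <- unname(q[1] - h)                # lower fence
upper <- unname(q[2] + h)                # upper fence

# Values outside the fences are the ones the rule treats as outliers
flagged <- x[x < lower | x > upper]

print(c(lower = lower, upper = upper))
print(flagged)
```

Here the quartiles are 3 and 5, so the fences are 0 and 8 and only the value 30 is flagged.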
## use the function above to remove outliers from each column
fixed.acidity <- remove_outliers(Wine$fixed.acidity)
volatile.acidity <- remove_outliers(Wine$volatile.acidity)
citric.acid <- remove_outliers(Wine$citric.acid)
residual.sugar <- remove_outliers(Wine$residual.sugar)
chlorides <- remove_outliers(Wine$chlorides)
free.sulfur.dioxide <- remove_outliers(Wine$free.sulfur.dioxide)
total.sulfur.dioxide <- remove_outliers(Wine$total.sulfur.dioxide)
density <- remove_outliers(Wine$density)
pH <- remove_outliers(Wine$pH)
sulphates <- remove_outliers(Wine$sulphates)
alcohol <- remove_outliers(Wine$alcohol)
quality <- (Wine$quality) ## target column: kept as-is, no outlier removal
## Combine all columns back into a Data Frame
Wine1 <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
residual.sugar,chlorides,free.sulfur.dioxide,
total.sulfur.dioxide,density,pH,
sulphates,alcohol,quality)
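The per-column calls above can also be written as a single step. This is a minimal sketch on a toy data frame, not the author's code: `df` and the helper `iqr_fence` are hypothetical stand-ins for `Wine` and `remove_outliers`, and the target column is assumed to be named `quality`.

```r
# Hypothetical helper with the same logic as remove_outliers()
iqr_fence <- function(x, na.rm = TRUE) {
  qnt <- quantile(x, probs = c(0.25, 0.75), na.rm = na.rm)
  h <- 1.5 * IQR(x, na.rm = na.rm)
  x[x < qnt[1] - h | x > qnt[2] + h] <- NA
  x
}

# Toy data (made up): two features with one planted outlier each, plus a target
set.seed(1)
df <- data.frame(a = c(rnorm(20), 50),
                 b = c(rnorm(20), -40),
                 quality = rep(5:7, 7))

# Apply the rule to every column except the target
features <- setdiff(names(df), "quality")
cleaned <- df
cleaned[features] <- lapply(df[features], iqr_fence)

# Number of values flagged per column
print(colSums(is.na(cleaned)))
```

Keeping the target out of `features` mirrors the choice above of leaving `quality` unfiltered.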
## Check for anomalies and missing values
sum(is.na(Wine1))
## [1] 573
#sum(is.na(Wine1r))
## Remove rows with missing values on columns specified
## And select column if it is numeric
Wine1a <- as.data.frame(Wine1 %>%
na.omit() %>%
select_if(is.numeric))
## Check for anomalies and missing values
sum(is.na(Wine1a))
## [1] 0
#str(Wine1a)
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine1a, horizontal = TRUE, las = 1, outline = TRUE,
        col = heat.colors(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
## use the function above to remove outliers from each column ***(for a 2nd time)***
fixed.acidity <- remove_outliers(Wine1a$fixed.acidity)
volatile.acidity <- remove_outliers(Wine1a$volatile.acidity)
citric.acid <- remove_outliers(Wine1a$citric.acid)
residual.sugar <- remove_outliers(Wine1a$residual.sugar)
chlorides <- remove_outliers(Wine1a$chlorides)
free.sulfur.dioxide <- remove_outliers(Wine1a$free.sulfur.dioxide)
total.sulfur.dioxide <- remove_outliers(Wine1a$total.sulfur.dioxide)
density <- remove_outliers(Wine1a$density)
pH <- remove_outliers(Wine1a$pH)
sulphates <- remove_outliers(Wine1a$sulphates)
alcohol <- remove_outliers(Wine1a$alcohol)
quality <- (Wine1a$quality) ## target column: kept as-is, no outlier removal
#quality <- ifelse(Wine$quality <= 5 , 0 ,
# ifelse(Wine$quality > 5, 1,0)) ## Change quality to binary
## Combine all columns back into a Data Frame
Wine1b <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
residual.sugar,chlorides,free.sulfur.dioxide,
total.sulfur.dioxide,density,pH,
sulphates,alcohol,quality)
## Check for anomalies and missing values
sum(is.na(Wine1b))
## [1] 141
## Remove rows with missing values on columns specified
## And select column if it is numeric
Wine1c <- as.data.frame(Wine1b %>%
na.omit() %>%
select_if(is.numeric))
## Check for anomalies and missing values
sum(is.na(Wine1c))
## [1] 0
#str(Wine1c)
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine1c, horizontal = TRUE, las = 1, outline = TRUE,
        col = heat.colors(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
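## Standardize the variables for visualization (in the code shown, Wine1d is only plotted;
## the next outlier pass works on the unscaled Wine1c)
Wine1d <- normalizeFeatures(Wine1c,method = "standardize")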
#head(Wine1d)
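For reference, standardization rescales each column to zero mean and unit standard deviation. The sketch below shows the calculation in base R on a made-up vector. It assumes that `normalizeFeatures(..., method = "standardize")` centres and scales in this way, which is how its documentation describes the method.

```r
# Illustrative values (made up)
x <- c(10, 12, 14, 16, 18)

# z-score by hand: subtract the mean, divide by the standard deviation
z_manual <- (x - mean(x)) / sd(x)

# The same result via base R's scale(), which returns a one-column matrix
z_scale <- as.numeric(scale(x))

print(z_manual)
print(all.equal(z_manual, z_scale))
```

After this transformation the columns share a common scale, which is why they can be compared in a single boxplot.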
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine1d, horizontal = TRUE, las = 1, outline = TRUE,
        col = heat.colors(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
## use the function above to remove outliers from each column ***(for a 3rd time)***
fixed.acidity <- remove_outliers(Wine1c$fixed.acidity)
volatile.acidity <- remove_outliers(Wine1c$volatile.acidity)
citric.acid <- remove_outliers(Wine1c$citric.acid)
residual.sugar <- remove_outliers(Wine1c$residual.sugar)
chlorides <- remove_outliers(Wine1c$chlorides)
free.sulfur.dioxide <- remove_outliers(Wine1c$free.sulfur.dioxide)
total.sulfur.dioxide <- remove_outliers(Wine1c$total.sulfur.dioxide)
density <- remove_outliers(Wine1c$density)
pH <- remove_outliers(Wine1c$pH)
sulphates <- remove_outliers(Wine1c$sulphates)
alcohol <- remove_outliers(Wine1c$alcohol)
quality <- (Wine1c$quality) ## target column: kept as-is, no outlier removal
## Combine all columns back into a Data Frame
Wine1e <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
residual.sugar,chlorides,free.sulfur.dioxide,
total.sulfur.dioxide,density,pH,
sulphates,alcohol,quality)
## Check for anomalies and missing values
sum(is.na(Wine1e))
## [1] 81
#sum(is.na(Wine1r))
## Remove rows with missing values on columns specified
## And select column if it is numeric
Wine1f <- as.data.frame(Wine1e %>%
na.omit() %>%
select_if(is.numeric))
## Check for anomalies and missing values
sum(is.na(Wine1f))
## [1] 0
#str(Wine1f)
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine1f, horizontal = TRUE, las = 1, outline = TRUE,
        col = heat.colors(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
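## Standardize the variables (zero mean, unit standard deviation)
## Note: no target is specified, so quality is standardized as well; its classes
## therefore appear as z-scores (e.g. -0.861, 0.468) in the output further down
Wine2 <- normalizeFeatures(Wine1f,method = "standardize")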
#head(Wine2)
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine2, horizontal = TRUE, las = 1, outline = TRUE,
        col = rainbow(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
par(mfrow = c(3,1))
## Inspect specific columns in Wine2 before capping
J <- boxplot(Wine2$total.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "total.sulfur.dioxide",cex.lab=1.5)
K <- boxplot(Wine2$free.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "free.sulfur.dioxide",cex.lab=1.5)
L <- boxplot(Wine2$fixed.acidity,las = 1,horizontal = TRUE, col = 'purple',xlab = "fixed.acidity",cex.lab=1.5)
par(mfrow = c(1,1))
par(mfrow = c(1,3))
mytable <- J$stats
mytable1 <- K$stats
mytable2 <- L$stats
colnames(mytable)<-J$names
colnames(mytable1)<-K$names
colnames(mytable2)<-L$names
rownames(mytable)<-c('min','lower quartile','median','upper quartile','max')
rownames(mytable1)<-c('min','lower quartile','median','upper quartile','max')
rownames(mytable2)<-c('min','lower quartile','median','upper quartile','max')
mytable
##
## min -1.4738585
## lower quartile -0.7698727
## median -0.1978841
## upper quartile 0.5501009
## max 2.4860621
mytable1
##
## min -1.6553630
## lower quartile -0.8063988
## median -0.1999958
## upper quartile 0.5276878
## max 2.4681775
mytable2
##
## min -2.1642912
## lower quartile -0.7034142
## median -0.2420846
## upper quartile 0.6036863
## max 2.5258929
par(mfrow = c(1,1))
# Cap values beyond roughly the upper whisker ('max' above) for these variables in Wine2
Wine2 <- capLargeValues(Wine2,cols = c("total.sulfur.dioxide"),threshold = 2.5)
Wine2 <- capLargeValues(Wine2,cols = c("free.sulfur.dioxide"),threshold = 2.5)
Wine2 <- capLargeValues(Wine2,cols = c("fixed.acidity"),threshold = 2.55)
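The effect of capping can be illustrated in base R. The sketch below assumes, as my reading of the default behaviour of `capLargeValues`, that values whose absolute value exceeds the threshold are replaced by the threshold with the original sign. The helper `cap_abs` and the vector `z` are hypothetical.

```r
# Hypothetical helper: cap values at +/- threshold, keeping the sign
cap_abs <- function(x, threshold) {
  ifelse(abs(x) > threshold, sign(x) * threshold, x)
}

# Illustrative standardized values (made up)
z <- c(-3.1, -1.0, 0.2, 2.4, 2.9)

capped <- cap_abs(z, threshold = 2.5)
print(capped)
```

Unlike the outlier removal earlier, capping keeps every row and only pulls extreme values in to the threshold.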
# M <- boxplot(Wine2$sulphates,las = 1,horizontal = TRUE, col = 'purple') # check boxplot has changed
par(mfrow = c(3,1))
## Re-plot the same columns to check the effect of capping
J <- boxplot(Wine2$total.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "total.sulfur.dioxide",cex.lab=1.5)
K <- boxplot(Wine2$free.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "free.sulfur.dioxide",cex.lab=1.5)
L <- boxplot(Wine2$fixed.acidity,las = 1,horizontal = TRUE, col = 'purple',xlab = "fixed.acidity",cex.lab=1.5)
par(mfrow = c(1,1))
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine2, horizontal = TRUE, las = 1, outline = TRUE,
        col = rainbow(12), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
# Shuffle up the newest Data Frame
Wine4 <- Wine2[ sample(nrow(Wine2)),]
head(Wine4) ## Check the rows are shuffled
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 408 -0.7803025 0.7758618 -1.2469936 -0.06959115 -0.88772003
## 626 -0.7034142 -1.1163220 0.2606776 -0.33681031 0.37250749
## 72 -1.3954086 0.6537854 0.4346397 -1.13846779 0.84509281
## 901 -0.6265259 -0.4449019 -0.4931580 -0.33681031 0.05745061
## 95 -0.8571907 -0.2007492 -0.7830947 0.46484716 -0.25760627
## 18 -1.3185203 -0.8111311 -0.4351706 -1.94012526 0.21497905
## free.sulfur.dioxide total.sulfur.dioxide density pH
## 408 -0.80639880 -0.6378753 0.5270481 1.10081088
## 626 1.98305506 1.2540868 0.3496416 0.93905035
## 72 0.04256542 1.0780903 0.3223483 1.01993062
## 901 -0.56383759 -0.5058780 -0.2303410 -0.35503393
## 95 -0.32127639 -0.4178797 -0.3599842 0.69640955
## 18 -0.44255699 -0.7258735 -0.6329172 0.04936741
## sulphates alcohol quality
## 408 0.4400627 -0.9040726 0.4676894
## 626 -1.3607812 -1.0083703 -0.8610351
## 72 1.5774379 -1.1126680 -0.8610351
## 901 -1.0764374 -0.4868818 0.4676894
## 95 -0.4129686 -0.1739887 0.4676894
## 18 -0.6025311 -1.1126680 -0.8610351
#str(Wine4) ## Check the structure hasn't changed
# A visual check that the rows are in random order:
# the row numbers are the first column of the output above
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
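The shuffle above runs before any seed is set, so the row order will differ between runs. A minimal sketch of a reproducible shuffle follows; the seed value and the toy data frame `df` are assumptions for illustration.

```r
# Toy data frame (made up)
df <- data.frame(id = 1:10, value = letters[1:10])

# Fixing the seed first makes the permutation repeatable
set.seed(42)  # hypothetical seed
shuffled <- df[sample(nrow(df)), ]

# Re-running with the same seed gives the same row order
set.seed(42)
again <- df[sample(nrow(df)), ]

print(head(shuffled))
print(identical(shuffled, again))
```

`sample(nrow(df))` returns a random permutation of the row indices, so every row appears exactly once in the shuffled result.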
Wine5 <- as.data.frame(Wine4,row.names = NULL, optional = FALSE)
#str(Wine5)
#head(Wine5) ## Check the rows are shuffled
Wine5$quality <- as.factor(Wine5$quality)
#str(Wine5) #check that this has been changed to factor
## 1) Define the task
## Specify the type of analysis (e.g. classification) and provide data and response variable
task = makeClassifTask(data = Wine5, target = "quality")
## 2) Define the learner
## Choose a specific algorithm (e.g. lda : linear discriminant analysis)
## Candidate learners (mlr names):
#classif.featureless
#classif.knn
#classif.lda
#classif.ada
#classif.AdaBag
#classif.randomForest
#classif.rda
#classif.bst
lrn = makeLearner("classif.lda") ############### Model 1
set.seed(1234)
n = nrow(Wine5)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n
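The 80/20 split can be checked on its own. With the 983 rows reported in the output of this report, 0.8 * 983 = 786.4, which gives 786 training rows and 197 test rows. The sketch below uses `floor()` to make that truncation explicit, which is an addition for clarity and not part of the original call.

```r
set.seed(1234)
n <- 983  # number of rows reported for the cleaned data

# Draw 80% of the row indices for training; floor() makes the truncation explicit
train.set <- sample(n, size = floor(0.8 * n))

# Everything not drawn for training forms the test set
test.set <- setdiff(seq_len(n), train.set)

print(length(train.set))
print(length(test.set))
print(length(intersect(train.set, test.set)))  # no overlap between the two sets
```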
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.4111675 0.5888325
acc_lda <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#cat(paste0(" Model accuracy for mlr classif.lda is : ", acc_lda ," %")) # 1 lda
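The two measures reported above are complements: accuracy is one minus the mean misclassification error. A small sketch with made-up labels shows the calculation; the vectors `truth` and `response` are illustrative only.

```r
# Illustrative true and predicted classes (made up)
truth    <- c("5", "5", "6", "6", "6", "7", "5", "6")
response <- c("5", "6", "6", "6", "5", "6", "5", "6")

# Mean misclassification error: share of predictions that differ from the truth
mmce_val <- mean(truth != response)

# Accuracy: share of predictions that match the truth
acc_val <- mean(truth == response)

print(c(mmce = mmce_val, acc = acc_val))
```

In this toy case 3 of 8 predictions are wrong, so the error is 0.375 and the accuracy 0.625. By the same logic, the LDA error of 0.4111675 above is consistent with 81 wrong predictions out of 197 test rows.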
#####----------cross validation for classif.lda--------everything below is for testing
## Use 5-fold cross-validation (iters = 5)
rdesc = makeResampleDesc("CV", iters = 5)
rdesc
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Calculate the performance
r = resample("classif.lda",task , rdesc)
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3826531
## [Resample] iter 2: 0.3756345
## [Resample] iter 3: 0.4234694
## [Resample] iter 4: 0.4517766
## [Resample] iter 5: 0.3857868
##
## Aggregated Result: mmce.test.mean=0.4038641
##
r
## Resample Result
## Task: Wine5
## Learner: classif.lda
## Aggr perf: mmce.test.mean=0.4038641
## Runtime: 0.0480461
#mse.test.mean
r$aggr
## mmce.test.mean
## 0.4038641
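The aggregated value is the mean of the five per-fold errors. To show what k-fold cross-validation does under the hood, here is a minimal hand-rolled sketch. It uses the built-in `iris` data and `MASS::lda` as stand-ins for the wine task and the mlr learner, so the numbers are not comparable with those above.

```r
set.seed(1234)
k <- 5

# Assign every row to one of k folds at random
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# For each fold: train on the other folds, test on the held-out fold
fold_error <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- MASS::lda(Species ~ ., data = train)  # MASS ships with R
  pred  <- predict(fit, newdata = test)$class
  mean(pred != test$Species)                     # misclassification rate for this fold
})

# The cross-validated error is the average over the folds
cv_error <- mean(fold_error)

print(fold_error)
print(cv_error)
```

Calling `MASS::lda` with the namespace prefix avoids attaching MASS, which would otherwise mask `dplyr::select`.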
############### Model 2
pred = getRRPredictions(r)
#pred
list(pred)
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.00
## id truth response iter set
## 1 8 -0.861035077041974 -0.861035077041974 1 test
## 2 17 -0.861035077041974 -0.861035077041974 1 test
## 3 34 0.467689382506315 0.467689382506315 1 test
## 4 38 -0.861035077041974 -0.861035077041974 1 test
## 5 51 -0.861035077041974 -0.861035077041974 1 test
## 6 54 -0.861035077041974 0.467689382506315 1 test
## ... (#rows: 983, #cols: 5)
#performance(pred, measures = list(mmce))
acc_lda_cv <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %")) # 2 lda cv
calculateConfusionMatrix(pred) # absolute counts; the -err.- row and column sum the misclassifications
## predicted
## true -3.51848399613855 -2.18975953659026
## -3.51848399613855 0 0
## -2.18975953659026 0 0
## -0.861035077041974 0 0
## 0.467689382506315 0 1
## 1.7964138420546 0 0
## 3.12513830160289 0 0
## -err.- 0 1
## predicted
## true -0.861035077041974 0.467689382506315 1.7964138420546
## -3.51848399613855 2 0 0
## -2.18975953659026 23 5 0
## -0.861035077041974 285 120 4
## 0.467689382506315 134 252 39
## 1.7964138420546 5 57 49
## 3.12513830160289 0 5 2
## -err.- 164 187 45
## predicted
## true 3.12513830160289 -err.-
## -3.51848399613855 0 2
## -2.18975953659026 0 28
## -0.861035077041974 0 124
## 0.467689382506315 0 174
## 1.7964138420546 0 62
## 3.12513830160289 0 7
## -err.- 0 397
# The relative version below is left commented out because it takes up too much space
#conf.matrix = calculateConfusionMatrix(pred, relative = TRUE)
#conf.matrix
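A confusion matrix of this kind can be reproduced with base R's `table()`. The sketch below uses made-up labels; the class names and vectors are illustrative, and the per-class error counts play the role of the `-err.-` column in the output above.

```r
classes <- c("low", "mid", "high")

# Illustrative true and predicted classes (made up)
truth    <- factor(c("low", "low", "mid", "mid", "mid", "high", "low", "mid"),
                   levels = classes)
response <- factor(c("low", "mid", "mid", "mid", "low", "mid", "low", "mid"),
                   levels = classes)

# Rows are the true classes, columns the predicted classes
cm <- table(true = truth, predicted = response)

# Misclassified observations per true class: row total minus the diagonal
row_err <- rowSums(cm) - diag(cm)

print(cm)
print(row_err)
```

Fixing the factor levels up front keeps the matrix square even when a class, such as `high` here, is never predicted.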
##--------------------------- try the algorithm naive bayes
## 2) Define the learner
## Choose a specific algorithm (e.g. linear discriminant analysis)
lrn = makeLearner("classif.naiveBayes") #classif.lda #classif.randomForest
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)
############### Model 3
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.4314721 0.5685279
#performance(pred, measures = list(mmce))
acc_naive <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr naive bayes is : ", acc_naive ," %")) # 3 naive bayes
##--------------------------- try the algorithm nearest neighbour
lrn = makeLearner("classif.knn")
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)
############### Model 4
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.3857868 0.6142132
#performance(pred, measures = list(mmce))
acc_knn <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_knn
#cat(paste0(" Model accuracy mlr k nearest neighbour is : ", acc_knn ," %")) # 4 knn
##--------------------------- try the algorithm randomForest
lrn = makeLearner("classif.randomForest")
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set) ############### Model 5
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.3096447 0.6903553
#performance(pred, measures = list(mmce))
acc_ranFor <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr randomForest is : ", acc_ranFor ," %")) # 5 random Forest
#####----------cross validation for classif.randomForest--------everything below is for testing
## Use 5-fold cross-validation (iters = 5)
rdesc = makeResampleDesc("CV", iters = 5)
rdesc
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Calculate the performance
r = resample("classif.randomForest",task , rdesc)
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.2690355
## [Resample] iter 2: 0.2639594
## [Resample] iter 3: 0.3316327
## [Resample] iter 4: 0.3418367
## [Resample] iter 5: 0.3147208
##
## Aggregated Result: mmce.test.mean=0.3042370
##
r
## Resample Result
## Task: Wine5
## Learner: classif.randomForest
## Aggr perf: mmce.test.mean=0.3042370
## Runtime: 2.5995
#mse.test.mean
r$aggr ############### Model 6
## mmce.test.mean
## 0.304237
#r$measures.test ## same as r above
#r$pred
# Resampled Prediction for:
# Resample description: subsampling with 5 iterations and 0.80 split rate.
# Predict: test
pred = getRRPredictions(r)
#pred
list(pred)
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.01
## id truth response iter set
## 1 9 -0.861035077041974 -0.861035077041974 1 test
## 2 11 0.467689382506315 0.467689382506315 1 test
## 3 15 -0.861035077041974 -0.861035077041974 1 test
## 4 17 -0.861035077041974 -0.861035077041974 1 test
## 5 25 -0.861035077041974 -0.861035077041974 1 test
## 6 28 -0.861035077041974 -0.861035077041974 1 test
## ... (#rows: 983, #cols: 5)
#performance(pred, measures = list(mmce))
acc_ranFor_cv <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %")) #6 #random Forest cv
###-------------- Resample
par(mfrow = c(1,1))
## Make predictions on both training and test sets
rdesc = makeResampleDesc("Holdout", predict = "both")
# resample to see if our errors remain in the same ballpark
r = resample("classif.randomForest", task, rdesc, show.info = FALSE)
#r
predList = getRRPredictionList(r) ############### Model 7
predList
## $train
## $train$`1`
## Prediction: 655 observations
## predict.type: response
## threshold:
## time: 0.01
## id truth response
## 158 158 -0.861035077041974 -0.861035077041974
## 288 288 1.7964138420546 1.7964138420546
## 749 749 -0.861035077041974 -0.861035077041974
## 304 304 1.7964138420546 1.7964138420546
## 334 334 -0.861035077041974 -0.861035077041974
## 360 360 -0.861035077041974 -0.861035077041974
## ... (#rows: 655, #cols: 3)
##
##
## $test
## $test$`1`
## Prediction: 328 observations
## predict.type: response
## threshold:
## time: 0.01
## id truth response
## 400 400 -0.861035077041974 -0.861035077041974
## 145 145 0.467689382506315 -0.861035077041974
## 300 300 0.467689382506315 0.467689382506315
## 861 861 -0.861035077041974 -0.861035077041974
## 450 450 1.7964138420546 0.467689382506315
## 408 408 -0.861035077041974 -0.861035077041974
## ... (#rows: 328, #cols: 3)
#Below we calculate the mean misclassification error (mmce) on the training and the test data sets.
mmceTrainMean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 5, predict = "both")
r = resample("classif.randomForest", task, rdesc, measures = list(mmce, mmceTrainMean))#classif.rpart
## Resampling: cross-validation
## Measures: mmce.train mmce.test
## [Resample] iter 1: 0.0000000 0.2959184
## [Resample] iter 2: 0.0000000 0.3502538
## [Resample] iter 3: 0.0000000 0.3299492
## [Resample] iter 4: 0.0000000 0.2944162
## [Resample] iter 5: 0.0000000 0.2857143
##
## Aggregated Result: mmce.test.mean=0.3112504,mmce.train.mean=0.0000000
##
#r
M <- unlist(r$aggr)
#M[1]
acc_ranFor_resample <- round( (1-M[1])*100,2)
#acc_ranFor_resample <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0(" Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %")) # 7 random #Forest resample
#############-------------- using spFSR
#head(Wine5) ## a before check use wine4
data <- Wine5
#head(data) ## an after check
## last column is the target variable Y
Y <- data %>% pull(quality)
## other columns make up the feature matrix
X <- data %>% select(-quality)
## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")
# View(my.task) ## have a wee peek to see whats what
# str(my.task)
## set the performance measure
my.measure <- mmce ## mean misclassification error
## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.knn", k = 1)
## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")
################################################ ############### Model 8
### compute performance with full set of features
my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3) ### folds changed from 5 to 3
repcv.full <- resample(my.wrapper,
my.task,
my.rdesc,
measures = my.measure)
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3902439
## [Resample] iter 2: 0.4268293
## [Resample] iter 3: 0.3302752
## [Resample] iter 4: 0.3841463
## [Resample] iter 5: 0.4359756
## [Resample] iter 6: 0.3975535
## [Resample] iter 7: 0.4298780
## [Resample] iter 8: 0.3792049
## [Resample] iter 9: 0.3871951
##
## Aggregated Result: mmce.test.mean=0.3957002
##
result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))
## Repeated CV error with full set of features = 39.6
acc_spFSR <- round( 100 - (round(result.full.mean, 3))*100)
#acc_rf_small
#cat(paste0(" Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %")) # 8 spFSR knn
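Repeated cross-validation with 3 folds and 3 repetitions yields the 9 iterations listed above, each with its own random fold assignment. A minimal hand-rolled sketch follows, using the built-in `iris` data and `class::knn` with k = 1 as stand-ins for the wine task and the mlr learner.

```r
set.seed(1234)
X <- iris[, 1:4]     # feature matrix
y <- iris$Species    # target
n_folds <- 3
n_reps  <- 3

errors <- numeric(0)
for (r in seq_len(n_reps)) {
  # A fresh random fold assignment for every repetition
  fold_id <- sample(rep(seq_len(n_folds), length.out = nrow(X)))
  for (f in seq_len(n_folds)) {
    test <- fold_id == f
    pred <- class::knn(train = X[!test, ], test = X[test, ],
                       cl = y[!test], k = 1)   # class ships with R
    errors <- c(errors, mean(pred != y[test]))
  }
}

print(length(errors))  # one error per fold and repetition
print(mean(errors))    # aggregated repeated-CV error
```

Averaging over repetitions reduces the dependence of the estimate on any single fold assignment.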
## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.rpart") #, k = 1)
## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")
################################################
### compute performance with full set of features
my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3) ### folds changed from 5 to 3
repcv.full <- resample(my.wrapper, ############### Model 9
my.task,
my.rdesc,
measures = my.measure)
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3932927
## [Resample] iter 2: 0.4678899
## [Resample] iter 3: 0.4085366
## [Resample] iter 4: 0.4281346
## [Resample] iter 5: 0.4207317
## [Resample] iter 6: 0.4298780
## [Resample] iter 7: 0.4298780
## [Resample] iter 8: 0.4085366
## [Resample] iter 9: 0.4923547
##
## Aggregated Result: mmce.test.mean=0.4310259
##
result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))
## Repeated CV error with full set of features = 43.1
acc_rf_small <- round(( 1 - (result.full.mean))*100,2) ## note: holds the rpart accuracy, despite the "rf" in its name
# acc_rf_small <- round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_rf_small
#cat(paste0(" Model accuracy for spFSR rpart is : ", acc_rf_small ," %")) # 9 spFSR rpart
#Wine %>% correlate() %>% focus(quality) ## <- used to compare to the original unaltered data set
Wine4 %>% correlate() %>% focus(quality)
## # A tibble: 11 x 2
## rowname quality
## <chr> <dbl>
## 1 fixed.acidity 0.146
## 2 volatile.acidity -0.359
## 3 citric.acid 0.246
## 4 residual.sugar 0.0445
## 5 chlorides -0.147
## 6 free.sulfur.dioxide 0.00573
## 7 total.sulfur.dioxide -0.184
## 8 density -0.212
## 9 pH -0.109
## 10 sulphates 0.447
## 11 alcohol 0.486
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
## [1] " "
## [1] "volatile.acidity"
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] " "
## [1] "sulphates"
## [1] "alcohol"
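The code that produced the list above is not shown. One way to obtain such a selection is to keep the features whose absolute correlation with the target exceeds a threshold. The sketch below is an assumption about that approach, not the author's code: the toy data, the name `reduced` and the threshold of 0.3 are all hypothetical. A threshold of 0.3 would single out volatile.acidity, sulphates and alcohol from the correlation table above.

```r
# Toy data (made up): quality depends on alcohol and sulphates but not on pH
set.seed(7)
n <- 200
toy <- data.frame(alcohol = rnorm(n), sulphates = rnorm(n), pH = rnorm(n))
toy$quality <- 0.8 * toy$alcohol + 0.6 * toy$sulphates + rnorm(n, sd = 0.5)

# Correlation of every feature with the target, as a named vector
feature_names <- setdiff(names(toy), "quality")
r <- cor(toy[, feature_names], toy$quality)[, 1]

# Keep features whose absolute correlation exceeds the (assumed) threshold
threshold <- 0.3
selected <- names(r)[abs(r) > threshold]

# Reduced data set: selected features plus the target
reduced <- toy[, c(selected, "quality")]

print(round(r, 3))
print(selected)
```

Using the absolute value keeps strongly negative predictors, such as volatile.acidity in the wine data, as well as positive ones.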
## Visualize new data set
par(mar=c(5,10,4,2)+.1)
## outliers show as points beyond the whiskers
boxplot(Wine8, horizontal = TRUE, las = 1, outline = TRUE,
        col = rainbow(8), cex.axis = 0.8)
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
#set.seed(1234)
Wine8 <- as.data.frame(Wine8 %>% ## omit missing values
na.omit() %>%
select_if(is.numeric))
#head(Wine8)
## Check for anomalies and missing values
sum(is.na(Wine8))
## [1] 0
#normalize the variables
Wine8 <- normalizeFeatures(Wine8,method = "standardize")
Wine8$quality <- as.factor(Wine8$quality)
str(Wine8)
## 'data.frame': 983 obs. of 4 variables:
## $ volatile.acidity: num 0.776 -1.116 0.654 -0.445 -0.201 ...
## $ sulphates : num 0.44 -1.361 1.577 -1.076 -0.413 ...
## $ alcohol : num -0.904 -1.008 -1.113 -0.487 -0.174 ...
## $ quality : Factor w/ 6 levels "-3.51848399613855",..: 4 3 3 4 4 3 5 3 3 2 ...
set.seed(1234)
n = nrow(Wine8)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n
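## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
## Note: no task is built from Wine8 in the code shown, so `task` (and `lrn`) appear to
## still refer to the Wine5 objects defined earlier; this would explain why the accuracy
## reported below for Model 10 equals that of Model 4 (61.42 %)
model = train(lrn, task, subset = train.set )#train.set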
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)
##--------------------------- try the algorithm nearest neighbour
lrn = makeLearner("classif.knn")
## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set
## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set) ############### Model 10
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))
## mmce acc
## 0.3857868 0.6142132
N <- unlist(performance(pred))
#N[1]
acc_knn_small <- round(( 1 - (N[1]))*100,2)
#acc_knn_small
#cat(paste0(" Model accuracy for k nearest neighbour 4 columns is : ", acc_knn_small ," %")) # 10 knn small
#############-------------- using spFSR
#head(Wine5) ## a before check use wine4
data <- Wine8
#head(data) ## an after check
## last column is the target variable Y
Y <- data %>% pull(quality)
## other columns make up the feature matrix
X <- data %>% select(-quality)
## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")
# View(my.task) ## have a wee peek to see whats what
# str(my.task)
## set the performance measure
my.measure <- mmce ## mean misclassification error
## set the wrapper classification algorithm
## my.wrapper <- makeLearner("classif.knn", k = 1) # classif.randomForest classif.knn
## you can try other algorithms as well
my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")
################################################
### compute performance with full set of features
my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3) ###folds = 5 <- changed the folds 3
repcv.full <- resample(my.wrapper,
my.task, ############### Model 11
my.rdesc,
measures = my.measure)
## Resampling: repeated cross-validation
## Measures: mmce
## [Resample] iter 1: 0.4342508
## [Resample] iter 2: 0.4176829
## [Resample] iter 3: 0.4512195
## [Resample] iter 4: 0.4495413
## [Resample] iter 5: 0.4207317
## [Resample] iter 6: 0.4268293
## [Resample] iter 7: 0.4725610
## [Resample] iter 8: 0.4006116
## [Resample] iter 9: 0.4359756
##
## Aggregated Result: mmce.test.mean=0.4343782
##
result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))
## Repeated CV error with full set of features = 43.4
acc_rp_smallspFSR <- (1 - (1 * round(result.full.mean, 3)) )* 100
#acc
#performance(pred, measures = list(my.measure, acc))
#cat("Current working dir: ", acc)
#cat(paste0(" Model accuracy spFSR rpart is : ", acc_rp_smallspFSR ," %")) # 11 rpart cv small
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
#####----------cross validation for classif.rpart--------everything below is for testing
## use 5-fold cross-validation:
rdesc = makeResampleDesc("CV", iters = 5)
rdesc
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## Calculate the performance
r = resample("classif.rpart",task , rdesc) #bh.task regr.lm classif.randomForest
## Resampling: cross-validation
## Measures: mmce
## [Resample] iter 1: 0.3877551
## [Resample] iter 2: 0.4263959
## [Resample] iter 3: 0.3826531
## [Resample] iter 4: 0.4416244
## [Resample] iter 5: 0.4517766
##
## Aggregated Result: mmce.test.mean=0.4180410
##
r
## Resample Result
## Task: Wine5
## Learner: classif.rpart
## Aggr perf: mmce.test.mean=0.4180410
## Runtime: 0.065068
#mse.test.mean
r$aggr
## mmce.test.mean
## 0.418041
#r$measures.test ## same as r above
#r$pred
# Resampled Prediction for:
# Resample description: subsampling with 5 iterations and 0.80 split rate.
# Predict: test
############### Model 12
pred = getRRPredictions(r)
#pred
list(pred)
## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold:
## time (mean): 0.00
## id truth response iter set
## 1 5 0.467689382506315 0.467689382506315 1 test
## 2 6 -0.861035077041974 -0.861035077041974 1 test
## 3 12 -0.861035077041974 0.467689382506315 1 test
## 4 24 0.467689382506315 0.467689382506315 1 test
## 5 29 0.467689382506315 -0.861035077041974 1 test
## 6 33 0.467689382506315 -0.861035077041974 1 test
## ... (#rows: 983, #cols: 5)
## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
## performance(pred, measures = list(mmce, acc))
acc_rf_smallspFSRresample <- ( round( (1 - r$aggr), 3)) *100
## NOTE: the learner resampled above is classif.rpart, despite the 'rf' in the variable name
#cat(paste0(" Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : ",
# acc_rf_smallspFSRresample ," %"),sep="\n") # 12 rpart cv small
##                     predicted
## true                 -3.51848399613855  -2.18975953659026 -0.861035077041974  0.467689382506315    1.7964138420546   3.12513830160289 -err.-
## -3.51848399613855                    0                  0                  2                  0                  0                  0      2
## -2.18975953659026                    0                  0                 21                  7                  0                  0     28
## -0.861035077041974                   0                  0                290                112                  7                  0    119
## 0.467689382506315                    0                  0                148                235                 43                  0    191
## 1.7964138420546                      0                  0                 13                 51                 47                  0     64
## 3.12513830160289                     0                  0                  0                  5                  2                  0      7
## -err.-                               0                  0                184                175                 52                  0    411
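The accuracy reported for Model 12 can be checked by hand from the confusion matrix above: the correct predictions sit on the diagonal, and accuracy is their share of all predictions. The sketch below re-enters the printed counts manually (the class labels, which look like standardized quality scores, are shortened to three decimals here for readability, and the object names are only illustrative).

```r
# Counts copied from the confusion matrix printed above
# (rows = true class, columns = predicted class).
# Labels are shortened to three decimals for readability.
classes <- c("-3.518", "-2.190", "-0.861", "0.468", "1.796", "3.125")
conf_mat <- matrix(
  c(0, 0,   2,   0,  0, 0,
    0, 0,  21,   7,  0, 0,
    0, 0, 290, 112,  7, 0,
    0, 0, 148, 235, 43, 0,
    0, 0,  13,  51, 47, 0,
    0, 0,   0,   5,  2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(true = classes, predicted = classes)
)

n_total   <- sum(conf_mat)        # 983 pooled test predictions
n_correct <- sum(diag(conf_mat))  # 290 + 235 + 47 = 572
accuracy  <- n_correct / n_total
round(100 * accuracy, 1)          # 58.2
```

The pooled figure (572 / 983) agrees with the 58.2 % derived from the aggregated mmce. The matrix also shows that only the three middle classes were ever predicted.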
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
###############################################
##### I couldn't load the caret library with the rest at the top because it has a 'train' function
##### with the same name as mlr's; it is loaded here instead, as caret's 'train' overrides (masks)
##### mlr's 'train'
#---------Try a different approach-----------with the random forest algorithm
#-- and switch back to Wine4 (the cleaned version of the data set)
library(caret) # hyperparameter tuning
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## The following object is masked from 'package:mlr':
##
## train
set.seed(1234)
# train/test split P is proportion I'll need to increase for my data set
training_indexs <- createDataPartition(Wine4$quality, p = .8, list = F)
training <- Wine4[training_indexs, ]
testing <- Wine4[-training_indexs, ]
# get predictors
predictors <- training %>% select(-quality) %>% as.matrix()
output <- training$quality
#library(randomForest) # for our model
# train a random forest model
model <- randomForest(x = predictors, y = output,
ntree = 20) # number of trees
# as the number of trees are increased
# the % variance does decrease slightly
# check out the details
#model
par(mfrow = c(1,1)) ## Reset columns to original setting
par(mar=c(5,4,4,2)) ## Reset margins to Original setting
plot(model, col='red')
print(model) ## summarize the model
##
## Call:
## randomForest(x = predictors, y = output, ntree = 20)
## Type of random forest: regression
## Number of trees: 20
## No. of variables tried at each split: 3
##
## Mean of squared residuals: 0.599516
## % Var explained: 40.97
# check out our model's root mean squared error (RMSE) on the held-out test data
# (RMSE is an error in quality-score units, not a percentage; lower is better)
rmse(predict(model, testing), testing$quality) # usually around 0.76 here
## [1] 0.7009149
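## NOTE: the two lines below reduce to 1 - (1 - RMSE), i.e. 100 * RMSE, so the value reported
## for Model 13 is a scaled regression error rather than a classification accuracy
acc_caret <- ( 1 - (rmse(predict(model, testing), testing$quality))) ############### Model 13
acc_caret <- round(( 1 - acc_caret)*100, 2)
#cat(paste0(" Model accuracy for caret randomForest is : ", acc_caret ," %")) # 13 caret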
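Because the two assignments above work out to 100 × RMSE, it may help to see how an RMSE and an accuracy-style figure differ on the same predictions. The sketch below uses made-up values (`truth` and `pred` are hypothetical, not taken from the wine data) and turns regression output into classes by rounding to the nearest score, which is only one possible convention.

```r
# Hypothetical true and predicted quality scores (illustrative values only)
truth <- c(5, 6, 5, 7, 6, 5, 4, 6)
pred  <- c(5.2, 5.6, 5.9, 6.4, 6.1, 5.3, 5.1, 6.6)

# RMSE: typical size of the error in quality-score units (lower is better)
rmse_value <- sqrt(mean((pred - truth)^2))

# Accuracy-style figure: round each prediction to the nearest score,
# then count how often it matches the true score (higher is better)
pred_class <- round(pred)
accuracy   <- mean(pred_class == truth)

round(100 * rmse_value, 1)  # 61.6
100 * accuracy              # 50
```

The two numbers answer different questions and move in opposite directions as a model improves, so a figure built from RMSE is not directly comparable with the mlr accuracies.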
##----------Cross validation by adding a control to the model
# set up the control parameter for 10-fold cross-validation with 3 repetitions:
control = trainControl(method="repeatedcv", number=10, repeats=3)
# refit the random forest, passing the control object
# NOTE: trControl is an argument of caret::train(); randomForest() does not appear to use it,
# so this fit is probably not cross-validated
model <- randomForest(x = predictors, y = output,
ntree = 20, trControl=control) # number of trees
############### Model 14
#Check model for any improvements
rmse(predict(model, testing), testing$quality)
## [1] 0.7068028
## same calculation as for Model 13: the result equals 100 * RMSE
acc_caret_cv <- ( 1 - (rmse(predict(model, testing), testing$quality)))
acc_caret_cv <- round(( 1 - acc_caret_cv)*100, 2)
#cat(paste0(" Model accuracy for caret randomForest 10-fold cross validation is : ", acc_caret_cv ," %")) # 14 # caret
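For comparison, this is roughly what 10-fold cross-validation repeated 3 times does when written out by hand. It is a minimal sketch under stated assumptions: the built-in `mtcars` data and a linear model stand in for the wine data and the random forest, and `repeated_cv_rmse` is a hypothetical helper, not a function from caret or mlr.

```r
set.seed(1234)

# Repeated k-fold cross-validation by hand.
# mtcars and lm() are stand-ins chosen so the example runs with base R only.
repeated_cv_rmse <- function(data, formula, response, folds = 10, reps = 3) {
  scores <- numeric(0)
  for (r in seq_len(reps)) {
    # Randomly assign every row to one of `folds` groups
    fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
    for (k in seq_len(folds)) {
      train_rows <- data[fold_id != k, , drop = FALSE]
      test_rows  <- data[fold_id == k, , drop = FALSE]
      fit  <- lm(formula, data = train_rows)
      pred <- predict(fit, newdata = test_rows)
      # RMSE on the held-out fold only
      scores <- c(scores, sqrt(mean((pred - test_rows[[response]])^2)))
    }
  }
  scores
}

cv_scores <- repeated_cv_rmse(mtcars, mpg ~ wt + hp, response = "mpg")
length(cv_scores)  # 30 held-out RMSE values (10 folds x 3 repeats)
mean(cv_scores)    # cross-validated RMSE estimate
```

Each model is scored only on rows it never saw during fitting, and the 30 scores are averaged. A single fit scored once on one test set, as in the two caret-section models above, does not give this kind of estimate.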
cat(paste0("Model 1. Model accuracy for mlr classif.lda is : ", acc_lda ," %"),sep="\n") # 1 lda
## Model 1. Model accuracy for mlr classif.lda is : 58.88 %
cat(paste0("Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"),sep="\n") # 2 # lda cv
## Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.61 %
cat(paste0("Model 3. Model accuracy for mlr naive bayes is : ", acc_naive ," %"),sep="\n") # 3 naive bayes
## Model 3. Model accuracy for mlr naive bayes is : 56.85 %
cat(paste0("Model 4. Model accuracy for mlr k nearest neighbour is : ", acc_knn ," %"),sep="\n") # 4 knn
## Model 4. Model accuracy for mlr k nearest neighbour is : 61.42 %
cat(paste0("Model 5. Model accuracy for mlr randomForest is : ", acc_ranFor ," %"),sep="\n") # 5 random Forest
## Model 5. Model accuracy for mlr randomForest is : 69.04 %
cat(paste0("Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %"),sep="\n") # #6 rf cv
## Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : 69.58 %
cat(paste0("Model 7. Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"),sep="\n") # 7 rf #resample
## Model 7. Model accuracy for mlr randomForest Resample is : 68.87 %
cat(paste0("Model 8. Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"),sep="\n") # 8 spFSR knn
## Model 8. Model accuracy for spFSR k nearest neighbour is : 60 %
cat(paste0("Model 9. Model accuracy for spFSR rpart is : ", acc_rf_small ," %"),sep="\n") # 9 spFSR rpart
## Model 9. Model accuracy for spFSR rpart is : 56.9 %
cat(paste0("Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : ", acc_knn_small ," %"),sep="\n") # 10 #knn small
## Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : 61.42 %
cat(paste0("Model 11. Model accuracy for spFSR rpart 4 columns is : ", acc_rp_smallspFSR ," %"),sep="\n") # 11 rpart #cv small
## Model 11. Model accuracy for spFSR rpart 4 columns is : 56.6 %
cat(paste0("Model 12. Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : ",
acc_rf_smallspFSRresample ," %"),sep="\n") # 12 rpart cv small
## Model 12. Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : 58.2 %
cat(paste0("Model 13. Model accuracy for caret randomForest is : ", acc_caret ," %"),sep="\n") # 13 caret
## Model 13. Model accuracy for caret randomForest is : 70.09 %
cat(paste0("Model 14. Model accuracy for caret randomForest 10-fold cross validation is : ",
acc_caret_cv ," %")) # 14 # caret
## Model 14. Model accuracy for caret randomForest 10-fold cross validation is : 70.68 %
##################################################################
cat(paste0(),sep="\n")
cat(paste0(),sep="\n")
cat(paste0(),sep="\n")
cat(paste0(" The section below is the above section, grouped together for easier comparison of Like vs Like "),sep="\n")
## The section below is the above section, grouped together for easier comparison of Like vs Like
cat(paste0(),sep="\n")
cat(paste0(),sep="\n")
###################################################################
cat(paste0(" mlr lda (linear discriminant analysis)"),sep="\n")
## mlr lda (linear discriminant analysis)
cat(paste0(),sep="\n")
cat(paste0(" Model 1. Model accuracy for mlr classif.lda is : ", acc_lda ," %"),sep="\n") # 1 lda
## Model 1. Model accuracy for mlr classif.lda is : 58.88 %
cat(paste0(" Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"),sep="\n") # 2 # lda cv
## Model 2. Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.61 %
cat(paste0(),sep="\n")
cat(paste0(" mlr naive Bayes "),sep="\n")
## mlr naive Bayes
cat(paste0(),sep="\n")
cat(paste0(" Model 3. Model accuracy for mlr naive bayes is : ", acc_naive ," %"),sep="\n") # 3 naive bayes
## Model 3. Model accuracy for mlr naive bayes is : 56.85 %
cat(paste0(),sep="\n")
cat(paste0(" mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn"),sep="\n")
## mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn
cat(paste0(),sep="\n")
cat(paste0(" Model 4. Model accuracy for mlr k nearest neighbour is : ", acc_knn ," %"),sep="\n") # 4 knn
## Model 4. Model accuracy for mlr k nearest neighbour is : 61.42 %
cat(paste0(" Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : ", acc_knn_small ," %"),sep="\n") # 10 #knn small
## Model 10. Model accuracy for mlr k nearest neighbour 4 columns is : 61.42 %
cat(paste0(" Model 8. Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"),sep="\n") # 8 spFSR knn
## Model 8. Model accuracy for spFSR k nearest neighbour is : 60 %
cat(paste0(),sep="\n")
cat(paste0(" mlr randomForest (rf) also includes caret (rf) "),sep="\n")
## mlr randomForest (rf) also includes caret (rf)
cat(paste0(),sep="\n")
cat(paste0(" Model 5. Model accuracy for mlr randomForest is : ", acc_ranFor ," %"),sep="\n") # 5 random Forest
## Model 5. Model accuracy for mlr randomForest is : 69.04 %
cat(paste0(" Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %"),sep="\n") # #6 rf cv
## Model 6. Model accuracy for mlr randomForest 5-fold cross-validation is : 69.58 %
cat(paste0(" Model 7. Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"),sep="\n") # 7 rf #resample
## Model 7. Model accuracy for mlr randomForest Resample is : 68.87 %
cat(paste0(" Model 13. Model accuracy for caret randomForest is : ", acc_caret ," %"),sep="\n") # 13 caret
## Model 13. Model accuracy for caret randomForest is : 70.09 %
cat(paste0(" Model 14. Model accuracy for caret randomForest 10-fold cross validation is : ",
acc_caret_cv ," %"),sep="\n") # 14 # caret
## Model 14. Model accuracy for caret randomForest 10-fold cross validation is : 70.68 %
cat(paste0(),sep="\n")
cat(paste0(" spFSR rpart and reduced columns spFSR rpart "),sep="\n")
## spFSR rpart and reduced columns spFSR rpart
cat(paste0(),sep="\n")
cat(paste0(" Model 9. Model accuracy for spFSR rpart is : ", acc_rf_small ," %"),sep="\n") # 9 spFSR rpart
## Model 9. Model accuracy for spFSR rpart is : 56.9 %
cat(paste0(" Model 11. Model accuracy for spFSR rpart 4 columns is : ", acc_rp_smallspFSR ," %"),sep="\n") # 11 rpart #cv small
## Model 11. Model accuracy for spFSR rpart 4 columns is : 56.6 %
cat(paste0(" Model 12. Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : ",
acc_rf_smallspFSRresample ," %"),sep="\n") # 12 random Forest cv small
## Model 12. Model accuracy for spFSR rpart, 4 columns, 5-fold cross-validation is : 58.2 %
The best-performing framework for my particular data set appeared to be caret, followed closely by mlr. Numerous trials were undertaken, and it should be noted that on each rerun a new random subset was chosen and then tested for accuracy with each model.
There were occasional cases of overfitting, but few cases of underfitting. It should also be noted that I could not find a way to check whether caret was using the same subset as the other frameworks: at the partitioning stage I could see the structure of the caret subset, but not that of the mlr or spFSR subsets.
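One way to make sure every framework sees the same rows is to draw the split once, keep the row numbers, and hand those same numbers to each framework. The sketch below uses the built-in `iris` data as a stand-in for the wine data; the names `train_idx` and `test_idx` are illustrative.

```r
set.seed(1234)

# Stand-in data; in this report it would be the wine data frame
dat <- iris
n   <- nrow(dat)

# Draw the split once and keep the row numbers
train_idx <- sample(n, size = round(0.8 * n))
test_idx  <- setdiff(seq_len(n), train_idx)

# The same row numbers can then be reused everywhere, e.g.
#   mlr:   train(lrn, task, subset = train_idx)
#          predict(model, task = task, subset = test_idx)
#   other: plain data-frame indexing, as below
training <- dat[train_idx, ]
testing  <- dat[test_idx, ]

nrow(training)  # 120
nrow(testing)   # 30
```

With a fixed seed and a stored index vector, differences between frameworks can be attributed to the models rather than to different random subsets.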
At every stage of outlier removal there was a distinct lack of any emerging pattern; I can only assume from this that my data set was seriously noisy. Algorithm performances are listed below:
Lowest accuracy  -----> naive Bayes (mlr)      (55% acc or less)
                 -----> lda (mlr)              (55% - 61% acc)
                 -----> rpart (spFSR)          (58% - 62% acc)
                 -----> knn (spFSR)            (60% - 66% acc)
                 -----> knn (mlr)              (60% - 66% acc)
                 -----> random forest (mlr)    (68% - 71% acc)
Highest accuracy -----> random forest (caret)  (75% - 82% acc)
(Of note: in my last run mlr actually scored highest, although caret generally scored highest.)
NOTE: This was only a small test of no more than 40 runs, each with a new subset for testing, and should not be considered absolute; it is only what appeared to happen for my data set. Of interest, spFSR and mlr scored about the same accuracy for knn.
It would have been nice to get spFSR working correctly so that I could trial it against mlr and caret using the same algorithms. Personally, I expected similar algorithms to give similar results, so I am not sure why caret outperformed the others. One possible factor is that the caret figures are derived from a regression RMSE rather than from a classification accuracy, so they may not be measured on the same scale as the rest. I would also have liked to trial a multiple regression algorithm and a logarithmic algorithm, but my report was already almost double the expected size, so I decided to stop.
From all of the above, my go-to choices would be caret, then either mlr or spFSR.
So, what was all this about? Could I use any of these models to predict a semi-decent wine? I believe I could: most models achieved an accuracy of 60% or more, which for me means that of every three bottles I chose, hopefully at least two would be good or better, provided they had the required physicochemical properties. From a wine producer's standpoint, if I knew the coefficients for each physicochemical property from the best model and were able to alter one or more of those properties, I could produce better wines more often, and thus increase my profit margin.
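The "two out of three bottles" hope can be given a rough number. The sketch below assumes, purely for illustration, that each prediction is right with probability 0.60 and that bottles are independent of one another; neither assumption was tested in this report.

```r
p_correct <- 0.60  # assumed per-bottle chance that the model's call is right
n_bottles <- 3

# P(exactly 2 correct) + P(exactly 3 correct) under a binomial model
p_at_least_two <- sum(dbinom(2:3, size = n_bottles, prob = p_correct))
p_at_least_two     # 0.648
```

Under these assumptions, the chance of getting at least two good calls out of three is about 65%, so "hopefully" is the right word.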
ERRORS (the only error I could not overcome was in the spFSR feature selection step; an example is given below)
iter value st.dev num.ft best.value
spFSR could not get past this error, and because it prevented me from altering any spFSR parameters, the section was dropped from my assessment. I was therefore unable to use the feature selection part of spFSR.