This data set was obtained from, and is available at, the UCI Machine Learning Repository:

wine+quality

The repository holds two data sets, one for red and one for white samples of the Portuguese “Vinho Verde” wine. I have chosen to analyse the red variety.

As per the site's citation request:

Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

######################## Phase 2

For phase 2, machine learning will be used to find a model that can predict which chemical properties make a good wine, based on the data provided in Red Wine.csv.

Definition of physicochemical: of or pertaining to both physical and chemical properties, changes, and reactions; of or according to physical chemistry.

Wine quality in this data set is classified as follows: a score of “6” is Good, “7” is Good Plus, and “8” is Very Good.

The highest score in this data set is “8”.

For more information, read [Cortez et al., 2009].

Red Wine.csv contains the input variables (based on physicochemical tests) and the output variable as columns, in the order listed below.

Content: (Column Names)

1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
12 - quality (score between 0 and 10)

12 - quality <== this will be the target variable

  Objective: Use machine learning to determine which physicochemical
           properties make a wine 'good'

For a complete look at each model's accuracy, please see the conclusion
at the end of this report.

Overview

Almost 50% of this report is dedicated to data cleaning, normalization and structure assessment. Once I was satisfied that the data was sufficiently clean, I shuffled the rows to randomize their order before applying the following algorithms.

The algorithms trialled are:

1. mlr lda (linear discriminant analysis)

2. mlr naive Bayes

3. mlr k nearest neighbour and spFSR k nearest neighbour

4. mlr random forest, spFSR random forest and caret random forest

5. spFSR rpart

As part of this examination I also trialled feature reduction, using fewer columns to see whether this would increase model accuracy. To achieve this I wrote a small program that selects only those columns with a correlation of less than -0.3 or greater than 0.3.

The above produced 14 models, including those that use resampling and cross-validation.
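
The column-selection program is not shown in this part of the report. A minimal sketch of the idea in base R, assuming the correlation is measured between each feature and the target, could look like the following. The helper `select_by_cor`, the data frame `toy` and its values are all made up for illustration.

```r
# Hypothetical helper: keep the features whose correlation with the
# target is below -cutoff or above +cutoff.
select_by_cor <- function(df, target, cutoff = 0.3) {
  features <- setdiff(names(df), target)
  r <- sapply(df[features], function(col) cor(col, df[[target]]))
  names(r)[abs(r) > cutoff]
}

# Made-up data: 'alcohol' rises with quality, 'volatile.acidity' falls,
# and 'pH' has no linear relationship with it.
toy <- data.frame(
  alcohol          = c(9, 10, 11, 12, 13, 14),
  volatile.acidity = c(0.9, 0.8, 0.6, 0.5, 0.4, 0.2),
  pH               = c(3.3, 3.1, 3.2, 3.2, 3.1, 3.3),
  quality          = c(3, 4, 5, 6, 7, 8)
)

kept <- select_by_cor(toy, target = "quality")
print(kept)
```

Filtering on the absolute value keeps strongly negative predictors as well as strongly positive ones, which is what the "less than -0.3 or greater than 0.3" rule asks for.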

There are three sections to this study:

Section 1. Data Preparation

Section 2. Algorithm Application

Section 3. Summary

# rfNews()
# libraries 

library(spFSR)

## Loading required package: mlr

## Loading required package: ParamHelpers

## Loading required package: parallelMap

## Loading required package: parallel

## Loading required package: tictoc

library(randomForest) # for our model

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

library(tidyverse)    # general utility functions

## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::combine()  masks randomForest::combine()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()

library(Metrics)      # handy evaluation functions
library(rpart)
library(readr)
library(dplyr)
library(mlr)
library(corrr)

knitr::opts_chunk$set(echo = TRUE)

Wine <- read.csv("Wine.csv", colClasses = "numeric")
summary(Wine)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

View this data set

par(mar=c(5,10,4,2)+.1)      
boxplot(Wine, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=heat.colors(12),cex.axis = 0.8 )#, pos=2 )     ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Section 1. Data Preparation

The following function will be used to remove outliers

## Calculate the boundaries for a column using the 1.5 * IQR rule,
## then replace any value outside those boundaries with NA

remove_outliers <- function(x, na.rm = TRUE, ...) 
{
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

## use the above function to flag the outliers in each column as NA

fixed.acidity <-  remove_outliers(Wine$fixed.acidity)
volatile.acidity <-  remove_outliers(Wine$volatile.acidity)
citric.acid <-  remove_outliers(Wine$citric.acid)
residual.sugar <-  remove_outliers(Wine$residual.sugar)

chlorides <-  remove_outliers(Wine$chlorides)
free.sulfur.dioxide <-  remove_outliers(Wine$free.sulfur.dioxide)
total.sulfur.dioxide <-  remove_outliers(Wine$total.sulfur.dioxide)
density <-  remove_outliers(Wine$density)

pH <-  remove_outliers(Wine$pH)
sulphates <-  remove_outliers(Wine$sulphates)
alcohol <-  remove_outliers(Wine$alcohol)
quality <-  (Wine$quality)   ##  target variable: kept as-is, no outlier removal

##  Combine all columns back into a Data Frame
Wine1 <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
                          residual.sugar,chlorides,free.sulfur.dioxide,
                          total.sulfur.dioxide,density,pH,
                          sulphates,alcohol,quality)

##   Check for anomalies and missing values 
sum(is.na(Wine1))

## [1] 573
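
To make the 1.5 * IQR step concrete, here is a small worked sketch on made-up data (`toy` and its values are hypothetical). It applies the same rule to every feature column in one call and leaves the target alone. It also shows that `sum(is.na())` counts flagged cells, not rows, so the number of rows later dropped by `na.omit()` can be smaller than that count.

```r
# The same 1.5 * IQR rule as remove_outliers() in this report.
remove_outliers <- function(x, na.rm = TRUE, ...) {
  qnt <- quantile(x, probs = c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

# Made-up data: one extreme value in each feature column.
toy <- data.frame(
  sugar   = c(1, 2, 3, 4, 5, 100),
  acid    = c(-50, 7, 8, 9, 10, 11),
  quality = c(5, 5, 6, 6, 7, 5)
)

features <- setdiff(names(toy), "quality")   # leave the target untouched
toy[features] <- lapply(toy[features], remove_outliers)

n_missing <- sum(is.na(toy))   # one NA per flagged cell
cleaned <- na.omit(toy)        # drops every row that holds an NA
print(n_missing)
print(nrow(cleaned))
```

For `sugar` the quartiles are 2.25 and 4.75, so the upper boundary is 4.75 + 1.5 * 2.5 = 8.5 and the value 100 is flagged.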

#sum(is.na(Wine1r))

Remove NA values and sum check

##   Remove every row that contains a missing value,
##   then keep only the numeric columns
Wine1a <- as.data.frame(Wine1 %>%
                         na.omit() %>% 
                         select_if(is.numeric))

##   Check for anomalies and missing values 
sum(is.na(Wine1a))

## [1] 0

#str(Wine1a)

View the new data set

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine1a, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=heat.colors(12),cex.axis = 0.8 )#, pos=2 )     ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Apply the outlier removal function again (2nd time)

## use the above function to remove outliers from each column ***(for a 2nd time)***

fixed.acidity <-  remove_outliers(Wine1a$fixed.acidity)
volatile.acidity <-  remove_outliers(Wine1a$volatile.acidity)
citric.acid <-  remove_outliers(Wine1a$citric.acid)
residual.sugar <-  remove_outliers(Wine1a$residual.sugar)

chlorides <-  remove_outliers(Wine1a$chlorides)
free.sulfur.dioxide <-  remove_outliers(Wine1a$free.sulfur.dioxide)
total.sulfur.dioxide <-  remove_outliers(Wine1a$total.sulfur.dioxide)
density <-  remove_outliers(Wine1a$density)

pH <-  remove_outliers(Wine1a$pH)
sulphates <-  remove_outliers(Wine1a$sulphates)
alcohol <-  remove_outliers(Wine1a$alcohol)
quality <-  (Wine1a$quality)   ##  target variable: kept as-is, no outlier removal
#quality <-  ifelse(Wine$quality <= 5 , 0 , 
#                   ifelse(Wine$quality > 5, 1,0))  ## Change quality to binary

##  Combine all columns back into a Data Frame
Wine1b <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
                          residual.sugar,chlorides,free.sulfur.dioxide,
                          total.sulfur.dioxide,density,pH,
                          sulphates,alcohol,quality)

##   Check for anomalies and missing values 
sum(is.na(Wine1b))

## [1] 141

Remove NA values and sum check

##   Remove every row that contains a missing value,
##   then keep only the numeric columns
Wine1c <- as.data.frame(Wine1b %>%
                         na.omit() %>% 
                         select_if(is.numeric))


##   Check for anomalies and missing values 
sum(is.na(Wine1c))

## [1] 0

#str(Wine1c)

View the new data set

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine1c, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=heat.colors(12),cex.axis = 0.8 )#, pos=2 )     ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Normalize this newest data set

#normalize the variables
Wine1d <- normalizeFeatures(Wine1c,method = "standardize")
#head(Wine1d)
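
As I understand it, `method = "standardize"` centers each numeric column on its mean and scales it to unit standard deviation. A minimal base-R sketch of that transformation on made-up values follows; `toy` is hypothetical and `scale()` stands in for `normalizeFeatures()`. In this sketch only the feature is transformed and the target column is left as raw scores, which is a design choice of the example rather than what the report's code does.

```r
# Made-up data with one feature and the target.
toy <- data.frame(
  alcohol = c(9, 10, 11, 12),
  quality = c(5, 6, 6, 7)
)

# Standardize the feature only: subtract the mean, divide by the
# standard deviation. The target stays on its original scale.
z <- toy
z$alcohol <- as.numeric(scale(toy$alcohol))

# The same calculation written out by hand.
by_hand <- (toy$alcohol - mean(toy$alcohol)) / sd(toy$alcohol)
print(z)
```

After this step the feature has mean 0 and standard deviation 1, so boxplots of different columns share one scale.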

View the new data set and check for outliers

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine1d, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=heat.colors(12),cex.axis = 0.8 )#, pos=2 )     ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Apply my outlier removal function again (3rd time). This pass starts from Wine1c, not from the normalized data set above.

## use the above function to remove outliers from each column ***(for a 3rd time)***

fixed.acidity <-  remove_outliers(Wine1c$fixed.acidity)
volatile.acidity <-  remove_outliers(Wine1c$volatile.acidity)
citric.acid <-  remove_outliers(Wine1c$citric.acid)
residual.sugar <-  remove_outliers(Wine1c$residual.sugar)

chlorides <-  remove_outliers(Wine1c$chlorides)
free.sulfur.dioxide <-  remove_outliers(Wine1c$free.sulfur.dioxide)
total.sulfur.dioxide <-  remove_outliers(Wine1c$total.sulfur.dioxide)
density <-  remove_outliers(Wine1c$density)

pH <-  remove_outliers(Wine1c$pH)
sulphates <-  remove_outliers(Wine1c$sulphates)
alcohol <-  remove_outliers(Wine1c$alcohol)
quality <-  (Wine1c$quality)   ##  target variable: kept as-is, no outlier removal

##  Combine all columns back into a Data Frame
Wine1e <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
                          residual.sugar,chlorides,free.sulfur.dioxide,
                          total.sulfur.dioxide,density,pH,
                          sulphates,alcohol,quality)

##   Check for anomalies and missing values 
sum(is.na(Wine1e))

## [1] 81

#sum(is.na(Wine1r))

Remove NA values and sum check

##   Remove every row that contains a missing value,
##   then keep only the numeric columns
Wine1f <- as.data.frame(Wine1e %>%
                         na.omit() %>% 
                         select_if(is.numeric))

##   Check for anomalies and missing values 
sum(is.na(Wine1f))

## [1] 0

#str(Wine1f)

View the new data set

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine1f, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=heat.colors(12),cex.axis = 0.8 )#, pos=2 )       ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Normalize this newest data set

#normalize the variables
Wine2 <- normalizeFeatures(Wine1f,method = "standardize")
#head(Wine2)

View the new data set, and look for outliers

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine2, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=rainbow(12),cex.axis = 0.8 )#, pos=2 )          ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

This part shows the selected columns that continue to have outliers

par(mfrow = c(3,1))
##Clean up specific columns in Wine2
J <- boxplot(Wine2$total.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "total.sulfur.dioxide",cex.lab=1.5)
K <- boxplot(Wine2$free.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "free.sulfur.dioxide",cex.lab=1.5)
L <- boxplot(Wine2$fixed.acidity,las = 1,horizontal = TRUE, col = 'purple',xlab = "fixed.acidity",cex.lab=1.5)

par(mfrow = c(1,1))

Find the lower and upper limits for the three selected columns

par(mfrow = c(1,3))
mytable <- J$stats
mytable1 <- K$stats
mytable2 <- L$stats

colnames(mytable)<-J$names
colnames(mytable1)<-K$names
colnames(mytable2)<-L$names
rownames(mytable)<-c('min','lower quartile','median','upper quartile','max')
rownames(mytable1)<-c('min','lower quartile','median','upper quartile','max')
rownames(mytable2)<-c('min','lower quartile','median','upper quartile','max')

mytable

##                          
## min            -1.4738585
## lower quartile -0.7698727
## median         -0.1978841
## upper quartile  0.5501009
## max             2.4860621

mytable1

##                          
## min            -1.6553630
## lower quartile -0.8063988
## median         -0.1999958
## upper quartile  0.5276878
## max             2.4681775

mytable2

##                          
## min            -2.1642912
## lower quartile -0.7034142
## median         -0.2420846
## upper quartile  0.6036863
## max             2.5258929

par(mfrow = c(1,1))

Then cap the outliers

# Cap the values that lie beyond the threshold for these variables in the Wine2 data set
Wine2 <- capLargeValues(Wine2,cols = c("total.sulfur.dioxide"),threshold = 2.5)
Wine2 <- capLargeValues(Wine2,cols = c("free.sulfur.dioxide"),threshold = 2.5)
Wine2 <- capLargeValues(Wine2,cols = c("fixed.acidity"),threshold = 2.55)
#   M <- boxplot(Wine2$sulphates,las = 1,horizontal = TRUE, col = 'purple') # check boxplot has changed
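
Capping differs from the earlier NA-based removal in that every row is kept and only the extreme values change. A base-R sketch of the idea follows, assuming the default behaviour of `capLargeValues()` is to replace entries whose absolute value exceeds `threshold` with plus or minus `threshold`. The helper `cap_abs` is hypothetical and not part of mlr, and the values are made up.

```r
# Hypothetical stand-in for capping at +/- threshold.
cap_abs <- function(x, threshold) pmax(pmin(x, threshold), -threshold)

# Made-up standardized values with one large entry at each end.
z <- c(-3.1, -1.2, 0.0, 2.4, 2.9)
capped <- cap_abs(z, threshold = 2.5)
print(capped)
```

The two extreme entries are pulled in to -2.5 and 2.5 while the other three values are unchanged, so the vector keeps its length.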

View the new columns, and look for outliers

par(mfrow = c(3,1))
##Clean up specific columns in Wine2
J <- boxplot(Wine2$total.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "total.sulfur.dioxide",cex.lab=1.5)
K <- boxplot(Wine2$free.sulfur.dioxide,las = 1,horizontal = TRUE, col = 'purple',xlab = "free.sulfur.dioxide",cex.lab=1.5)
L <- boxplot(Wine2$fixed.acidity,las = 1,horizontal = TRUE, col = 'purple',xlab = "fixed.acidity",cex.lab=1.5)

par(mfrow = c(1,1))

Check the result visually with all columns back together. There are still outliers, but I’m OK with this data set at this point.

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine2, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  I have used this as a way
        col=rainbow(12),cex.axis = 0.8 )#, pos=2 )      ##  to indicate any outliers

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Section 2. Algorithm Application

Shuffle up the cleaned data set and check row numbers are randomly chosen

# Shuffle up the newest Data Frame 
Wine4 <- Wine2[ sample(nrow(Wine2)),]
head(Wine4) ## Check the rows are shuffled

##     fixed.acidity volatile.acidity citric.acid residual.sugar   chlorides
## 408    -0.7803025        0.7758618  -1.2469936    -0.06959115 -0.88772003
## 626    -0.7034142       -1.1163220   0.2606776    -0.33681031  0.37250749
## 72     -1.3954086        0.6537854   0.4346397    -1.13846779  0.84509281
## 901    -0.6265259       -0.4449019  -0.4931580    -0.33681031  0.05745061
## 95     -0.8571907       -0.2007492  -0.7830947     0.46484716 -0.25760627
## 18     -1.3185203       -0.8111311  -0.4351706    -1.94012526  0.21497905
##     free.sulfur.dioxide total.sulfur.dioxide    density          pH
## 408         -0.80639880           -0.6378753  0.5270481  1.10081088
## 626          1.98305506            1.2540868  0.3496416  0.93905035
## 72           0.04256542            1.0780903  0.3223483  1.01993062
## 901         -0.56383759           -0.5058780 -0.2303410 -0.35503393
## 95          -0.32127639           -0.4178797 -0.3599842  0.69640955
## 18          -0.44255699           -0.7258735 -0.6329172  0.04936741
##      sulphates    alcohol    quality
## 408  0.4400627 -0.9040726  0.4676894
## 626 -1.3607812 -1.0083703 -0.8610351
## 72   1.5774379 -1.1126680 -0.8610351
## 901 -1.0764374 -0.4868818  0.4676894
## 95  -0.4129686 -0.1739887  0.4676894
## 18  -0.6025311 -1.1126680 -0.8610351

#str(Wine4)  ## Check the structure hasn't changed

# A visual check that the rows are in random order:
#  the row numbers are the first column of the output above
par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Preparation of shuffled data set for use in mlr , spFSR

Wine5 <- as.data.frame(Wine4,row.names = NULL, optional = FALSE)
#str(Wine5)
#head(Wine5) ## Check the rows are shuffled 

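## quality was standardized along with the features, so the factor levels
## are z-scores (e.g. -0.861 is a raw score of 5 and 0.468 is a raw score of 6)
Wine5$quality <-  as.factor(Wine5$quality)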
#str(Wine5)  #check that this has been changed to factor

## 1) Define the task
## Specify the type of analysis (e.g. classification) and provide data and response variable
task = makeClassifTask(data = Wine5, target = "quality")

############# Model 1 lda : linear discriminant analysis mlr

## 2) Define the learner
## Choose a specific algorithm (e.g. lda :  linear discriminant analysis)
#K Nearest Neighbour  
#classif.featureless  
#classif.knn
#classif.lda  
#classif.ada
#classif.AdaBag
#classif.randomForest
#classif.rda 
#classif.ada 
#classif.bst
lrn = makeLearner("classif.lda")                                  ###############  Model 1

set.seed(1234)

n = nrow(Wine5)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n


## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)


## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.4111675 0.5888325

acc_lda  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)

#cat(paste0("   Model accuracy for mlr classif.lda is : ", acc_lda ," %"))  #  1  lda
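
The two measures are complements: mmce is the share of wrong predictions and acc is the share of correct ones, which is why accuracy is computed as 1 - mmce in this report. A small base-R sketch on made-up labels (the vectors `truth` and `response` are hypothetical):

```r
# Made-up true and predicted quality labels for eight wines.
truth    <- c("5", "5", "6", "6", "7", "5", "6", "6")
response <- c("5", "6", "6", "6", "6", "5", "5", "6")

mmce_by_hand <- mean(truth != response)   # share of wrong predictions
acc_by_hand  <- mean(truth == response)   # share of correct predictions

# Same conversion to a percentage as used for acc_lda.
acc_percent <- round((1 - mmce_by_hand) * 100, 2)
print(c(mmce = mmce_by_hand, acc = acc_by_hand))
```

Three of the eight predictions are wrong, so mmce is 0.375 and accuracy is 62.5 %.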

############# Model 2 lda : linear discriminant analysis cross validation mlr

#####----------cross validation for classif.lda--------everything below is for testing

##   use 5-fold cross-validation:

## Resample description: cross-validation with 5 iterations.
rdesc = makeResampleDesc("CV", iters = 5)
rdesc

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Calculate the performance
r = resample("classif.lda",task , rdesc)  #bh.task    regr.lm

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3826531

## [Resample] iter 2:    0.3756345

## [Resample] iter 3:    0.4234694

## [Resample] iter 4:    0.4517766

## [Resample] iter 5:    0.3857868

##

## Aggregated Result: mmce.test.mean=0.4038641

##

## Resample Result
## Task: Wine5
## Learner: classif.lda
## Aggr perf: mmce.test.mean=0.4038641
## Runtime: 0.0480461

#mmce.test.mean
r$aggr

## mmce.test.mean 
##      0.4038641
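
The aggregated value is simply the mean of the per-fold errors. The sketch below checks that arithmetic with the five fold values printed in the resample output, then shows one hypothetical way of assigning 983 rows (the row count reported in the prediction output) to five folds in base R. The seed and the fold assignment are illustrative only and will not reproduce mlr's own split.

```r
# Per-fold mmce values copied from the resample() output of this report.
fold_mmce <- c(0.3826531, 0.3756345, 0.4234694, 0.4517766, 0.3857868)
aggregated <- mean(fold_mmce)
print(round(aggregated, 7))

# Hypothetical 5-fold assignment: each row gets a fold number 1..5,
# shuffled so that the folds are random and nearly equal in size.
set.seed(1)
n <- 983
fold_id <- sample(rep(1:5, length.out = n))
fold_sizes <- table(fold_id)
print(fold_sizes)
```

Each row serves as test data exactly once, so every observation contributes to the error estimate.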

                                                              ###############  Model 2

pred = getRRPredictions(r)
#pred
list(pred)

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.00
##   id              truth           response iter  set
## 1  8 -0.861035077041974 -0.861035077041974    1 test
## 2 17 -0.861035077041974 -0.861035077041974    1 test
## 3 34  0.467689382506315  0.467689382506315    1 test
## 4 38 -0.861035077041974 -0.861035077041974    1 test
## 5 51 -0.861035077041974 -0.861035077041974    1 test
## 6 54 -0.861035077041974  0.467689382506315    1 test
## ... (#rows: 983, #cols: 5)

#performance(pred, measures = list(mmce))
acc_lda_cv  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"))  #  2  lda cv

Confusion Matrix

It is probably more confusing to have added this in. I think it would be more useful had I used fewer features.

calculateConfusionMatrix(pred)    #  <-- works fine, but the output takes up a lot of space

##                     predicted
## true                 -3.51848399613855 -2.18975953659026
##   -3.51848399613855                  0                 0
##   -2.18975953659026                  0                 0
##   -0.861035077041974                 0                 0
##   0.467689382506315                  0                 1
##   1.7964138420546                    0                 0
##   3.12513830160289                   0                 0
##   -err.-                             0                 1
##                     predicted
## true                 -0.861035077041974 0.467689382506315 1.7964138420546
##   -3.51848399613855                   2                 0               0
##   -2.18975953659026                  23                 5               0
##   -0.861035077041974                285               120               4
##   0.467689382506315                 134               252              39
##   1.7964138420546                     5                57              49
##   3.12513830160289                    0                 5               2
##   -err.-                            164               187              45
##                     predicted
## true                 3.12513830160289 -err.-
##   -3.51848399613855                 0      2
##   -2.18975953659026                 0     28
##   -0.861035077041974                0    124
##   0.467689382506315                 0    174
##   1.7964138420546                   0     62
##   3.12513830160289                  0      7
##   -err.-                            0    397

#  relative version of the matrix, left commented out
#conf.matrix = calculateConfusionMatrix(pred, relative = TRUE)
#conf.matrix
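
One way to read the matrix is to rebuild it and recover the accuracy from its diagonal. The sketch below copies the counts from the printed output. Relabelling the six standardized class labels as raw scores 3 to 8 is my assumption, based on their even spacing and the range of quality in the summary.

```r
scores <- as.character(3:8)   # assumed raw scores for the six class labels

# Rows are the true classes, columns the predicted classes.
cm <- matrix(
  c(0, 0,   2,   0,  0, 0,
    0, 0,  23,   5,  0, 0,
    0, 0, 285, 120,  4, 0,
    0, 1, 134, 252, 39, 0,
    0, 0,   5,  57, 49, 0,
    0, 0,   0,   5,  2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(true = scores, predicted = scores)
)

correct  <- sum(diag(cm))   # predictions on the diagonal are right
total    <- sum(cm)
accuracy <- correct / total

# Share of each true class that was predicted correctly.
per_class_recall <- diag(cm) / rowSums(cm)
print(round(100 * accuracy, 2))
print(round(per_class_recall, 2))
```

The off-diagonal total, 983 - 586 = 397, matches the `-err.-` corner of the printed matrix. The per-class figures show that the rare classes at either end are never predicted correctly, which a single accuracy number hides.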

############# Model 3 naive bayes mlr

##---------------------------    try the algorithm naive bayes

## 2) Define the learner
## Choose a specific algorithm (here: naive Bayes)
lrn = makeLearner("classif.naiveBayes") #classif.lda  #classif.randomForest


## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)               
                                                              ###############  Model 3

## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.4314721 0.5685279

#performance(pred, measures = list(mmce))
acc_naive  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr naive bayes is : ", acc_naive ," %"))  #  3 naive bayes

############ Model 4 k nearest neighbour mlr

##---------------------------    try the algorithm nearest neighbour
lrn = makeLearner("classif.knn")

## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)                  
                                                            ###############  Model 4

## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.3857868 0.6142132

#performance(pred, measures = list(mmce))
acc_knn  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_knn
#cat(paste0("   Model accuracy mlr k nearest neighbour is : ", acc_knn ," %"))  #  4 knn

############# Model 5 randomForest mlr

##---------------------------    try the algorithm randomForest
lrn = makeLearner("classif.randomForest")

## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)                     ###############  Model 5


## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.3096447 0.6903553

#performance(pred, measures = list(mmce))
acc_ranFor  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr randomForest is : ", acc_ranFor ," %"))  #  5 random Forest

############# Model 6 classif.randomForest cross validation mlr

#####----------cross validation for classif.randomForest--------everything below is for testing

##   use 5-fold cross-validation:

## Resample description: cross-validation with 5 iterations.
rdesc = makeResampleDesc("CV", iters = 5)
rdesc

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Calculate the performance
r = resample("classif.randomForest",task , rdesc)  #bh.task    regr.lm

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.2690355

## [Resample] iter 2:    0.2639594

## [Resample] iter 3:    0.3316327

## [Resample] iter 4:    0.3418367

## [Resample] iter 5:    0.3147208

##

## Aggregated Result: mmce.test.mean=0.3042370

##

## Resample Result
## Task: Wine5
## Learner: classif.randomForest
## Aggr perf: mmce.test.mean=0.3042370
## Runtime: 2.5995

#mmce.test.mean
r$aggr                                                                     ###############  Model 6

## mmce.test.mean 
##       0.304237

#r$measures.test     ## same as r above

#r$pred
# Resampled Prediction for:
# Resample description: subsampling with 5 iterations and 0.80 split rate.
# Predict: test


pred = getRRPredictions(r)
#pred
list(pred)

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.01
##   id              truth           response iter  set
## 1  9 -0.861035077041974 -0.861035077041974    1 test
## 2 11  0.467689382506315  0.467689382506315    1 test
## 3 15 -0.861035077041974 -0.861035077041974    1 test
## 4 17 -0.861035077041974 -0.861035077041974    1 test
## 5 25 -0.861035077041974 -0.861035077041974    1 test
## 6 28 -0.861035077041974 -0.861035077041974    1 test
## ... (#rows: 983, #cols: 5)

#performance(pred, measures = list(mmce))
acc_ranFor_cv  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %"))   #6  #random Forest cv

############# Model 7 randomForest resample using holdout mlr

###--------------    Resample
par(mfrow = c(1,1))
## Make predictions on both training and test sets
rdesc = makeResampleDesc("Holdout", predict = "both")

# resample to see if our errors remain in the same ballpark
r = resample("classif.randomForest", task, rdesc, show.info = FALSE)
#r

predList = getRRPredictionList(r)                                      ###############  Model 7
predList

## $train
## $train$`1`
## Prediction: 655 observations
## predict.type: response
## threshold: 
## time: 0.01
##      id              truth           response
## 158 158 -0.861035077041974 -0.861035077041974
## 288 288    1.7964138420546    1.7964138420546
## 749 749 -0.861035077041974 -0.861035077041974
## 304 304    1.7964138420546    1.7964138420546
## 334 334 -0.861035077041974 -0.861035077041974
## 360 360 -0.861035077041974 -0.861035077041974
## ... (#rows: 655, #cols: 3)
## 
## 
## $test
## $test$`1`
## Prediction: 328 observations
## predict.type: response
## threshold: 
## time: 0.01
##      id              truth           response
## 400 400 -0.861035077041974 -0.861035077041974
## 145 145  0.467689382506315 -0.861035077041974
## 300 300  0.467689382506315  0.467689382506315
## 861 861 -0.861035077041974 -0.861035077041974
## 450 450    1.7964138420546  0.467689382506315
## 408 408 -0.861035077041974 -0.861035077041974
## ... (#rows: 328, #cols: 3)

#Below we calculate the mean misclassification error (mmce) on the training and the test data sets.
mmceTrainMean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 5, predict = "both")
r = resample("classif.randomForest", task, rdesc, measures = list(mmce, mmceTrainMean))#classif.rpart

## Resampling: cross-validation

## Measures:             mmce.train   mmce.test

## [Resample] iter 1:    0.0000000    0.2959184

## [Resample] iter 2:    0.0000000    0.3502538

## [Resample] iter 3:    0.0000000    0.3299492

## [Resample] iter 4:    0.0000000    0.2944162

## [Resample] iter 5:    0.0000000    0.2857143

##

## Aggregated Result: mmce.test.mean=0.3112504,mmce.train.mean=0.0000000

##

#r
M <- unlist(r$aggr)
#M[1]
acc_ranFor_resample  <-  round( (1-M[1])*100,2)
#acc_ranFor_resample  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"))  #  7  random #Forest resample

############# Model 8 k nearest neighbour using spFSR

#############-------------- using  spFSR 

#head(Wine5)    ## a before check  use wine4 
data <- Wine5
#head(data)     ## an after check

## last column is the target variable Y
Y <- data %>% pull(quality)   

## other columns make up the feature matrix
X <- data %>% select(-quality)

## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")

# View(my.task)  ## have a wee peek to see whats what
# str(my.task)

## set the performance measure
my.measure <- mmce ## mean misclassification error

## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.knn", k = 1)

## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")

################################################                         ###############  Model 8
### compute performance with full set of features

my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3)    ###folds = 5  <- changed the folds  3

repcv.full <- resample(my.wrapper, 
                       my.task, 
                       my.rdesc,
                       measures = my.measure)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3902439

## [Resample] iter 2:    0.4268293

## [Resample] iter 3:    0.3302752

## [Resample] iter 4:    0.3841463

## [Resample] iter 5:    0.4359756

## [Resample] iter 6:    0.3975535

## [Resample] iter 7:    0.4298780

## [Resample] iter 8:    0.3792049

## [Resample] iter 9:    0.3871951

##

## Aggregated Result: mmce.test.mean=0.3957002

##

result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))

## Repeated CV error with full set of features = 39.6

acc_spFSR  <-  round( 100 - (round(result.full.mean, 3))*100)
#acc_rf_small
#cat(paste0("   Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"))  #   8  spFSR knn
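To make the "RepCV" description concrete, here is a small base R sketch of how 3 folds repeated 3 times produce the nine iterations listed above, and how the aggregated error follows from them. `repcv_test_sets` is a hypothetical helper written for this illustration, not an mlr function; the nine error values are copied from the output above.

```r
set.seed(1234)

# Test-set row indices for repeated k-fold cross-validation:
# every repetition reshuffles the rows and cuts them into `folds` parts.
# (repcv_test_sets is a hypothetical helper, not part of mlr)
repcv_test_sets <- function(n, folds = 3, reps = 3) {
  sets <- list()
  for (r in seq_len(reps)) {
    fold <- sample(rep(seq_len(folds), length.out = n))
    for (f in seq_len(folds)) {
      sets[[length(sets) + 1]] <- which(fold == f)
    }
  }
  sets
}

test_sets <- repcv_test_sets(n = 983, folds = 3, reps = 3)
length(test_sets)           # number of resampling iterations
sapply(test_sets, length)   # test-set size in each iteration

# The nine per-iteration errors, copied from the output above
iter_mmce <- c(0.3902439, 0.4268293, 0.3302752, 0.3841463, 0.4359756,
               0.3975535, 0.4298780, 0.3792049, 0.3871951)
mean(iter_mmce)                               # aggregated mmce.test.mean
round(100 - 100 * round(mean(iter_mmce), 3))  # same rounding as acc_spFSR
```

With 983 rows the folds hold 328, 328 and 327 rows, which is why the per-iteration errors are multiples of 1/328 or 1/327. Because `acc_spFSR` is rounded to a whole number, it is reported as 60 rather than with two decimals like the other models.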

############# Model 9 rpart using spFSR

## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.rpart")  #, k = 1)

## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")

################################################
### compute performance with full set of features

my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3)    ###folds = 5  <- changed the folds  3

repcv.full <- resample(my.wrapper,                                 ###############  Model 9
                       my.task, 
                       my.rdesc,
                       measures = my.measure)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3932927

## [Resample] iter 2:    0.4678899

## [Resample] iter 3:    0.4085366

## [Resample] iter 4:    0.4281346

## [Resample] iter 5:    0.4207317

## [Resample] iter 6:    0.4298780

## [Resample] iter 7:    0.4298780

## [Resample] iter 8:    0.4085366

## [Resample] iter 9:    0.4923547

##

## Aggregated Result: mmce.test.mean=0.4310259

##

result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))

## Repeated CV error with full set of features = 43.1

acc_rf_small  <-  round(( 1 - (result.full.mean))*100,2)
# acc_rf_small  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_rf_small
#cat(paste0("   Model accuracy for spFSR rpart is : ", acc_rf_small ," %"))  #    9  spFSR  rpart

Reduced columns (features) section

#Wine %>% correlate() %>% focus(quality)    ##  <- used to compare to the original unaltered data set 
Wine4 %>% correlate() %>% focus(quality)

## # A tibble: 11 x 2
##    rowname               quality
##    <chr>                   <dbl>
##  1 fixed.acidity         0.146  
##  2 volatile.acidity     -0.359  
##  3 citric.acid           0.246  
##  4 residual.sugar        0.0445 
##  5 chlorides            -0.147  
##  6 free.sulfur.dioxide   0.00573
##  7 total.sulfur.dioxide -0.184  
##  8 density              -0.212  
##  9 pH                   -0.109  
## 10 sulphates             0.447  
## 11 alcohol               0.486

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

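Based on the above correlations I will select the columns whose correlation with quality is below -0.3 or above 0.3 (volatile.acidity, sulphates and alcohol), bind them together with quality, and then use this new data set to see whether I can improve my model accuracy. The next piece of output shows only the columns selected for binding.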

## [1] "   "

## [1] "volatile.acidity"

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "sulphates"

## [1] "alcohol"
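The selection program itself is not shown in the report, so the following is only a sketch of one way to do it in base R, assuming the rule stated above (keep a column when its correlation with quality is below -0.3 or above 0.3). The toy values and the helper name `select_by_correlation` are made up for illustration.

```r
# Toy data standing in for the cleaned wine data (values are invented)
toy <- data.frame(
  alcohol   = c(9.4, 9.8, 10.5, 11.2, 12.0, 12.8),
  sulphates = c(0.50, 0.55, 0.62, 0.60, 0.75, 0.80),
  pH        = c(3.3, 3.5, 3.2, 3.4, 3.3, 3.4),
  quality   = c(5, 5, 6, 6, 7, 7)
)

# Keep the features whose absolute correlation with the target exceeds the
# threshold, and bind them back together with the target column
select_by_correlation <- function(df, target, threshold = 0.3) {
  features <- setdiff(names(df), target)
  cors <- sapply(df[features], function(col) cor(col, df[[target]]))
  keep <- names(cors)[abs(cors) > threshold]
  df[, c(keep, target), drop = FALSE]
}

reduced <- select_by_correlation(toy, target = "quality")
names(reduced)
```

In this toy example pH has a correlation of about -0.21 with quality and is dropped, while alcohol and sulphates are kept.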

View the new data set

##   Visualize new data set
par(mar=c(5,10,4,2)+.1) 
boxplot(Wine8, horizontal = TRUE, las = 1 , outline=TRUE ,  ##  outline = TRUE draws any
        col=rainbow(8),cex.axis = 0.8 )#, pos=2 )     ##  outliers as individual points

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Remove NA values and sum check

#set.seed(1234)
Wine8 <- as.data.frame(Wine8 %>%         ## omit missing values
                          na.omit() %>% 
                          select_if(is.numeric))
#head(Wine8)
##   Check for anomalies and missing values 
sum(is.na(Wine8))

## [1] 0

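# Normalize the variables.
# Note: normalizeFeatures() standardizes every numeric column, quality included,
# so the factor levels shown below are z-scores rather than the original
# quality scores. Passing target = "quality" would exclude it from this step.
Wine8 <- normalizeFeatures(Wine8,method = "standardize")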
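Standardizing the target together with the features is what turns quality scores into values such as -0.861 in the output. As an illustration only, the base R sketch below standardizes the feature columns and leaves the target alone. The toy values and the helper name `standardize_features` are assumptions, not part of the report's code.

```r
# Toy data standing in for the reduced wine data (values are invented)
toy <- data.frame(
  volatile.acidity = c(0.70, 0.88, 0.76, 0.28, 0.66, 0.60),
  alcohol          = c(9.4, 9.8, 9.8, 9.8, 9.4, 10.0),
  quality          = c(5, 5, 5, 6, 5, 6)
)

# Centre each feature on 0 and scale it to unit standard deviation,
# leaving the target column exactly as it was
standardize_features <- function(df, target) {
  features <- setdiff(names(df), target)
  df[features] <- lapply(df[features], function(col) (col - mean(col)) / sd(col))
  df
}

scaled <- standardize_features(toy, target = "quality")
round(colMeans(scaled[c("volatile.acidity", "alcohol")]), 10)  # both 0
levels(as.factor(scaled$quality))                              # still "5" "6"
```

Keeping the target on its original scale makes the later confusion matrices readable, because the class labels are then the quality scores themselves.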

############# Model 10 knn with reduced column Number (4 columns)

Wine8$quality <-  as.factor(Wine8$quality)
str(Wine8)

## 'data.frame':    983 obs. of  4 variables:
##  $ volatile.acidity: num  0.776 -1.116 0.654 -0.445 -0.201 ...
##  $ sulphates       : num  0.44 -1.361 1.577 -1.076 -0.413 ...
##  $ alcohol         : num  -0.904 -1.008 -1.113 -0.487 -0.174 ...
##  $ quality         : Factor w/ 6 levels "-3.51848399613855",..: 4 3 3 4 4 3 5 3 3 2 ...

set.seed(1234)

n = nrow(Wine8)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n

## NOTE: 'task' below is the task object created earlier in the report; no new
## task is built from Wine8 in this section (the resample output further down
## reports "Task: Wine5"), so check that it holds the reduced columns before
## relying on this result.

##---------------------------    try the algorithm k nearest neighbour
lrn = makeLearner("classif.knn")   #  classif.knn    classif.randomForest

## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)                    ###############  Model 10


## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.3857868 0.6142132

N <- unlist(performance(pred))
#N[1]

acc_knn_small  <-  round(( 1 - (N[1]))*100,2)
#acc_knn_small
#cat(paste0("   Model accuracy for k nearest neighbour 4 columns is : ", acc_knn_small ," %"))  #  10  knn  small

############# Model 11 rpart with reduced column Number (4 columns)

#############-------------- using  spFSR 

#head(Wine5)    ## a before check  use wine4 
data <- Wine8
#head(data)     ## an after check

## last column is the target variable Y
Y <- data %>% pull(quality)   

## other columns make up the feature matrix
X <- data %>% select(-quality)

## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")

# View(my.task)  ## have a wee peek to see whats what
# str(my.task)

## set the performance measure
my.measure <- mmce ## mean misclassification error

## set the wrapper classification algorithm
## my.wrapper <- makeLearner("classif.knn", k = 1)   # classif.randomForest  classif.knn

## you can try other algorithms as well
 my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")

################################################
### compute performance with full set of features

my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3)    ###folds = 5  <- changed the folds  3

repcv.full <- resample(my.wrapper, 
                       my.task,                                  ###############  Model 11
                       my.rdesc,
                       measures = my.measure)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.4342508

## [Resample] iter 2:    0.4176829

## [Resample] iter 3:    0.4512195

## [Resample] iter 4:    0.4495413

## [Resample] iter 5:    0.4207317

## [Resample] iter 6:    0.4268293

## [Resample] iter 7:    0.4725610

## [Resample] iter 8:    0.4006116

## [Resample] iter 9:    0.4359756

##

## Aggregated Result: mmce.test.mean=0.4343782

##

result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))

## Repeated CV error with full set of features = 43.4

acc_rp_smallspFSR <- (1 - (1 * round(result.full.mean, 3)) )* 100
#acc
#performance(pred, measures = list(my.measure, acc))
#cat("Current working dir: ", acc)
#cat(paste0("   Model accuracy spFSR rpart is : ", acc_rp_smallspFSR ," %"))   #  11  rpart  cv  small

############# Model 12 cross validation for rpart (4 columns)

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting
#####----------cross validation for classif.rpart--------everything below is for testing

##   use 5-fold cross-validation:
rdesc = makeResampleDesc("CV", iters = 5)
rdesc

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Calculate the performance
r = resample("classif.rpart",task , rdesc)  #bh.task    regr.lm   classif.randomForest

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3877551

## [Resample] iter 2:    0.4263959

## [Resample] iter 3:    0.3826531

## [Resample] iter 4:    0.4416244

## [Resample] iter 5:    0.4517766

##

## Aggregated Result: mmce.test.mean=0.4180410

##

## Resample Result
## Task: Wine5
## Learner: classif.rpart
## Aggr perf: mmce.test.mean=0.4180410
## Runtime: 0.065068

#mse.test.mean
r$aggr

## mmce.test.mean 
##       0.418041

#r$measures.test     ## same as r above

#r$pred
# Resampled Prediction for:
# Resample description: subsampling with 5 iterations and 0.80 split rate.
# Predict: test
                                                                  ###############  Model 12

pred = getRRPredictions(r)
#pred
list(pred)

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.00
##   id              truth           response iter  set
## 1  5  0.467689382506315  0.467689382506315    1 test
## 2  6 -0.861035077041974 -0.861035077041974    1 test
## 3 12 -0.861035077041974  0.467689382506315    1 test
## 4 24  0.467689382506315  0.467689382506315    1 test
## 5 29  0.467689382506315 -0.861035077041974    1 test
## 6 33  0.467689382506315 -0.861035077041974    1 test
## ... (#rows: 983, #cols: 5)

## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
## performance(pred, measures = list(mmce, acc))
acc_rf_smallspFSRresample <-  ( round(  (1 - r$aggr), 3)) *100

#cat(paste0("   Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : ", 
#           acc_rf_smallspFSRresample ," %"),sep="\n")  #  12 rpart cv small

My conclusion at this point is that my data is just too noisy. Even with a smaller data set to work with, the results are similar to all of the above. The upside here is that my confusion matrix is a bit easier to read. I do wonder whether it would be better still if the error (-err.-) row and column were removed for the next run.

##                     predicted
## true                 -3.51848399613855 -2.18975953659026
##   -3.51848399613855                  0                 0
##   -2.18975953659026                  0                 0
##   -0.861035077041974                 0                 0
##   0.467689382506315                  0                 0
##   1.7964138420546                    0                 0
##   3.12513830160289                   0                 0
##   -err.-                             0                 0
##                     predicted
## true                 -0.861035077041974 0.467689382506315 1.7964138420546
##   -3.51848399613855                   2                 0               0
##   -2.18975953659026                  21                 7               0
##   -0.861035077041974                290               112               7
##   0.467689382506315                 148               235              43
##   1.7964138420546                    13                51              47
##   3.12513830160289                    0                 5               2
##   -err.-                            184               175              52
##                     predicted
## true                 3.12513830160289 -err.-
##   -3.51848399613855                 0      2
##   -2.18975953659026                 0     28
##   -0.861035077041974                0    119
##   0.467689382506315                 0    191
##   1.7964138420546                   0     64
##   3.12513830160289                  0      7
##   -err.-                            0    411
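The accuracy can be read straight off the confusion matrix above. A minimal base R sketch, with the counts copied from the printed matrix; the row and column labels 3 to 8 are an assumption that the six standardized levels correspond, in order, to the original quality scores.

```r
# Labels 3..8 are an assumption: the six standardized levels taken in order
scores <- as.character(3:8)

# Counts copied from the confusion matrix above (rows = true, columns = predicted)
conf <- matrix(
  c(0, 0,   2,   0,  0, 0,
    0, 0,  21,   7,  0, 0,
    0, 0, 290, 112,  7, 0,
    0, 0, 148, 235, 43, 0,
    0, 0,  13,  51, 47, 0,
    0, 0,   0,   5,  2, 0),
  nrow = 6, byrow = TRUE,
  dimnames = list(true = scores, predicted = scores)
)

total      <- sum(conf)          # 983 predictions
correct    <- sum(diag(conf))    # 572 on the diagonal
accuracy   <- correct / total
error_rate <- 1 - accuracy       # 411 / 983, the bottom-right -err.- cell

round(100 * accuracy, 2)         # 58.19
round(diag(conf) / rowSums(conf), 3)  # share of each true class predicted correctly
```

The pooled error of 411 / 983 (about 0.418) agrees with the aggregated mmce above; the small difference in the later decimals arises because mlr averages the per-fold errors. The per-class line also shows that the model never predicts the three rarest classes.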

######## The caret section

############# Model 13 caret randomForest

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting
###############################################

#####       I couldn't load the caret library with the rest up top, as it has a 'train' call
#####       with the same name as mlr's. That is why it is loaded here: the caret 'train'
#####       overrides (masks) the mlr 'train' call

#---------Try a different approach-----------with the algorithm random forest
#--       and switch back to Wine4 (the cleaned version of the data set)

library(caret) # hyperparameter tuning

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## The following object is masked from 'package:mlr':
## 
##     train

set.seed(1234)

# train/test split P is proportion I'll need to increase for my data set 
training_indexs <- createDataPartition(Wine4$quality, p = .8, list = F)
training <- Wine4[training_indexs, ]
testing  <- Wine4[-training_indexs, ]

# get predictors
predictors <- training %>% select(-quality) %>% as.matrix()
output <- training$quality

#library(randomForest) # for our model
# train a random forest model
model <- randomForest(x = predictors, y = output,
                      ntree = 20) # number of trees
# as the number of trees are increased
# the % variance does decrease slightly

# check out the details
#model
par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

plot(model, col='red')

print(model)   ## summarize the model

## 
## Call:
##  randomForest(x = predictors, y = output, ntree = 20) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.599516
##                     % Var explained: 40.97

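# check out our model's root mean squared error (RMSE) on the held out test data;
# this is a regression forest, so lower is better
rmse(predict(model, testing), testing$quality)    #getting around 76.37035 % here usually

## [1] 0.7009149

# NOTE: 1 - (1 - RMSE) is just RMSE, so the value stored here is RMSE * 100.
# It is an error measure, not a classification accuracy, and is therefore not
# directly comparable with the (1 - mmce) accuracies of Models 1 to 12.
acc_caret <- round(rmse(predict(model, testing), testing$quality)*100, 2)       ###############  Model 13
#cat(paste0("   Model accuracy for caret randomForest is : ", acc_caret ," %"))   #  13  caret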

############# Model 14 caret randomForest cross validation

##----------Cross Validation  by adding a control to the model

#set up the control parameter to train with 10-fold cross validation in 3 repetitions:
control = trainControl(method="repeatedcv", number=10, repeats=3)

# Refit the random forest on the wine data.
# NOTE: trControl is an argument of caret::train(); randomForest() does not use
# it, so this fit is not cross-validated. To apply the control, a call such as
# caret::train(x = predictors, y = output, method = "rf", ntree = 20, trControl = control)
# would be needed instead.
model <- randomForest(x = predictors, y = output,
                      ntree = 20, trControl=control) # number of trees
                                                                            ###############  Model 14
#Check model for any improvements
rmse(predict(model, testing), testing$quality)

## [1] 0.7068028

# As for Model 13, this stores RMSE * 100 rather than an accuracy
acc_caret_cv <- round(rmse(predict(model, testing), testing$quality)*100, 2)
#cat(paste0("   Model accuracy for caret randomForest 10-fold cross validation is : ", acc_caret_cv ," %"))   #  14 # caret
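Because the forest in Models 13 and 14 is a regression forest, its RMSE and the classifiers' accuracy measure different things. The base R sketch below, on invented numbers, shows the two side by side. Rounding predictions to the nearest score is one possible way of getting a comparable accuracy; it is an assumption of this sketch, not something the report does, and it presumes quality is on its original integer scale.

```r
# Invented true scores and regression predictions, for illustration only
truth <- c(5, 6, 6, 7, 5, 6)
pred  <- c(5.2, 5.6, 6.4, 6.1, 5.9, 6.0)

# RMSE: typical size of the prediction error, in quality-score units
rmse_value <- sqrt(mean((pred - truth)^2))

# Accuracy after rounding each prediction to the nearest whole score
accuracy <- mean(round(pred) == truth)

round(rmse_value, 3)       # an error: lower is better
round(100 * accuracy, 2)   # a hit rate in percent: higher is better
```

Multiplying an RMSE by 100 does not turn it into a percentage of correct predictions: a worse model has a larger RMSE and would therefore appear to score higher.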

Section 3. Summary

############### Summary

cat(paste0("Model  1.  Model accuracy for mlr classif.lda is : ", acc_lda ," %"),sep="\n")  #  1  lda

## Model  1.  Model accuracy for mlr classif.lda is : 58.88 %

cat(paste0("Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"),sep="\n")  #  2 # lda cv

## Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.61 %

cat(paste0("Model  3.  Model accuracy for mlr naive bayes is : ", acc_naive ," %"),sep="\n")  #  3 naive bayes

## Model  3.  Model accuracy for mlr naive bayes is : 56.85 %

cat(paste0("Model  4.  Model accuracy for mlr k nearest neighbour is : ", acc_knn ," %"),sep="\n")  #  4 knn

## Model  4.  Model accuracy for mlr k nearest neighbour is : 61.42 %

cat(paste0("Model  5.  Model accuracy for mlr randomForest is : ", acc_ranFor ," %"),sep="\n")  #  5 random Forest

## Model  5.  Model accuracy for mlr randomForest is : 69.04 %

cat(paste0("Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %"),sep="\n") # #6 rf cv

## Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : 69.58 %

cat(paste0("Model  7.  Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"),sep="\n")  #  7  rf #resample

## Model  7.  Model accuracy for mlr randomForest Resample is : 68.87 %

cat(paste0("Model  8.  Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"),sep="\n")  #   8  spFSR knn

## Model  8.  Model accuracy for spFSR k nearest neighbour is : 60 %

cat(paste0("Model  9.  Model accuracy for spFSR rpart is : ", acc_rf_small ," %"),sep="\n")  #    9  spFSR  rpart

## Model  9.  Model accuracy for spFSR rpart is : 56.9 %

cat(paste0("Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : ", acc_knn_small ," %"),sep="\n")  #  10  #knn  small

## Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : 61.42 %

cat(paste0("Model 11.  Model accuracy for spFSR rpart 4 columns is : ", acc_rp_smallspFSR ," %"),sep="\n")   #  11  rpart  #cv  small

## Model 11.  Model accuracy for spFSR rpart 4 columns is : 56.6 %

cat(paste0("Model 12.  Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : ", 
           acc_rf_smallspFSRresample ," %"),sep="\n")  #  12 rpart cv small

## Model 12.  Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : 58.2 %

cat(paste0("Model 13.  Model accuracy for caret randomForest is : ", acc_caret ," %"),sep="\n")   #  13  caret

## Model 13.  Model accuracy for caret randomForest is : 70.09 %

cat(paste0("Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : ", 
           acc_caret_cv ," %"))   #  14 # caret

## Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : 70.68 %

##################################################################
cat(paste0(),sep="\n")

cat(paste0(),sep="\n")

cat(paste0(),sep="\n")

cat(paste0("  The section below is the above section, grouped together for easier comparison of Like vs Like "),sep="\n")

##   The section below is the above section, grouped together for easier comparison of Like vs Like

cat(paste0(),sep="\n")

cat(paste0(),sep="\n")

###################################################################
cat(paste0("        mlr lda (linear discriminant analysis)"),sep="\n")

##         mlr lda (linear discriminant analysis)

cat(paste0(),sep="\n")

cat(paste0(" Model  1.  Model accuracy for mlr classif.lda is : ", acc_lda ," %"),sep="\n")  #  1  lda

##  Model  1.  Model accuracy for mlr classif.lda is : 58.88 %

cat(paste0(" Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"),sep="\n")  #  2 # lda cv

##  Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.61 %

cat(paste0(),sep="\n")

cat(paste0("        mlr naive Bayes "),sep="\n")

##         mlr naive Bayes

cat(paste0(),sep="\n")

cat(paste0(" Model  3.  Model accuracy for mlr naive bayes is : ", acc_naive ," %"),sep="\n")  #  3 naive bayes

##  Model  3.  Model accuracy for mlr naive bayes is : 56.85 %

cat(paste0(),sep="\n")

cat(paste0("        mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn"),sep="\n")

##         mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn

cat(paste0(),sep="\n")

cat(paste0(" Model  4.  Model accuracy for mlr k nearest neighbour is : ", acc_knn ," %"),sep="\n")  #  4 knn

##  Model  4.  Model accuracy for mlr k nearest neighbour is : 61.42 %

cat(paste0(" Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : ", acc_knn_small ," %"),sep="\n")  #  10  #knn  small

##  Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : 61.42 %

cat(paste0(" Model  8.  Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"),sep="\n")  #   8  spFSR knn

##  Model  8.  Model accuracy for spFSR k nearest neighbour is : 60 %

cat(paste0(),sep="\n")

cat(paste0("        mlr randomForest (rf) also includes caret (rf)   "),sep="\n")

##         mlr randomForest (rf) also includes caret (rf)

cat(paste0(),sep="\n")

cat(paste0(" Model  5.  Model accuracy for mlr randomForest is : ", acc_ranFor ," %"),sep="\n")  #  5 random Forest

##  Model  5.  Model accuracy for mlr randomForest is : 69.04 %

cat(paste0(" Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : ", acc_ranFor_cv ," %"),sep="\n") # #6 rf cv

##  Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : 69.58 %

cat(paste0(" Model  7.  Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"),sep="\n")  #  7  rf #resample

##  Model  7.  Model accuracy for mlr randomForest Resample is : 68.87 %

cat(paste0(" Model 13.  Model accuracy for caret randomForest is : ", acc_caret ," %"),sep="\n")   #  13  caret

##  Model 13.  Model accuracy for caret randomForest is : 70.09 %

cat(paste0(" Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : ", 
           acc_caret_cv ," %"),sep="\n")   #  14 # caret

##  Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : 70.68 %

cat(paste0(),sep="\n")

cat(paste0("        spFSR  rpart and reduced columns spFSR rpart  "),sep="\n")

##         spFSR  rpart and reduced columns spFSR rpart

cat(paste0(),sep="\n")

cat(paste0(" Model  9.  Model accuracy for spFSR rpart is : ", acc_rf_small ," %"),sep="\n")  #    9  spFSR  rpart

##  Model  9.  Model accuracy for spFSR rpart is : 56.9 %

cat(paste0(" Model 11.  Model accuracy for spFSR rpart 4 columns is : ", acc_rp_smallspFSR ," %"),sep="\n")   #  11  rpart  #cv  small

##  Model 11.  Model accuracy for spFSR rpart 4 columns is : 56.6 %

cat(paste0(" Model 12.  Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : ", 
           acc_rf_smallspFSRresample ," %"),sep="\n")  #  12 random Forest cv small

##  Model 12.  Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : 58.2 %
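The printed figures above can also be gathered into a single table for sorting. A minimal sketch with the values copied from the output above; the `metric` column is an addition for this illustration and records how each `acc_*` variable was computed in the code (Models 13 and 14 store RMSE * 100).

```r
# Values copied from the summary output above
results <- data.frame(
  model  = 1:14,
  method = c("mlr lda", "mlr lda 5-fold CV", "mlr naive bayes", "mlr knn",
             "mlr randomForest", "mlr randomForest 5-fold CV",
             "mlr randomForest resample", "spFSR knn", "spFSR rpart",
             "mlr knn 4 columns", "spFSR rpart 4 columns",
             "spFSR rpart 4 columns 5-fold CV",
             "caret randomForest", "caret randomForest 10-fold CV"),
  value  = c(58.88, 59.61, 56.85, 61.42, 69.04, 69.58, 68.87,
             60, 56.9, 61.42, 56.6, 58.2, 70.09, 70.68),
  metric = c(rep("100 * (1 - mmce)", 12), rep("100 * RMSE", 2)),
  stringsAsFactors = FALSE
)

# Rank only the models measured on the same scale
classif <- results[results$metric == "100 * (1 - mmce)", ]
classif[order(-classif$value), c("model", "method", "value")]
```

Ranked this way, the mlr random forest with 5-fold cross-validation (Model 6) is the best of the twelve models measured by misclassification error.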

############### Summary

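The best performing model for my particular data set seemed to be caret's random forest, followed closely by mlr's. Bear in mind, though, that the caret figures (Models 13 and 14) are RMSE * 100 from a regression forest, whereas the mlr figures are 100 * (1 - mmce), so the two are not measured on the same scale. Numerous trials were undertaken, and it should be noted that at each rerun a random subset would be chosen, which would then be tested for accuracy in each ensemble.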

There were occasional cases of overfitting, but not so many cases of underfitting. It should be noted that I couldn’t find a way to assess whether caret was using the same subset as all the others: at the partitioning stage I could see the structure of the caret subset, but not the structure of the mlr or spFSR subsets.

At all stages of outlier removal there was a distinct lack of any emerging pattern; I can only assume from this that my data set was seriously noisy. Algorithm performances are stated below:

   Lowest accuracy   ----->   naive bayes  mlr     (from less than or equal to 55% acc)
                     ----->   lda      mlr         (from                 55% - 61% acc)  
                     ----->   rpart    spFSR       (from                 58% - 62% acc)
                     ----->   knn      spFSR       (from                 60% - 66% acc)
                     ----->   knn      mlr         (from                 60% - 66% acc)
                     ----->   random Forest mlr    (from                 68% - 71% acc)
   Highest accuracy  ----->   random Forest caret  (from                 75% - 82% acc)

(Of note: in my last run mlr actually scored highest, though generally caret scored highest.)

NOTE: This was only a small test of no more than 40 runs (at least 40 new subsets for testing) and should not be considered absolute; it is only what appeared to happen for my data set. Of interest, spFSR and mlr were scoring about the same accuracy for knn.

It would have been nice had I been able to get spFSR working correctly, to then trial it against mlr and caret using the same algorithms. Personally, I expected the packages to give similar results for similar algorithms, so why caret outperformed I’m not sure. I would have liked to have trialled a multiple regression algorithm and also a logarithmic algorithm, but my report was beginning to get big; at almost double the size of the expected reports, I decided to stop.

From all of the above, my go-to choices would be caret, then either mlr or spFSR.

So, what was all this about? Could I use any of these models as a way of predicting a semi-decent wine? I believe I could, as most models achieved an accuracy of around 60% or more, which for me means that of every three bottles I chose, hopefully at least two would be good or better if they contained the required physicochemical properties. From a wine producer's standpoint, if I knew the coefficients for each of the physicochemical properties from the best model and were able to alter one or more of them, I could produce better wines more often, and thus increase my profit margin.

ERRORS (The only error I could not overcome was in the spFSR feature selection part; an example is given below.)

 iter  value   st.dev  num.ft  best.value

Error in instantiateResampleInstance.CVDesc(desc, size, task) :

Cannot use more folds (5) than size (2)!

spFSR could not get past this error, and as I could not alter any spFSR parameters to work around it, the section was dropped from my assessment. I was therefore unable to use the feature selection part of spFSR.

Acknowledgements

This dataset is also available from the UCI machine learning repository, https://archive.ics.uci.edu/ml/datasets/wine+quality ; I just shared it to Kaggle for convenience. (If I am mistaken and the public license type disallows me from doing so, I will take this down at first request.) I am not the owner of this dataset.

Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.


Assign Phase 2

s3686502 Dan Enoka

24 May 2018

https://archive.ics.uci.edu/ml/datasets/wine+quality


