This data set is Obtained from and is also available from the UCI machine learning repository,

wine+quality

Of the two data sets, of which one is related to red and and the other to white vinho
verde wine samples. I have chosen to analyse the " Red variety “” which is a variant
of the Portuguese “Vinho Verde” wine.

As per site Citation Request:

Please include this citation if you plan to use this database:

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

######################## Phase 2

For phase 2 , Machine Learning will be used to determine a model that could

be used to predict the ideal chemical properties that determine

a good wine, based on the data provided from the Red Wine.csv

Definition of physiochemical: (of or pertaining to both physical and chemical properties, changes, and reactions. of or according to physical chemistry.)

Wine Quality in this data set is classified as being, if

equal to “6” = Good , equal to “7” = Good Plus ,

if equal to “8” = Very Good.

The highest value in this data set is “8”

For more information, read [Cortez et al., 2009]. Input variables (based on physicochemical tests):

The Red Wine.csv contains columns in the order as listed below

Content: (Column Names)

1 - fixed acidity 2 - volatile acidity 3 - citric acid 4 - residual sugar 5 - chlorides 6 - free sulfur dioxide 7 - total sulfur dioxide 8 - density 9 - pH 10 - sulphates 11 - alcohol 12 - quality (score between 0 and 10)

12 - quality <== this will be the target variable

  Objective: Use machine learning to determine which physiochemical 
           properties make a wine 'good'
           
           
For a complete look at each ensembles accuracy, please see the conclusion 
at the end of this report.

Overview

Almost 50% of this report is dedicated to data cleaning, normalization

and structure assessment. Once I became satisfied that the data was cleaned sufficiently

I then shuffled to randomize the selection before applying the following algorithms.

The algorithms trialled are :

1. mlr lda (linear discriminant analysis),

2. mlr Naive Bayes

3. mlr k nearest neighbour and spFSR k nearest neighbour

4. mlr Random Forest , spFSR random Forest and caret random Forest

5. spFSR rpart

I also as part of this examination trialled reducing features, by reducing the

number of columns used, as I wished to see if this change in approach

would increase my model accuracy. To acheive this I had written a small program

to select only those columns that have a correlation of less than -0.3 or greater than 0.3

14 Models were produced from the above which includes resampling and cross validation.

There are three sections to this study

Section 1. Data Preperation

Section 2. Algorithm Application

Section 3. Summary

## Loading required package: mlr

## Loading required package: ParamHelpers

## Loading required package: parallelMap

## Loading required package: parallel

## Loading required package: tictoc

## randomForest 4.6-14

## Type rfNews() to see new features/changes/bug fixes.

## -- Attaching packages ---------------------------------- tidyverse 1.2.1 --

## v ggplot2 2.2.1     v purrr   0.2.4
## v tibble  1.4.2     v dplyr   0.7.5
## v tidyr   0.8.1     v stringr 1.3.1
## v readr   1.1.1     v forcats 0.3.0

## -- Conflicts ------------------------------------- tidyverse_conflicts() --
## x dplyr::combine()  masks randomForest::combine()
## x dplyr::filter()   masks stats::filter()
## x dplyr::lag()      masks stats::lag()
## x ggplot2::margin() masks randomForest::margin()

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00      
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00      
##  Median :0.07900   Median :14.00       Median : 38.00      
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47      
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00      
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00      
##     density             pH          sulphates         alcohol     
##  Min.   :0.9901   Min.   :2.740   Min.   :0.3300   Min.   : 8.40  
##  1st Qu.:0.9956   1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50  
##  Median :0.9968   Median :3.310   Median :0.6200   Median :10.20  
##  Mean   :0.9967   Mean   :3.311   Mean   :0.6581   Mean   :10.42  
##  3rd Qu.:0.9978   3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10  
##  Max.   :1.0037   Max.   :4.010   Max.   :2.0000   Max.   :14.90  
##     quality     
##  Min.   :3.000  
##  1st Qu.:5.000  
##  Median :6.000  
##  Mean   :5.636  
##  3rd Qu.:6.000  
##  Max.   :8.000

View this data set

## ############################# ## Section 1. Data Preperation ## #############################

The following program will be used to remove outliers

## Calculate the boundaries for each column 
## then apply using the 1.5 * IQR rule, remove outliers

remove_outliers <- function(x, na.rm = TRUE, ...) 
{
  qnt <- quantile(x, probs=c(.25, .75), na.rm = na.rm, ...)
  H <- 1.5 * IQR(x, na.rm = na.rm)
  y <- x
  y[x < (qnt[1] - H)] <- NA
  y[x > (qnt[2] + H)] <- NA
  y
}

## use the above formula to remove outliers from each column

fixed.acidity <-  remove_outliers(Wine$fixed.acidity)
volatile.acidity <-  remove_outliers(Wine$volatile.acidity)
citric.acid <-  remove_outliers(Wine$citric.acid)
residual.sugar <-  remove_outliers(Wine$residual.sugar)

chlorides <-  remove_outliers(Wine$chlorides)
free.sulfur.dioxide <-  remove_outliers(Wine$free.sulfur.dioxide)
total.sulfur.dioxide <-  remove_outliers(Wine$total.sulfur.dioxide)
density <-  remove_outliers(Wine$density)

pH <-  remove_outliers(Wine$pH)
sulphates <-  remove_outliers(Wine$sulphates)
alcohol <-  remove_outliers(Wine$alcohol)
quality <-  (Wine$quality)   ##  remove_outliers

##  Combine all columns back into a Data Frame
Wine1 <- cbind.data.frame(fixed.acidity,volatile.acidity,citric.acid,
                          residual.sugar,chlorides,free.sulfur.dioxide,
                          total.sulfur.dioxide,density,pH,
                          sulphates,alcohol,quality)

##   Check for anomalies and missing values 
sum(is.na(Wine1))

## [1] 573

#sum(is.na(Wine1r))

Remove NA values and sum check

## [1] 0

View the new data set

Apply outlier removal program again (2nd time)

## [1] 141

Remove NA values and sum check

## [1] 0

View the new data set

Normalize this newest data set

#normalize the variables
Wine1d <- normalizeFeatures(Wine1c,method = "standardize")
#head(Wine1d)

View the new data set and check for outliers

Apply my outlier removal program again (3rd time), I’m not using the above normalized data set

## [1] 81

Remove NA values and sum check

## [1] 0

View the new data set

Normalize this newest data set

View the new data set, and look for outliers

This part shows the columns selected that continually have outliers

Find the lower and upper limits for the three selected columns

##                          
## min            -1.4738585
## lower quartile -0.7698727
## median         -0.1978841
## upper quartile  0.5501009
## max             2.4860621

##                          
## min            -1.6553630
## lower quartile -0.8063988
## median         -0.1999958
## upper quartile  0.5276878
## max             2.4681775

##                          
## min            -2.1642912
## lower quartile -0.7034142
## median         -0.2420846
## upper quartile  0.6036863
## max             2.5258929

Then remove the outliers

View the new columns, and look for outliers

Check the result visually with all columns added back together, there are still outliers

but I’m ok with this data set at this point

## ################################# ## Section 2. Algorithm Application ## #################################

Shuffle up the cleaned data set and check row numbers are randomly chosen

# Shuffle up the newest Data Frame 
Wine4 <- Wine2[ sample(nrow(Wine2)),]
head(Wine4) ## Check the rows are shuffled

##     fixed.acidity volatile.acidity citric.acid residual.sugar  chlorides
## 799     1.3725689       -0.9332074   0.8985385      0.9992855  0.1362148
## 382     0.5267980       -0.2617874   0.3766523      1.8009430 -0.8877200
## 983    -0.8571907        1.3252055  -1.1890062      0.4648472 -1.8328907
## 900     0.1423567        2.2102592  -0.2032212     -1.9401253  0.6875644
## 948    -0.9340790        0.8979382  -1.3629683     -0.6040295  0.2149790
## 72     -1.3954086        0.6537854   0.4346397     -1.1384678  0.8450928
##     free.sulfur.dioxide total.sulfur.dioxide     density          pH
## 799          1.61921325           1.12208941  0.58845798 -1.40647741
## 382         -0.56383759          -0.50587796 -0.01881794 -0.03151286
## 983         -0.92767940          -1.03386737 -0.91949684  0.93905035
## 900         -0.92767940          -0.37388060 -1.24701643 -1.81087875
## 948          0.89152964          -0.02188766  0.39740488  0.53464901
## 72           0.04256542           1.07809029  0.32234831  1.01993062
##      sulphates    alcohol    quality
## 799 -0.3181873 -0.5911795 -0.8610351
## 382  0.4400627  0.8689884  1.7964138
## 983  0.0609377  1.1818815  0.4676894
## 900 -1.5503437 -0.3825841 -0.8610351
## 948  1.1035316 -0.6954772 -0.8610351
## 72   1.5774379 -1.1126680 -0.8610351

#str(Wine4)  ## Check the structure hasn't changed

# A visual check that the row numbers are all randomly selected,
#  Row Numbers are the first column below
par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

Preparation of shuffled data set for use in mlr , spFSR

Wine5 <- as.data.frame(Wine4,row.names = NULL, optional = FALSE)
#str(Wine5)
#head(Wine5) ## Check the rows are shuffled 

Wine5$quality <-  as.factor(Wine5$quality)
#str(Wine5)  #check that this has been changed to factor

## 1) Define the task
## Specify the type of analysis (e.g. classification) and provide data and response variable
task = makeClassifTask(data = Wine5, target = "quality")

############# Model 1 lda : linear discriminant analysis mlr

## 2) Define the learner
## Choose a specific algorithm (e.g. lda :  linear discriminant analysis)
#K Nearest Neighbour  
#classif.featureless  
#classif.knn
#classif.lda  
#classif.ada
#classif.AdaBag
#classif.randomForest
#classif.rda 
#classif.ada 
#classif.bst
lrn = makeLearner("classif.lda")                                  ###############  Model 1

set.seed(1234)

n = nrow(Wine5)
train.set = sample(n, size = 0.8*n)
test.set = setdiff(1:n, train.set)
#n


## 3) Fit the model
## Train the learner on the task using a random subset of the data as training set
model = train(lrn, task, subset = train.set )#train.set

## 4) Make predictions
## Predict values of the response variable for new observations by the trained model
## using the other part of the data as test set
pred = predict(model, task = task, subset = test.set)


## 5) Evaluate the learner
## Calculate the mean misclassification error and accuracy
performance(pred, measures = list(mmce, acc))

##      mmce       acc 
## 0.3299492 0.6700508

acc_lda  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)

#cat(paste0("   Model accuracy for mlr classif.lda is : ", acc_lda ," %"))  #  1  lda

############# Model 2 lda : linear discriminant analysis cross validation mlr

#####----------cross validation for classif.lda--------everything below is for testing

##   use 5-fold cross-validation type:

## 5-fold cross-validation  Resample description: cross-validation with 3 iterations.
rdesc = makeResampleDesc("CV", iters = 5)
rdesc

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Calculate the performance
r = resample("classif.lda",task , rdesc)  #bh.task    regr.lm

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.4489796

## [Resample] iter 2:    0.3826531

## [Resample] iter 3:    0.3654822

## [Resample] iter 4:    0.4263959

## [Resample] iter 5:    0.4213198

##

## Aggregated Result: mmce.test.mean=0.4089661

##

## Resample Result
## Task: Wine5
## Learner: classif.lda
## Aggr perf: mmce.test.mean=0.4089661
## Runtime: 0.060045

#mse.test.mean
r$aggr

## mmce.test.mean 
##      0.4089661

                                                              ###############  Model 2

pred = getRRPredictions(r)
#pred
list(pred)

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.01
##   id              truth           response iter  set
## 1  9  0.467689382506315 -0.861035077041974    1 test
## 2 11  0.467689382506315 -0.861035077041974    1 test
## 3 16 -0.861035077041974  0.467689382506315    1 test
## 4 17    1.7964138420546    1.7964138420546    1 test
## 5 18 -0.861035077041974 -0.861035077041974    1 test
## 6 27  0.467689382506315  0.467689382506315    1 test
## ... (#rows: 983, #cols: 5)

#performance(pred, measures = list(mmce))
acc_lda_cv  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr classif.lda 5-fold cross-validation is : ", acc_lda_cv ," %"))  #  2  lda cv

Confusion Matrix

Its probably more confusing having added this in, I think

this would be more useful had I of used less features

##                     predicted
## true                 -3.51848399613855 -2.18975953659026
##   -3.51848399613855                  0                 0
##   -2.18975953659026                  0                 0
##   -0.861035077041974                 1                 0
##   0.467689382506315                  0                 2
##   1.7964138420546                    0                 0
##   3.12513830160289                   0                 0
##   -err.-                             1                 2
##                     predicted
## true                 -0.861035077041974 0.467689382506315 1.7964138420546
##   -3.51848399613855                   2                 0               0
##   -2.18975953659026                  22                 6               0
##   -0.861035077041974                291               111               6
##   0.467689382506315                 139               243              42
##   1.7964138420546                     4                60              47
##   3.12513830160289                    0                 6               1
##   -err.-                            167               183              49
##                     predicted
## true                 3.12513830160289 -err.-
##   -3.51848399613855                 0      2
##   -2.18975953659026                 0     28
##   -0.861035077041974                0    118
##   0.467689382506315                 0    183
##   1.7964138420546                   0     64
##   3.12513830160289                  0      7
##   -err.-                            0    402

############# Model 3 naive bayes mlr

##      mmce       acc 
## 0.4365482 0.5634518

############ Model 4 k nearest neighbour mlr

##      mmce       acc 
## 0.3451777 0.6548223

############# Model 5 randomForest mlr

##      mmce       acc 
## 0.2436548 0.7563452

############# Model 6 classif.randomForest cross validation mlr

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.2959184

## [Resample] iter 2:    0.2893401

## [Resample] iter 3:    0.3045685

## [Resample] iter 4:    0.2842640

## [Resample] iter 5:    0.2908163

##

## Aggregated Result: mmce.test.mean=0.2929815

##

## Resample Result
## Task: Wine5
## Learner: classif.randomForest
## Aggr perf: mmce.test.mean=0.2929815
## Runtime: 2.68957

## mmce.test.mean 
##      0.2929815

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.01
##   id              truth           response iter  set
## 1  1 -0.861035077041974 -0.861035077041974    1 test
## 2  4 -0.861035077041974 -0.861035077041974    1 test
## 3 11  0.467689382506315  0.467689382506315    1 test
## 4 13  0.467689382506315  0.467689382506315    1 test
## 5 17    1.7964138420546    1.7964138420546    1 test
## 6 20  0.467689382506315  0.467689382506315    1 test
## ... (#rows: 983, #cols: 5)

############# Model 7 randomForest resample using holdout mlr

###--------------    Resample
par(mfrow = c(1,1))
## Make predictions on both training and test sets
rdesc = makeResampleDesc("Holdout", predict = "both")

# resample to see if our errors remain in the same ballpark
r = resample("classif.randomForest", task, rdesc, show.info = FALSE)
#r

predList = getRRPredictionList(r)                                      ###############  Model 7
predList

## $train
## $train$`1`
## Prediction: 655 observations
## predict.type: response
## threshold: 
## time: 0.02
##      id              truth           response
## 301 301 -0.861035077041974 -0.861035077041974
## 155 155  0.467689382506315  0.467689382506315
## 675 675  0.467689382506315  0.467689382506315
## 463 463    1.7964138420546    1.7964138420546
## 368 368 -0.861035077041974 -0.861035077041974
## 764 764  0.467689382506315  0.467689382506315
## ... (#rows: 655, #cols: 3)
## 
## 
## $test
## $test$`1`
## Prediction: 328 observations
## predict.type: response
## threshold: 
## time: 0.02
##      id              truth          response
## 789 789 -0.861035077041974 0.467689382506315
## 423 423  0.467689382506315 0.467689382506315
## 887 887    1.7964138420546 0.467689382506315
## 206 206    1.7964138420546 0.467689382506315
## 794 794  0.467689382506315   1.7964138420546
## 724 724  0.467689382506315 0.467689382506315
## ... (#rows: 328, #cols: 3)

#Below we calculate the mean misclassification error (mmce) on the training and the test data sets.
mmceTrainMean = setAggregation(mmce, train.mean)
rdesc = makeResampleDesc("CV", iters = 5, predict = "both")
r = resample("classif.randomForest", task, rdesc, measures = list(mmce, mmceTrainMean))#classif.rpart

## Resampling: cross-validation

## Measures:             mmce.train   mmce.test

## [Resample] iter 1:    0.0000000    0.2908163

## [Resample] iter 2:    0.0000000    0.3654822

## [Resample] iter 3:    0.0000000    0.2653061

## [Resample] iter 4:    0.0000000    0.2944162

## [Resample] iter 5:    0.0000000    0.2791878

##

## Aggregated Result: mmce.test.mean=0.2990417,mmce.train.mean=0.0000000

##

#r
M <- unlist(r$aggr)
#M[1]
acc_ranFor_resample  <-  round( (1-M[1])*100,2)
#acc_ranFor_resample  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_lda_cv
#cat(paste0("   Model accuracy for mlr randomForest Resample is : ", acc_ranFor_resample ," %"))  #  7  random #Forest resample

############# Model 8 k nearest neighbour using spFSR

#############-------------- using  spFSR 

#head(Wine5)    ## a before check  use wine4 
data <- Wine5
#head(data)     ## an after check

## last column is the target variable Y
Y <- data %>% pull(quality)   

## other columns make up the feature matrix
X <- data %>% select(-quality)

## set the MLR classification task
my.task <- makeClassifTask(data = cbind(X, Y), target = "Y")

# View(my.task)  ## have a wee peek to see whats what
# str(my.task)

## set the performance measure
my.measure <- mmce ## mean misclassification error

## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.knn", k = 1)

## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")

################################################                         ###############  Model 8
### compute performance with full set of features

my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3)    ###folds = 5  <- changed the folds  3

repcv.full <- resample(my.wrapper, 
                       my.task, 
                       my.rdesc,
                       measures = my.measure)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3963415

## [Resample] iter 2:    0.4146341

## [Resample] iter 3:    0.3394495

## [Resample] iter 4:    0.3597561

## [Resample] iter 5:    0.3932927

## [Resample] iter 6:    0.4495413

## [Resample] iter 7:    0.3993902

## [Resample] iter 8:    0.3700306

## [Resample] iter 9:    0.3689024

##

## Aggregated Result: mmce.test.mean=0.3879265

##

result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))

## Repeated CV error with full set of features = 38.8

acc_spFSR  <-  round( 100 - (round(result.full.mean, 3))*100)
#acc_rf_small
#cat(paste0("   Model accuracy for spFSR k nearest neighbour is : ", acc_spFSR ," %"))  #   8  spFSR knn

############# Model 9 rpart using spFSR

## set the wrapper classification algorithm
my.wrapper <- makeLearner("classif.rpart")  #, k = 1)

## you can try other algorithms as well
### my.wrapper <- makeLearner("classif.rpart", minsplit = 5, cp = 0, xval = 0)
### my.wrapper <- makeLearner("classif.svm")
### my.wrapper <- makeLearner("classif.naiveBayes")

################################################
### compute performance with full set of features

my.rdesc <- makeResampleDesc("RepCV", folds = 3, reps = 3)    ###folds = 5  <- changed the folds  3

repcv.full <- resample(my.wrapper,                                 ###############  Model 9
                       my.task, 
                       my.rdesc,
                       measures = my.measure)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3993902

## [Resample] iter 2:    0.4237805

## [Resample] iter 3:    0.4250765

## [Resample] iter 4:    0.4525994

## [Resample] iter 5:    0.4146341

## [Resample] iter 6:    0.4207317

## [Resample] iter 7:    0.4268293

## [Resample] iter 8:    0.4176829

## [Resample] iter 9:    0.4373089

##

## Aggregated Result: mmce.test.mean=0.4242259

##

result.full.mean <- mean(repcv.full$measures.test[[2]])
cat('Repeated CV error with full set of features =', 100 * round(result.full.mean, 3))

## Repeated CV error with full set of features = 42.4

acc_rf_small  <-  round(( 1 - (result.full.mean))*100,2)
# acc_rf_small  <-  round(( 1 - (performance(pred, measures = list(mmce))))*100,2)
#acc_rf_small
#cat(paste0("   Model accuracy for spFSR rpart is : ", acc_rf_small ," %"))  #    9  spFSR  rpart

Reduced columns ( features ) section

## # A tibble: 11 x 2
##    rowname               quality
##    <chr>                   <dbl>
##  1 fixed.acidity         0.146  
##  2 volatile.acidity     -0.359  
##  3 citric.acid           0.246  
##  4 residual.sugar        0.0445 
##  5 chlorides            -0.147  
##  6 free.sulfur.dioxide   0.00573
##  7 total.sulfur.dioxide -0.184  
##  8 density              -0.212  
##  9 pH                   -0.109  
## 10 sulphates             0.447  
## 11 alcohol               0.486

Based on tha above correlation I will select certain columns,

bind them , and then use this new data set to see if I could

improve my model accuracy. This next piece will only show

the columns selected for binding.

## [1] "   "

## [1] "volatile.acidity"

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "   "

## [1] "sulphates"

## [1] "alcohol"

View the new data set

Remove NA values and sum check

#set.seed(1234)
Wine8 <- as.data.frame(Wine8 %>%         ## omit missing values
                          na.omit() %>% 
                          select_if(is.numeric))
#head(Wine8)
##   Check for anomalies and missing values 
sum(is.na(Wine8))

## [1] 0

#normalize the variables
Wine8 <- normalizeFeatures(Wine8,method = "standardize")

############# Model 10 knn with reduced column Number (4 columns)

## 'data.frame':    983 obs. of  4 variables:
##  $ volatile.acidity: num  -0.933 -0.262 1.325 2.21 0.898 ...
##  $ sulphates       : num  -0.3182 0.4401 0.0609 -1.5503 1.1035 ...
##  $ alcohol         : num  -0.591 0.869 1.182 -0.383 -0.695 ...
##  $ quality         : Factor w/ 6 levels "-3.51848399613855",..: 3 5 4 3 3 3 4 3 4 4 ...

##      mmce       acc 
## 0.3451777 0.6548223

############# Model 11 rpart with reduced column Number (4 columns)

## Resampling: repeated cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.4603659

## [Resample] iter 2:    0.4207317

## [Resample] iter 3:    0.3914373

## [Resample] iter 4:    0.4115854

## [Resample] iter 5:    0.4097859

## [Resample] iter 6:    0.4054878

## [Resample] iter 7:    0.4176829

## [Resample] iter 8:    0.4054878

## [Resample] iter 9:    0.3883792

##

## Aggregated Result: mmce.test.mean=0.4123271

##

## Repeated CV error with full set of features = 41.2

############# Model 12 cross validation for rpart (4 columns)

## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE

## Resampling: cross-validation

## Measures:             mmce

## [Resample] iter 1:    0.3520408

## [Resample] iter 2:    0.4438776

## [Resample] iter 3:    0.3553299

## [Resample] iter 4:    0.5126904

## [Resample] iter 5:    0.4213198

##

## Aggregated Result: mmce.test.mean=0.4170517

##

## Resample Result
## Task: Wine5
## Learner: classif.rpart
## Aggr perf: mmce.test.mean=0.4170517
## Runtime: 0.196188

## mmce.test.mean 
##      0.4170517

## [[1]]
## Resampled Prediction for:
## Resample description: cross-validation with 5 iterations.
## Predict: test
## Stratification: FALSE
## predict.type: response
## threshold: 
## time (mean): 0.00
##   id              truth           response iter  set
## 1 13  0.467689382506315  0.467689382506315    1 test
## 2 17    1.7964138420546    1.7964138420546    1 test
## 3 18 -0.861035077041974 -0.861035077041974    1 test
## 4 21  0.467689382506315  0.467689382506315    1 test
## 5 24  0.467689382506315  0.467689382506315    1 test
## 6 25  0.467689382506315  0.467689382506315    1 test
## ... (#rows: 983, #cols: 5)

My conclusion at this point is , my data is just to noisy. Even with a smaller data set to work with

the results are similar to all the above. The upside here is my confusion matrix is a bit easier to

read, actually I wonder if the columns showing the listing of errors would be better if removed for

the next run and so on.

##                     predicted
## true                 -3.51848399613855 -2.18975953659026
##   -3.51848399613855                  0                 0
##   -2.18975953659026                  0                 0
##   -0.861035077041974                 0                 0
##   0.467689382506315                  0                 0
##   1.7964138420546                    0                 0
##   3.12513830160289                   0                 0
##   -err.-                             0                 0
##                     predicted
## true                 -0.861035077041974 0.467689382506315 1.7964138420546
##   -3.51848399613855                   2                 0               0
##   -2.18975953659026                  24                 4               0
##   -0.861035077041974                281               125               3
##   0.467689382506315                 142               253              31
##   1.7964138420546                     8                64              39
##   3.12513830160289                    0                 5               2
##   -err.-                            176               198              36
##                     predicted
## true                 3.12513830160289 -err.-
##   -3.51848399613855                 0      2
##   -2.18975953659026                 0     28
##   -0.861035077041974                0    128
##   0.467689382506315                 0    173
##   1.7964138420546                   0     72
##   3.12513830160289                  0      7
##   -err.-                            0    410

######## The caret section

############# Model 13 caret randomForest

par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting
###############################################

#####       I couldn't load caret library with the rest up top as this has a 'train' call similar to 
#####       mlr that is  why its been loaded here as the caret 'train, over rides the mlr 
#####       'train' call

#---------Try a different approach-----------with the algorithm random forest
#--       and switch back to wine 4 (the cleaned version of data set)

library(caret) # hyperparameter tuning

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

## The following object is masked from 'package:mlr':
## 
##     train

set.seed(1234)

# train/test split P is proportion I'll need to increase for my data set 
training_indexs <- createDataPartition(Wine4$quality, p = .8, list = F)
training <- Wine4[training_indexs, ]
testing  <- Wine4[-training_indexs, ]

# get predictors
predictors <- training %>% select(-quality) %>% as.matrix()
output <- training$quality

#library(randomForest) # for our model
# train a random forest model
model <- randomForest(x = predictors, y = output,
                      ntree = 20) # number of trees
# as the number of trees are increased
# the % variance does decrease slightly

# check out the details
#model
par(mfrow = c(1,1))   ##  Reset columns to original setting
par(mar=c(5,4,4,2))   ##  Reset margins to Original setting

plot(model, col='red')

print(model)   ## summerize the model

## 
## Call:
##  randomForest(x = predictors, y = output, ntree = 20) 
##                Type of random forest: regression
##                      Number of trees: 20
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.6181932
##                     % Var explained: 39.06

# check out our model's root mean squared error on the held out test data
rmse(predict(model, testing), testing$quality)    #getting around 76.37035 % here usually

## [1] 0.7350491

acc_caret <- ( 1 - (rmse(predict(model, testing), testing$quality)))       ###############  Model 13
acc_caret <- round(( 1 - acc_caret)*100, 2)
#cat(paste0("   Model accuracy for caret randomForest is : ", acc_caret ," %"))   #  13  caret

############# Model 14 caret randomForest cross validation

## [1] 0.7484778

Section 3. Summary

############### Summary

## Model  1.  Model accuracy for mlr classif.lda is : 67.01 %

## Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.1 %

## Model  3.  Model accuracy for mlr naive bayes is : 56.35 %

## Model  4.  Model accuracy for mlr k nearest neighbour is : 65.48 %

## Model  5.  Model accuracy for mlr randomForest is : 75.63 %

## Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : 70.7 %

## Model  7.  Model accuracy for mlr randomForest Resample is : 70.1 %

## Model  8.  Model accuracy for spFSR k nearest neighbour is : 61 %

## Model  9.  Model accuracy for spFSR rpart is : 57.58 %

## Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : 65.48 %

## Model 11.  Model accuracy for spFSR rpart 4 columns is : 58.8 %

## Model 12.  Model accuracy for spFSR randomForest, 4 columns,  5-fold cross-validation is : 58.3 %

## Model 13.  Model accuracy for caret randomForest is : 73.5 %

## Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : 74.85 %

##   The section below is the above section, grouped together for easier comparison of Like vs Like

##         mlr lda (linear discriminant analysis)

##  Model  1.  Model accuracy for mlr classif.lda is : 67.01 %

##  Model  2.  Model accuracy for mlr classif.lda 5-fold cross-validation is : 59.1 %

##         mlr naive Bayes

##  Model  3.  Model accuracy for mlr naive bayes is : 56.35 %

##         mlr k nearest neighbour (knn) also includes reduced columns mlr knn and spFSR knn

##  Model  4.  Model accuracy for mlr k nearest neighbour is : 65.48 %

##  Model 10.  Model accuracy for mlr k nearest neighbour 4 columns is : 65.48 %

##  Model  8.  Model accuracy for spFSR k nearest neighbour is : 61 %

##         mlr randomForest (rf) also includes caret (rf)

##  Model  5.  Model accuracy for mlr randomForest is : 75.63 %

##  Model  6.  Model accuracy for mlr randomForest 5-fold cross-validation is : 70.7 %

##  Model  7.  Model accuracy for mlr randomForest Resample is : 70.1 %

##  Model 13.  Model accuracy for caret randomForest is : 73.5 %

##  Model 14.  Model accuracy for caret randomForest 10-fold cross validation is : 74.85 %

##         spFSR  rpart and reduced columns spFSR rpart

##  Model  9.  Model accuracy for spFSR rpart is : 57.58 %

##  Model 11.  Model accuracy for spFSR rpart 4 columns is : 58.8 %

##  Model 12.  Model accuracy for spFSR rpart, 4 columns,  5-fold cross-validation is : 58.3 %

############### Summary

The best performing model for my particular data set seemed to be caret followed closely by mlr. Numerous trailing was undertaken and it should be noted that at each rerun a random subset would be chosen which would then be tested for accuracy in each ensemble.

There were cases of occasionally over fitting, but not so many cases on under fitting the model. It should be noted that I couldn’t find a way to assess whether caret was using the same subset as all the others, simply because at the partitioning section, while I could see the structure of the caret subset I couldn’t see the structure of the mlr or spFSR subset.

At all stages of outlier removal for my data set there was a distinct lack of any form of pattern emergence, I can only assume from this that my data set was seriously noisy. Algorithm performances are stated below :

   Lowest accuracy   ----->   naive bayes  mlr     (from less than or equal to 55% acc)
                     ----->   lda      mlr         (from                 55% - 61% acc)  
                     ----->   rpart    spFSR       (from                 58% - 62% acc)
                     ----->   knn      spFSR       (from                 60% - 66% acc)
                     ----->   knn      mlr         (from                 60% - 66% acc)
                     ----->   random Forest mlr    (from                 68% - 71% acc)
   Highest accuracy  ----->   random Forest caret  (from                 75% - 82% acc)

(of note: In my last run, mlr actually scored highest, generally though caret scored highest)

NOTE: This was only for a small test of no more than 40 runs (at least 40 new subsets for testing) And should not be considered absolute, this is only what appeared to happen for my data set. Of interest, spFSR and mlr were scoring about the same for accuracy for knn.

It would have been nice had I have been able to get spFSR working correctly to then trail it against mlr and caret using the same algorithms. personally I expected the algorithms to all be similar in their results for similar algorithms, so why caret out performed I’m not sure. I would have liked to have trialed a multiple regression algorithm and also have trialed a logarithmic algorithm, but my report was beginning to get big, at almost double the size of the expected reports I therefore decided to stop.

From all of the above, my goto choices would be caret then either mlr or spFSR.

So, what was all this about ? Could I use any of these model’s as a way of predicting a semi decent wine ? I believe I could, as most models fell within a 60% plus accuracy rating which for me mean’s for each three bottles I chose hopefully at least 2 would be good or better if they contained the required physio-chemical’s. From a wine producers standpoint, if I knew the coefficients for each of the physio-chemical’s from the best model and I were able to alter one of more of the physio-chemical’s, I could produce better wines more often, and thus increase my profit margin.

ERRORS (The only error I could not overcome was in the spFSR feature selection part, an example is given below)

 iter  value   st.dev  num.ft  best.value

Error in instantiateResampleInstance.CVDesc(desc, size, task) :

Cannot use more folds (5) than size (2)!

spFSR could not get past this error, and as I could not alter any spFSR parameters because of this, the section was dropped from my assessment. I was therefore unable to use the Feature Selection part of spFSR.

Acknowledgements

This dataset is also available from the UCI machine learning repository,

https://archive.ics.uci.edu/ml/datasets/wine+quality , I just shared it to kaggle

for convenience. (I am mistaken and the public license type disallowed me from doing

so, I will take this down at first request. I am not the owner of this dataset.

Please include this citation if you plan to use this database: P. Cortez, A. Cerdeira,

F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from

physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553, 2009.

Relevant publication

P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by

data mining from physicochemical properties. In Decision Support Systems, Elsevier,

Assign Phase 2

s3686502 Dan Enoka

24 May 2018

https://archive.ics.uci.edu/ml/datasets/wine+quality

As per site Citation Request:

######################## Phase 2

For phase 2 , Machine Learning will be used to determine a model that could

be used to predict the ideal chemical properties that determine

a good wine, based on the data provided from the Red Wine.csv

Wine Quality in this data set is classified as being, if

equal to “6” = Good , equal to “7” = Good Plus ,

if equal to “8” = Very Good.

The highest value in this data set is “8”

The Red Wine.csv contains columns in the order as listed below

12 - quality <== this will be the target variable

Almost 50% of this report is dedicated to data cleaning, normalization

and structure assessment. Once I became satisfied that the data was cleaned sufficiently

I then shuffled to randomize the selection before applying the following algorithms.

The algorithms trialled are :

1. mlr lda (linear discriminant analysis),

2. mlr Naive Bayes

3. mlr k nearest neighbour and spFSR k nearest neighbour

4. mlr Random Forest , spFSR random Forest and caret random Forest

5. spFSR rpart

I also as part of this examination trialled reducing features, by reducing the

number of columns used, as I wished to see if this change in approach

would increase my model accuracy. To acheive this I had written a small program

to select only those columns that have a correlation of less than -0.3 or greater than 0.3

14 Models were produced from the above which includes resampling and cross validation.

There are three sections to this study

Section 1. Data Preperation

Section 2. Algorithm Application

Section 3. Summary

View this data set

The following program will be used to remove outliers

Remove NA values and sum check

View the new data set

Apply outlier removal program again (2nd time)

Remove NA values and sum check

View the new data set

Normalize this newest data set

View the new data set and check for outliers

Apply my outlier removal program again (3rd time), I’m not using the above normalized data set

Remove NA values and sum check

View the new data set

Normalize this newest data set

View the new data set, and look for outliers

This part shows the columns selected that continually have outliers

Find the lower and upper limits for the three selected columns

Then remove the outliers

View the new columns, and look for outliers

Check the result visually with all columns added back together, there are still outliers

but I’m ok with this data set at this point

Shuffle up the cleaned data set and check row numbers are randomly chosen

Preparation of shuffled data set for use in mlr , spFSR

############# Model 1 lda : linear discriminant analysis mlr

############# Model 2 lda : linear discriminant analysis cross validation mlr

Confusion Matrix

Its probably more confusing having added this in, I think

this would be more useful had I of used less features

############# Model 3 naive bayes mlr

############ Model 4 k nearest neighbour mlr

############# Model 5 randomForest mlr

############# Model 6 classif.randomForest cross validation mlr

############# Model 7 randomForest resample using holdout mlr

############# Model 8 k nearest neighbour using spFSR

############# Model 9 rpart using spFSR

Reduced columns ( features ) section

Based on tha above correlation I will select certain columns,

bind them , and then use this new data set to see if I could

improve my model accuracy. This next piece will only show

the columns selected for binding.

View the new data set

Remove NA values and sum check

############# Model 10 knn with reduced column Number (4 columns)

############# Model 11 rpart with reduced column Number (4 columns)

############# Model 12 cross validation for rpart (4 columns)

My conclusion at this point is , my data is just to noisy. Even with a smaller data set to work with

the results are similar to all the above. The upside here is my confusion matrix is a bit easier to

read, actually I wonder if the columns showing the listing of errors would be better if removed for