I am using this wine quality dataset to demonstrate the application of some common machine learning tools. The dataset was downloaded from the UCI Machine Learning Repository (https://archive.ics.uci.edu/ml/index.php). The original dataset was published in: P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236.

A full description of this dataset is available on the UCI website.


There are two questions in this analysis. The first is to predict wine quality from the input variables. The second is to specifically classify wines of excellent quality, defined as wines with quality >= 7.

For the first question, I will treat wine quality as a continuous variable and build models such as linear regression, random forest, and support vector machine to predict it. This part is presented in Session 1 of this analysis.

For the second question, I will transform the quality variable into a binary variable and build models such as logistic regression and the lasso to perform classification. This part is presented in Session 2 of this analysis.



R environment preparation and customized functions

rm(list = ls())
# For aesthetic purposes, I usually put this part in a separate .R file, which can be called through the source() function. I am displaying it here just for the purpose of showing all used code.

packages = c("tidyverse", "RCurl", "psych", "stats", 
             "randomForest", "glmnet", "caret","kernlab", 
             "rpart", "rpart.plot", "neuralnet", "C50",
             "doParallel", "AUC", "ggfortify")
if (length(setdiff(packages, rownames(installed.packages()))) > 0) {
  install.packages(setdiff(packages, rownames(installed.packages())))  
}
invisible(lapply(packages, require, character.only = TRUE))

# customized function to evaluate model performance for continuous predictors
eval = function(pred, true, plot = F, title = "") {
  rmse = sqrt(mean((pred - true)^2))
  mae = mean(abs(pred - true))
  cor = cor(pred, true)
  if (plot == TRUE) {
    par(mfrow = c(1,2), oma = c(0, 0, 2, 0))
    diff = pred - true
    plot(jitter(true, factor = 1), 
         jitter(pred, factor = 0.5), #jitter so that we can see overlapped dots
         pch = 3, asp = 1,
         xlab = "Truth", ylab = "Predicted") 
    abline(0,1, lty = 2)
    hist(diff, breaks = 20, main = NULL)
    mtext(paste0(title, " predicted vs. true using test set"), outer = TRUE)
    par(mfrow = c(1,1))}
  return(list(rmse = rmse,
              mae = mae,
              cor = cor))
}
# customized function to evaluate model performance for binary predictors
eval_class = function(prob, true, plot = F, title = "") {
    # find cutoff with the best kappa
    cuts = seq(0.01, 0.99, by=0.01)
    kappa = c()
    for (cut in cuts){
      cat = as.factor(ifelse(prob >= cut, 1, 0))
      cm = confusionMatrix(cat, true, positive = "1")
      kappa = c(kappa, cm$overall[["Kappa"]])
    }
    opt.cut = cuts[which.max(kappa)]
    
    # make predictions based on best kappa
    pred = as.factor(ifelse(prob >= opt.cut, 1, 0))
    confM = confusionMatrix(pred, true, positive = "1")
    
    # calculate AUC
    roc = roc(as.vector(prob), as.factor(true))
    auc = round(AUC::auc(roc),3)
    
    if (plot==T){
      # plot area under the curve
      par(mfrow = c(1,2), oma = c(0, 0, 2, 0))
      plot(roc, main = "AUC curve"); abline(0,1)
      text(0.8, 0.2, paste0("AUC = ", auc))
      
      # plot confusion matrix
      tab = table(true, pred)
      plot(tab,
           xlab = "Truth",
           ylab = "Predicted",
           main = "Confusion Matrix")
      text(0.9, 0.9, paste0('FN:', tab[2,1]))
      text(0.9, 0.05, paste0('TP:', tab[2,2]))
      text(0.1, 0.9, paste0('TN:', tab[1,1]))
      text(0.1, 0.05, paste0('FP:', tab[1,2]))
      mtext(paste0(title, " predicted vs. true using test set"), outer = TRUE)
      par(mfrow = c(1,1))
      }
    return(list(auc=auc, 
                confusionMatrix = confM))
}
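
To illustrate how these two helpers are meant to be called, here is a quick toy example on hypothetical simulated data (the numbers below are made up for illustration and are not part of the wine analysis):

# toy data for illustration only -- not part of the wine analysis
set.seed(42)
truth.c = rnorm(100)
pred.c  = truth.c + rnorm(100, sd = 0.5)              # noisy continuous predictions
eval(pred.c, truth.c, plot = TRUE, title = "toy: ")   # returns list(rmse, mae, cor)

truth.b = as.factor(c(0, 1, rbinom(98, 1, 0.3)))      # binary truth with both classes present
prob.b  = c(0.005, 0.995, runif(98))                  # toy probabilities covering both extremes
eval_class(prob.b, truth.b, plot = TRUE, title = "toy: ")  # returns list(auc, confusionMatrix)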

Session 1. For the purpose of prediction

1. Data importing and processing

myfile = getURL('https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv')
raw = read.csv(textConnection(myfile), header = T, sep = ";")
n = nrow(raw); p = ncol(raw); dim(raw)
## [1] 4898   12
head(raw) # check the first few lines
##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1           7.0             0.27        0.36           20.7     0.045                  45
## 2           6.3             0.30        0.34            1.6     0.049                  14
## 3           8.1             0.28        0.40            6.9     0.050                  30
## 4           7.2             0.23        0.32            8.5     0.058                  47
## 5           7.2             0.23        0.32            8.5     0.058                  47
## 6           8.1             0.28        0.40            6.9     0.050                  30
##   total.sulfur.dioxide density   pH sulphates alcohol quality
## 1                  170  1.0010 3.00      0.45     8.8       6
## 2                  132  0.9940 3.30      0.49     9.5       6
## 3                   97  0.9951 3.26      0.44    10.1       6
## 4                  186  0.9956 3.19      0.40     9.9       6
## 5                  186  0.9956 3.19      0.40     9.9       6
## 6                   97  0.9951 3.26      0.44    10.1       6
str(raw) # check the general structure
## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : int  6 6 6 6 6 6 6 6 6 6 ...
summary(raw) # check the summary
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar     chlorides      
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600   Min.   :0.00900  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700   1st Qu.:0.03600  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200   Median :0.04300  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391   Mean   :0.04577  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900   3rd Qu.:0.05000  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800   Max.   :0.34600  
##  free.sulfur.dioxide total.sulfur.dioxide    density             pH          sulphates     
##  Min.   :  2.00      Min.   :  9.0        Min.   :0.9871   Min.   :2.720   Min.   :0.2200  
##  1st Qu.: 23.00      1st Qu.:108.0        1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100  
##  Median : 34.00      Median :134.0        Median :0.9937   Median :3.180   Median :0.4700  
##  Mean   : 35.31      Mean   :138.4        Mean   :0.9940   Mean   :3.188   Mean   :0.4898  
##  3rd Qu.: 46.00      3rd Qu.:167.0        3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500  
##  Max.   :289.00      Max.   :440.0        Max.   :1.0390   Max.   :3.820   Max.   :1.0800  
##     alcohol         quality     
##  Min.   : 8.00   Min.   :3.000  
##  1st Qu.: 9.50   1st Qu.:5.000  
##  Median :10.40   Median :6.000  
##  Mean   :10.51   Mean   :5.878  
##  3rd Qu.:11.40   3rd Qu.:6.000  
##  Max.   :14.20   Max.   :9.000

The dataset appears to be very clean, with no missing data and a clear structure. All variables are numeric. The ranges of the independent variables vary greatly, so I will normalize them to a common range before building the models.
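
Concretely, the min-max normalization used later maps each independent variable \(x\) to the unit interval:

\[ x_{norm} = \frac{x - \min(x)}{\max(x) - \min(x)} \]

For the test set, \(\min(x)\) and \(\max(x)\) are taken from the training set, so its normalized values can fall slightly outside \([0, 1]\).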

Next, I will check the pairwise relationships among the variables. As the following figure shows, there is no clear linear relationship between the quality variable and the other covariates, indicating that a simple linear regression might not work well. In addition, there is some collinearity among the covariates. Both issues violate the assumptions of a linear model.

pairs.panels(raw)

I will split the dataset into a training and a testing set, normalize the training set to the 0 ~ 1 range, and then normalize the testing set using the training set's minima and maxima so that predictions are comparable.

set.seed(1)
idx = sample(n, 0.9*n)
train = raw[idx,]; dim(train)
## [1] 4408   12
test = raw[-idx,]; dim(test)
## [1] 490  12
# normalize train set so that the range is 0 ~ 1
normalize_train = function(x) (x - min(x))/(max(x) - min(x))
train.norm = data.frame(apply(train[,-p], 2, normalize_train), 
                        quality = train[,p])
summary(train.norm)
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar      chlorides      
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Min.   :0.00000   Min.   :0.00000  
##  1st Qu.:0.2404   1st Qu.:0.1275   1st Qu.:0.1627   1st Qu.:0.01687   1st Qu.:0.07186  
##  Median :0.2885   Median :0.1765   Median :0.1867   Median :0.07055   Median :0.09281  
##  Mean   :0.2939   Mean   :0.1946   Mean   :0.2014   Mean   :0.08879   Mean   :0.10075  
##  3rd Qu.:0.3365   3rd Qu.:0.2353   3rd Qu.:0.2289   3rd Qu.:0.14264   3rd Qu.:0.11377  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000   Max.   :1.00000   Max.   :1.00000  
##  free.sulfur.dioxide total.sulfur.dioxide    density              pH           sulphates     
##  Min.   :0.00000     Min.   :0.0000       Min.   :0.00000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.07317     1st Qu.:0.2297       1st Qu.:0.08849   1st Qu.:0.3372   1st Qu.:0.2405  
##  Median :0.11150     Median :0.2900       Median :0.12763   Median :0.4220   Median :0.3165  
##  Mean   :0.11600     Mean   :0.2999       Mean   :0.13321   Mean   :0.4298   Mean   :0.3410  
##  3rd Qu.:0.15331     3rd Qu.:0.3689       3rd Qu.:0.17332   3rd Qu.:0.5138   3rd Qu.:0.4177  
##  Max.   :1.00000     Max.   :1.0000       Max.   :1.00000   Max.   :1.0000   Max.   :1.0000  
##     alcohol          quality     
##  Min.   :0.0000   Min.   :3.000  
##  1st Qu.:0.2419   1st Qu.:5.000  
##  Median :0.3871   Median :6.000  
##  Mean   :0.4068   Mean   :5.883  
##  3rd Qu.:0.5484   3rd Qu.:6.000  
##  Max.   :1.0000   Max.   :9.000
# normalize test set using the values from train set to make prediction comparable
train.min = apply(train[,-p], 2, min)
train.max = apply(train[,-p], 2, max)
test.norm = data.frame(sweep(test, 2, c(train.min, 0)) %>% 
                         sweep(2, c(train.max-train.min, 1), FUN = "/"))
summary(test.norm) # test.norm might have data out of range 0~1, since it's normalized against the training set.
##  fixed.acidity     volatile.acidity    citric.acid     residual.sugar       chlorides        
##  Min.   :0.05769   Min.   :0.004902   Min.   :0.0000   Min.   :0.001534   Min.   :-0.008982  
##  1st Qu.:0.24038   1st Qu.:0.127451   1st Qu.:0.1627   1st Qu.:0.015337   1st Qu.: 0.071856  
##  Median :0.28846   Median :0.176471   Median :0.1928   Median :0.069018   Median : 0.089820  
##  Mean   :0.29242   Mean   :0.192047   Mean   :0.2008   Mean   :0.089140   Mean   : 0.104369  
##  3rd Qu.:0.33654   3rd Qu.:0.235294   3rd Qu.:0.2410   3rd Qu.:0.141104   3rd Qu.: 0.113772  
##  Max.   :0.57692   Max.   :0.666667   Max.   :0.4458   Max.   :0.475460   Max.   : 0.682635  
##  free.sulfur.dioxide total.sulfur.dioxide    density               pH            sulphates      
##  Min.   :0.003484    Min.   :0.04408      Min.   :0.005591   Min.   :0.06422   Min.   :0.05063  
##  1st Qu.:0.070557    1st Qu.:0.23492      1st Qu.:0.090418   1st Qu.:0.33945   1st Qu.:0.24051  
##  Median :0.108014    Median :0.29002      Median :0.128976   Median :0.42202   Median :0.32911  
##  Mean   :0.116572    Mean   :0.30213      Mean   :0.134698   Mean   :0.42756   Mean   :0.34650  
##  3rd Qu.:0.149826    3rd Qu.:0.36137      3rd Qu.:0.173077   3rd Qu.:0.51147   3rd Qu.:0.41772  
##  Max.   :0.449477    Max.   :0.70534      Max.   :0.447079   Max.   :1.00917   Max.   :1.08861  
##     alcohol          quality     
##  Min.   :0.0000   Min.   :3.000  
##  1st Qu.:0.2419   1st Qu.:5.000  
##  Median :0.3710   Median :6.000  
##  Mean   :0.3939   Mean   :5.833  
##  3rd Qu.:0.5161   3rd Qu.:6.000  
##  Max.   :0.9516   Max.   :8.000

2. Linear Regression

Linear regression is rarely the best model for this kind of data, but it is a reasonable starting point that gives a basic sense of the relationship between the independent variables and the dependent variable.

First, I will check the normality of the dependent variable using the Shapiro-Wilk test.

hist(raw$quality)  

shapiro.test(raw$quality) #Didn't pass normality test, so linear model may have a problem
## 
##  Shapiro-Wilk normality test
## 
## data:  raw$quality
## W = 0.88904, p-value < 2.2e-16

The dependent variable doesn’t pass the normality test, so one assumption of linear regression is not met. In addition, as the pairwise plot shows, the relationships between the independent variables and the dependent variable are not entirely linear, and there is some collinearity among the independent variables. Any of these issues could degrade the performance of the linear model.

Next I will fit a linear model on the training set, apply it to the test set, and visualize the predicted values against the true values. I will also evaluate model performance using 3 measures: RMSE (root mean squared error), MAE (mean absolute error), and cor (Pearson correlation). Smaller RMSE and MAE and a larger cor indicate a better prediction.
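
For reference, with predicted values \(\hat{y}_i\) and true values \(y_i\) over the \(n\) test observations, the eval() function above computes

\[ \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(\hat{y}_i - y_i)^2}, \qquad \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\lvert \hat{y}_i - y_i \rvert, \]

and cor is the Pearson correlation between \(\hat{y}\) and \(y\).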

tr.lm = lm(quality~., data = train.norm)
summary(tr.lm)
## 
## Call:
## lm(formula = quality ~ ., data = train.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.9052 -0.4946 -0.0378  0.4684  3.0895 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           5.522913   0.112623  49.039  < 2e-16 ***
## fixed.acidity         0.815891   0.227717   3.583 0.000343 ***
## volatile.acidity     -1.937979   0.122947 -15.763  < 2e-16 ***
## citric.acid          -0.007099   0.167319  -0.042 0.966157    
## residual.sugar        5.497204   0.515538  10.663  < 2e-16 ***
## chlorides            -0.075770   0.197262  -0.384 0.700916    
## free.sulfur.dioxide   1.170572   0.259457   4.512 6.60e-06 ***
## total.sulfur.dioxide -0.148746   0.173382  -0.858 0.390990    
## density              -8.026818   1.032594  -7.773 9.45e-15 ***
## pH                    0.762083   0.120353   6.332 2.66e-10 ***
## sulphates             0.536491   0.084862   6.322 2.84e-10 ***
## alcohol               1.194482   0.156881   7.614 3.23e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.756 on 4396 degrees of freedom
## Multiple R-squared:  0.2858, Adjusted R-squared:  0.284 
## F-statistic: 159.9 on 11 and 4396 DF,  p-value: < 2.2e-16
tr.lm.pred = predict(tr.lm, test.norm[,-p])
tr.lm.eval = eval(tr.lm.pred, test.norm$quality, plot = T, title = "lm: "); unlist(tr.lm.eval)

##      rmse       mae       cor 
## 0.7097731 0.5602135 0.4901282

As the model summary above shows, the \(R^2\) of the model is only 0.28, so this model doesn’t explain much of the variance. Comparing predicted and true quality on the test set, the RMSE is 0.71 and the MAE is 0.56; considering that quality ranges from 0 to 10, this level of error is not too bad. As the plot above shows, the model performs worse at predicting extreme cases, i.e., wines with especially high or low quality.

To remedy this issue, there are a few thoughts:

  1. I could check if there are many outliers or influential points.

  2. For those variables having a non-linear relationship with the y variable, I could try higher order regression, such as quadratic regression. I could also break them into a few intervals and make them categorical variables.

  3. I could test the interaction between each pair of the independent variables, especially those pairs with high correlations from the pairs plot.

  4. I could use non-linear models, such as random forest (rf), support vector machine (SVM), regression trees (rt), neural network (NN).

  5. Wines with especially high or low quality are relatively rare, so a standard model does not put enough weight on them. I could boost their weight when training the model (see the sketch after this list).

I will try some of these ideas in the following sections.
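
As a minimal sketch of idea 5 (an assumed weighting scheme, not something done in this analysis), rare quality scores could be up-weighted in a weighted least-squares fit:

# sketch only: up-weight wines whose quality is far from the median (assumed weighting scheme)
w = 1 + abs(train.norm$quality - median(train.norm$quality))
tr.lm.weighted = lm(quality ~ ., data = train.norm, weights = w)
tr.lm.weighted.pred = predict(tr.lm.weighted, test.norm[,-p])
eval(tr.lm.weighted.pred, test.norm$quality)  # compare rmse/mae/cor against the unweighted fit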


3. Outlier/influential point detection

par(mfrow=c(2,3))
lapply(1:6, function(x) plot(tr.lm, which=x, labels.id= 1:nrow(train.norm))) %>% invisible()

par(mfrow=c(1,1))

From the plots above, observations #245, #984, and #2466 appear as outliers in most of the plots, so I will remove them in the following analysis.

rm = c(245, 984, 2466)
removed = train.norm[rm, ];removed # these observations will be removed from the training set
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1932     0.3173077        0.4019608   0.1325301     0.02147239 0.1047904          0.50348432
## 4746     0.2211538        0.1764706   0.1506024     0.03527607 0.1047904          1.00000000
## 2782     0.3846154        0.8676471   0.3614458     1.00000000 0.1856287          0.02090592
##      total.sulfur.dioxide   density        pH sulphates   alcohol quality
## 1932            0.6925754 0.1019857 0.4770642 0.1898734 0.4838710       3
## 4746            1.0000000 0.1162522 0.6605505 0.5316456 0.4032258       3
## 2782            0.3503480 1.0000000 0.6146789 0.5949367 0.5967742       6
train.norm = train.norm[-rm, ]
tr.lm.rmoutlier = lm(quality~., data = train.norm)
summary(tr.lm.rmoutlier)
## 
## Call:
## lm(formula = quality ~ ., data = train.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3832 -0.5032 -0.0431  0.4640  3.0939 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.723668   0.122950  46.553  < 2e-16 ***
## fixed.acidity          1.367609   0.250063   5.469 4.78e-08 ***
## volatile.acidity      -1.911980   0.122176 -15.649  < 2e-16 ***
## citric.acid           -0.036915   0.166163  -0.222 0.824200    
## residual.sugar         6.836281   0.591479  11.558  < 2e-16 ***
## chlorides             -0.031412   0.196154  -0.160 0.872778    
## free.sulfur.dioxide    1.482441   0.265093   5.592 2.38e-08 ***
## total.sulfur.dioxide  -0.007955   0.174900  -0.045 0.963723    
## density              -11.583903   1.260367  -9.191  < 2e-16 ***
## pH                     0.987459   0.127692   7.733 1.29e-14 ***
## sulphates              0.594696   0.085286   6.973 3.57e-12 ***
## alcohol                0.688053   0.188469   3.651 0.000265 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7503 on 4393 degrees of freedom
## Multiple R-squared:  0.2936, Adjusted R-squared:  0.2918 
## F-statistic:   166 on 11 and 4393 DF,  p-value: < 2.2e-16

By removing only 3 of the more than 4000 observations, the \(R^2\) increases from 0.286 to 0.294, so these outliers have a noticeable effect on the model.

4. Polynomial Regression

I will fit a quadratic regression model by feeding \(x + x^2\) into the model for each independent variable. As the model summary shows, compared with the linear model the \(R^2\) increases and both RMSE and MAE decrease, but \(R^2 = 0.325\) is still not very satisfying.

# 2nd order regression (quadratic model)
tr.qm = lm(quality~ poly(fixed.acidity, 2) + 
                     poly(volatile.acidity,2) + 
                     poly(citric.acid,2) + 
                     poly(residual.sugar,2) +  
                     poly(chlorides,2) + 
                     poly(free.sulfur.dioxide,2) +
                     poly(total.sulfur.dioxide,2) + 
                     poly(density,2) + 
                     poly(pH,2) + 
                     poly(sulphates,2) + 
                     poly(alcohol,2), 
           data = train.norm)
summary(tr.qm)
## 
## Call:
## lm(formula = quality ~ poly(fixed.acidity, 2) + poly(volatile.acidity, 
##     2) + poly(citric.acid, 2) + poly(residual.sugar, 2) + poly(chlorides, 
##     2) + poly(free.sulfur.dioxide, 2) + poly(total.sulfur.dioxide, 
##     2) + poly(density, 2) + poly(pH, 2) + poly(sulphates, 2) + 
##     poly(alcohol, 2), data = train.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.3719 -0.4980 -0.0193  0.4428  3.1643 
## 
## Coefficients:
##                                 Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                      5.88294    0.01108 530.850  < 2e-16 ***
## poly(fixed.acidity, 2)1          8.23372    1.34170   6.137 9.16e-10 ***
## poly(fixed.acidity, 2)2         -3.75624    0.76762  -4.893 1.03e-06 ***
## poly(volatile.acidity, 2)1     -12.01445    0.83110 -14.456  < 2e-16 ***
## poly(volatile.acidity, 2)2       2.64353    0.77623   3.406 0.000666 ***
## poly(citric.acid, 2)1           -0.19872    0.79667  -0.249 0.803034    
## poly(citric.acid, 2)2           -2.98193    0.77025  -3.871 0.000110 ***
## poly(residual.sugar, 2)1        31.37846    2.99353  10.482  < 2e-16 ***
## poly(residual.sugar, 2)2        -8.72643    2.12088  -4.115 3.95e-05 ***
## poly(chlorides, 2)1             -0.89190    0.83490  -1.068 0.285457    
## poly(chlorides, 2)2              1.60890    0.80462   2.000 0.045607 *  
## poly(free.sulfur.dioxide, 2)1    4.30472    1.00592   4.279 1.91e-05 ***
## poly(free.sulfur.dioxide, 2)2   -4.20019    0.88676  -4.737 2.24e-06 ***
## poly(total.sulfur.dioxide, 2)1  -0.64706    1.14915  -0.563 0.573412    
## poly(total.sulfur.dioxide, 2)2  -5.26681    0.88050  -5.982 2.39e-09 ***
## poly(density, 2)1              -42.88176    4.65481  -9.212  < 2e-16 ***
## poly(density, 2)2               12.96425    2.25765   5.742 9.97e-09 ***
## poly(pH, 2)1                    10.00738    1.17318   8.530  < 2e-16 ***
## poly(pH, 2)2                     1.45152    0.75887   1.913 0.055846 .  
## poly(sulphates, 2)1              5.75641    0.79662   7.226 5.84e-13 ***
## poly(sulphates, 2)2             -0.48247    0.76401  -0.631 0.527747    
## poly(alcohol, 2)1                5.30659    2.52432   2.102 0.035594 *  
## poly(alcohol, 2)2                3.40788    0.82761   4.118 3.90e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7358 on 4385 degrees of freedom
## Multiple R-squared:  0.3251, Adjusted R-squared:  0.3218 
## F-statistic: 96.03 on 22 and 4385 DF,  p-value: < 2.2e-16
tr.qm.pred = predict(tr.qm, test.norm[,-p])
tr.qm.eval = eval(tr.qm.pred, test.norm$quality, plot=T, title="quadratic model: ");unlist(tr.qm.eval)

##      rmse       mae       cor 
## 0.6865258 0.5438849 0.5354802

5. Categorize continuous covariates (ANOVA)

I will categorize each independent variable according to its quantiles, splitting each variable into 6 categories:

0: below 10%

1: 10% to 25%

2: 25% to median

3: median to 75%

4: 75% to 90%

5: above 90%

I then fit a multi-level ANOVA model using the new categorical variables. As the model summary below shows, its performance is very similar to that of the quadratic model.

# categorize covariates by cutoff in the quantiles of c(0.1, 0.25, 0.5, 0.75, 0.9)
low10pct = apply(train.norm, 2, function(x) quantile(x, 0.1))
q1 = apply(train.norm, 2, function(x) quantile(x, 0.25))
q2 = apply(train.norm, 2, function(x) quantile(x, 0.5))
q3 = apply(train.norm, 2, function(x) quantile(x, 0.75))
top10pct = apply(train.norm, 2, function(x) quantile(x, 0.9))

categorize = function(dataset = train.norm) {
  df.cat = dataset
  for (i in 1:(p-1)){
    col = dataset[,i]
    cat = case_when(col<low10pct[i]              ~ "0",
                    col>=low10pct[i] & col<q1[i] ~ "1",
                    col>=q1[i] & col<q2[i]       ~ "2",
                    col>=q2[i] & col<q3[i]       ~ "3",
                    col>=q3[i] & col<top10pct[i] ~ "4",
                    col>=top10pct[i]             ~ "5")
    df.cat[,i] = cat
    }
  return(df.cat)
}
train.cat = categorize(train.norm)
test.cat = categorize(test.norm)
head(train.cat)
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides free.sulfur.dioxide
## 1301             5                2           4              3         3                   4
## 1823             5                5           2              2         1                   2
## 2805             2                2           3              3         0                   1
## 4446             0                4           1              3         1                   2
## 988              4                3           2              3         4                   3
## 4396             2                2           1              4         4                   3
##      total.sulfur.dioxide density pH sulphates alcohol quality
## 1301                    3       3  0         0       2       6
## 1823                    1       1  1         0       5       7
## 2805                    0       2  2         4       3       8
## 4446                    2       2  5         1       3       6
## 988                     3       3  2         2       2       6
## 4396                    3       4  2         5       1       5
tr.cat.lm = lm(quality~., data = train.cat)
summary(tr.cat.lm)
## 
## Call:
## lm(formula = quality ~ ., data = train.cat)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2450 -0.4729 -0.0221  0.4634  3.2177 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            5.297782   0.118161  44.835  < 2e-16 ***
## fixed.acidity1        -0.061804   0.048363  -1.278 0.201346    
## fixed.acidity2         0.031633   0.045541   0.695 0.487335    
## fixed.acidity3         0.044674   0.047553   0.939 0.347552    
## fixed.acidity4         0.118540   0.052870   2.242 0.025005 *  
## fixed.acidity5        -0.006463   0.060968  -0.106 0.915582    
## volatile.acidity1     -0.094588   0.048974  -1.931 0.053499 .  
## volatile.acidity2     -0.284745   0.046121  -6.174 7.27e-10 ***
## volatile.acidity3     -0.443219   0.046554  -9.521  < 2e-16 ***
## volatile.acidity4     -0.451411   0.049706  -9.082  < 2e-16 ***
## volatile.acidity5     -0.539100   0.055774  -9.666  < 2e-16 ***
## citric.acid1           0.121115   0.047380   2.556 0.010615 *  
## citric.acid2           0.376397   0.046651   8.068 9.13e-16 ***
## citric.acid3           0.265753   0.046331   5.736 1.04e-08 ***
## citric.acid4           0.198994   0.048989   4.062 4.95e-05 ***
## citric.acid5           0.183801   0.051776   3.550 0.000389 ***
## residual.sugar1        0.130548   0.052593   2.482 0.013094 *  
## residual.sugar2        0.327616   0.052249   6.270 3.95e-10 ***
## residual.sugar3        0.528849   0.060597   8.727  < 2e-16 ***
## residual.sugar4        0.781175   0.076252  10.245  < 2e-16 ***
## residual.sugar5        0.910234   0.091810   9.914  < 2e-16 ***
## chlorides1             0.005692   0.046706   0.122 0.903002    
## chlorides2            -0.091877   0.045153  -2.035 0.041931 *  
## chlorides3            -0.130462   0.047873  -2.725 0.006453 ** 
## chlorides4            -0.192611   0.051705  -3.725 0.000198 ***
## chlorides5            -0.214358   0.056921  -3.766 0.000168 ***
## free.sulfur.dioxide1   0.288731   0.048612   5.939 3.08e-09 ***
## free.sulfur.dioxide2   0.445220   0.046431   9.589  < 2e-16 ***
## free.sulfur.dioxide3   0.481210   0.049009   9.819  < 2e-16 ***
## free.sulfur.dioxide4   0.468402   0.055183   8.488  < 2e-16 ***
## free.sulfur.dioxide5   0.417932   0.060993   6.852 8.30e-12 ***
## total.sulfur.dioxide1  0.147315   0.048440   3.041 0.002370 ** 
## total.sulfur.dioxide2  0.170678   0.047244   3.613 0.000306 ***
## total.sulfur.dioxide3  0.144391   0.051214   2.819 0.004834 ** 
## total.sulfur.dioxide4  0.062783   0.058621   1.071 0.284231    
## total.sulfur.dioxide5  0.012124   0.065750   0.184 0.853716    
## density1              -0.115054   0.053491  -2.151 0.031539 *  
## density2              -0.264043   0.064328  -4.105 4.12e-05 ***
## density3              -0.609737   0.083033  -7.343 2.47e-13 ***
## density4              -0.895847   0.106250  -8.431  < 2e-16 ***
## density5              -0.934590   0.127772  -7.315 3.06e-13 ***
## pH1                   -0.006227   0.047061  -0.132 0.894738    
## pH2                    0.031411   0.044958   0.699 0.484786    
## pH3                    0.038228   0.046640   0.820 0.412465    
## pH4                    0.186123   0.051526   3.612 0.000307 ***
## pH5                    0.242725   0.058466   4.152 3.37e-05 ***
## sulphates1             0.059732   0.048340   1.236 0.216645    
## sulphates2             0.079923   0.047004   1.700 0.089137 .  
## sulphates3             0.126369   0.046352   2.726 0.006431 ** 
## sulphates4             0.191432   0.050493   3.791 0.000152 ***
## sulphates5             0.212021   0.054382   3.899 9.81e-05 ***
## alcohol1              -0.079556   0.053299  -1.493 0.135611    
## alcohol2              -0.070079   0.056975  -1.230 0.218768    
## alcohol3               0.110158   0.063874   1.725 0.084670 .  
## alcohol4               0.276248   0.075264   3.670 0.000245 ***
## alcohol5               0.568417   0.088273   6.439 1.33e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7317 on 4352 degrees of freedom
## Multiple R-squared:  0.3376, Adjusted R-squared:  0.3293 
## F-statistic: 40.33 on 55 and 4352 DF,  p-value: < 2.2e-16
tr.cat.lm.pred = predict(tr.cat.lm, test.cat[,-p])
tr.cat.lm.eval = eval(tr.cat.lm.pred, test.cat$quality, plot=T, title="ANOVA: ");unlist(tr.cat.lm.eval)

##      rmse       mae       cor 
## 0.6808479 0.5421466 0.5484151

6. Variable interaction and variable selection

I will further examine the pairwise interactions between the independent variables. Given the relatively large training set and the relatively small number of covariates, I can examine all pairwise interactions at once. After feeding all interaction pairs into the model, I will perform variable selection using the stepwise method.

tr.lm.interract = lm(quality~ .^2, data = train.norm)
summary(tr.lm.interract)
## 
## Call:
## lm(formula = quality ~ .^2, data = train.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.4333 -0.4898 -0.0191  0.4357  3.0801 
## 
## Coefficients:
##                                           Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                                6.07864    0.71660   8.483  < 2e-16 ***
## fixed.acidity                              5.24962    1.79061   2.932 0.003388 ** 
## volatile.acidity                          -6.65602    1.06780  -6.233 5.00e-10 ***
## citric.acid                               -0.57329    1.75719  -0.326 0.744246    
## residual.sugar                            16.79688    3.33288   5.040 4.85e-07 ***
## chlorides                                 -0.21317    2.42545  -0.088 0.929968    
## free.sulfur.dioxide                       -7.85107    2.73798  -2.867 0.004158 ** 
## total.sulfur.dioxide                       3.28613    1.59292   2.063 0.039176 *  
## density                                  -21.93411    6.06909  -3.614 0.000305 ***
## pH                                         1.77909    0.97573   1.823 0.068321 .  
## sulphates                                  1.70433    0.89972   1.894 0.058253 .  
## alcohol                                   -1.86535    1.00300  -1.860 0.062986 .  
## fixed.acidity:volatile.acidity            -5.46377    2.25859  -2.419 0.015599 *  
## fixed.acidity:citric.acid                 -5.06638    3.38202  -1.498 0.134197    
## fixed.acidity:residual.sugar               8.69740    4.43860   1.959 0.050119 .  
## fixed.acidity:chlorides                   -9.31087    5.09144  -1.829 0.067508 .  
## fixed.acidity:free.sulfur.dioxide          7.66103    5.58028   1.373 0.169862    
## fixed.acidity:total.sulfur.dioxide        -4.12905    3.28640  -1.256 0.209037    
## fixed.acidity:density                    -15.14948    8.63142  -1.755 0.079303 .  
## fixed.acidity:pH                           3.21056    1.07112   2.997 0.002738 ** 
## fixed.acidity:sulphates                    1.22283    1.58411   0.772 0.440197    
## fixed.acidity:alcohol                     -1.93108    1.80006  -1.073 0.283426    
## volatile.acidity:citric.acid               2.03214    1.48243   1.371 0.170504    
## volatile.acidity:residual.sugar          -10.94961    4.26643  -2.566 0.010308 *  
## volatile.acidity:chlorides                 1.22016    2.19746   0.555 0.578747    
## volatile.acidity:free.sulfur.dioxide       3.11721    2.71744   1.147 0.251399    
## volatile.acidity:total.sulfur.dioxide      0.45197    1.63279   0.277 0.781943    
## volatile.acidity:density                  26.19721    8.34878   3.138 0.001713 ** 
## volatile.acidity:pH                        0.25759    1.26190   0.204 0.838262    
## volatile.acidity:sulphates                -0.87199    0.87996  -0.991 0.321773    
## volatile.acidity:alcohol                   7.81496    1.28031   6.104 1.12e-09 ***
## citric.acid:residual.sugar                -8.26222    7.92146  -1.043 0.296998    
## citric.acid:chlorides                     -0.83501    2.27321  -0.367 0.713393    
## citric.acid:free.sulfur.dioxide            1.73601    3.58565   0.484 0.628299    
## citric.acid:total.sulfur.dioxide          -1.89037    2.17663  -0.868 0.385176    
## citric.acid:density                       11.27982   17.26019   0.654 0.513458    
## citric.acid:pH                             2.47834    1.82024   1.362 0.173411    
## citric.acid:sulphates                     -1.10665    1.31364  -0.842 0.399594    
## citric.acid:alcohol                        1.37612    2.59093   0.531 0.595355    
## residual.sugar:chlorides                 -39.06460   11.32822  -3.448 0.000569 ***
## residual.sugar:free.sulfur.dioxide       -20.71341   12.74082  -1.626 0.104075    
## residual.sugar:total.sulfur.dioxide       -7.35648    7.45304  -0.987 0.323676    
## residual.sugar:density                    -4.77874    3.38649  -1.411 0.158281    
## residual.sugar:pH                         -0.87727    2.65244  -0.331 0.740856    
## residual.sugar:sulphates                  -3.44287    3.70841  -0.928 0.353254    
## residual.sugar:alcohol                     0.60194    1.99708   0.301 0.763116    
## chlorides:free.sulfur.dioxide              6.99704    4.18534   1.672 0.094636 .  
## chlorides:total.sulfur.dioxide            -6.18974    3.64383  -1.699 0.089449 .  
## chlorides:density                         66.66484   23.97168   2.781 0.005443 ** 
## chlorides:pH                              -8.18641    2.40929  -3.398 0.000685 ***
## chlorides:sulphates                       -4.01208    1.78780  -2.244 0.024873 *  
## chlorides:alcohol                          4.64300    3.67653   1.263 0.206702    
## free.sulfur.dioxide:total.sulfur.dioxide -11.04844    1.20751  -9.150  < 2e-16 ***
## free.sulfur.dioxide:density               30.81759   26.03299   1.184 0.236561    
## free.sulfur.dioxide:pH                     1.29952    2.76443   0.470 0.638316    
## free.sulfur.dioxide:sulphates              6.45906    1.77946   3.630 0.000287 ***
## free.sulfur.dioxide:alcohol               10.40772    4.13162   2.519 0.011803 *  
## total.sulfur.dioxide:density              20.92010   14.57759   1.435 0.151335    
## total.sulfur.dioxide:pH                   -2.69600    1.89144  -1.425 0.154123    
## total.sulfur.dioxide:sulphates            -5.70039    1.32263  -4.310 1.67e-05 ***
## total.sulfur.dioxide:alcohol               2.24932    2.28701   0.984 0.325408    
## density:pH                                -8.59629    5.63886  -1.524 0.127464    
## density:sulphates                          1.78314    7.68131   0.232 0.816440    
## density:alcohol                           -6.33866    2.65750  -2.385 0.017113 *  
## pH:sulphates                               1.10848    0.84065   1.319 0.187374    
## pH:alcohol                                -0.03463    1.05015  -0.033 0.973694    
## sulphates:alcohol                         -0.65384    1.14231  -0.572 0.567088    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7192 on 4339 degrees of freedom
## Multiple R-squared:  0.3612, Adjusted R-squared:  0.3515 
## F-statistic: 37.17 on 66 and 4339 DF,  p-value: < 2.2e-16
# variable selection using stepwise methods
lm0 = lm(quality ~ 1, data = train.norm)
tr.lm.interract.step = step(lm0, ~ (fixed.acidity + volatile.acidity + 
                      citric.acid + residual.sugar +  chlorides + free.sulfur.dioxide +
                      total.sulfur.dioxide + density + pH + sulphates + alcohol)^2, 
                    direction = "both", trace = 0)
summary(tr.lm.interract.step)
## 
## Call:
## lm(formula = quality ~ alcohol + volatile.acidity + residual.sugar + 
##     free.sulfur.dioxide + density + sulphates + pH + fixed.acidity + 
##     alcohol:volatile.acidity + alcohol:free.sulfur.dioxide + 
##     alcohol:pH + free.sulfur.dioxide:pH + pH:fixed.acidity + 
##     volatile.acidity:density + alcohol:density + free.sulfur.dioxide:fixed.acidity + 
##     residual.sugar:sulphates + residual.sugar:free.sulfur.dioxide + 
##     sulphates:pH + density:pH + volatile.acidity:sulphates + 
##     alcohol:residual.sugar + alcohol:fixed.acidity + volatile.acidity:residual.sugar + 
##     free.sulfur.dioxide:density + volatile.acidity:fixed.acidity, 
##     data = train.norm)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.5507 -0.4919 -0.0159  0.4463  3.0600 
## 
## Coefficients:
##                                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                          7.60805    0.42637  17.844  < 2e-16 ***
## alcohol                             -2.44638    0.64584  -3.788 0.000154 ***
## volatile.acidity                    -5.82928    0.63713  -9.149  < 2e-16 ***
## residual.sugar                       9.57116    1.63763   5.845 5.45e-09 ***
## free.sulfur.dioxide                 -6.07640    2.07232  -2.932 0.003383 ** 
## density                            -11.82430    3.20925  -3.684 0.000232 ***
## sulphates                            0.57740    0.31707   1.821 0.068663 .  
## pH                                   0.01711    0.66194   0.026 0.979382    
## fixed.acidity                       -1.26703    0.77222  -1.641 0.100918    
## alcohol:volatile.acidity             7.14474    0.78436   9.109  < 2e-16 ***
## alcohol:free.sulfur.dioxide         13.14449    3.06526   4.288 1.84e-05 ***
## alcohol:pH                           1.58120    0.77714   2.035 0.041946 *  
## free.sulfur.dioxide:pH              -5.72802    2.09939  -2.728 0.006389 ** 
## pH:fixed.acidity                     4.91345    0.95413   5.150 2.72e-07 ***
## volatile.acidity:density            20.74528    4.61036   4.500 6.98e-06 ***
## alcohol:density                    -11.46318    2.29834  -4.988 6.35e-07 ***
## free.sulfur.dioxide:fixed.acidity    5.91850    4.02891   1.469 0.141903    
## residual.sugar:sulphates            -3.14400    1.10898  -2.835 0.004603 ** 
## residual.sugar:free.sulfur.dioxide -24.17534    9.30787  -2.597 0.009427 ** 
## sulphates:pH                         1.39014    0.54367   2.557 0.010593 *  
## density:pH                          -6.56424    2.56322  -2.561 0.010472 *  
## volatile.acidity:sulphates          -1.87283    0.71996  -2.601 0.009319 ** 
## alcohol:residual.sugar               4.54911    1.67718   2.712 0.006707 ** 
## alcohol:fixed.acidity                1.66326    0.96210   1.729 0.083920 .  
## volatile.acidity:residual.sugar     -6.97992    3.12919  -2.231 0.025759 *  
## free.sulfur.dioxide:density         37.00826   19.08737   1.939 0.052579 .  
## volatile.acidity:fixed.acidity      -2.14583    1.39820  -1.535 0.124927    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.7325 on 4379 degrees of freedom
## Multiple R-squared:  0.3311, Adjusted R-squared:  0.3272 
## F-statistic: 83.38 on 26 and 4379 DF,  p-value: < 2.2e-16
tr.lm.interract.step.pred = predict(tr.lm.interract.step, test.norm[,-p])
tr.lm.interract.step.eval = eval(tr.lm.interract.step.pred, test.norm$quality, plot=T, title="lm with interaction and var selection: ");unlist(tr.lm.interract.step.eval)

##      rmse       mae       cor 
## 0.7073700 0.5441501 0.5004769

7. Random Forest

I will build a random forest (rf) model using the randomForest function from the randomForest package. This model is an ensemble of 1000 decision trees, with mtry = sqrt(p) (about 3) variables considered at each split.

tr.rf = randomForest(quality~., data = train.norm, ntree = 1000, mtry = sqrt(p))
tr.rf
## 
## Call:
##  randomForest(formula = quality ~ ., data = train.norm, ntree = 1000,      mtry = sqrt(p)) 
##                Type of random forest: regression
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##           Mean of squared residuals: 0.3657041
##                     % Var explained: 54.13
tr.rf.pred = predict(tr.rf, test.norm[,-p])
tr.rf.eval = eval(tr.rf.pred, test.norm$quality, plot = T, title = "Random Forest: "); unlist(tr.rf.eval)

##      rmse       mae       cor 
## 0.5144320 0.3727410 0.7772215

I will use the train function from the caret package to automatically tune model hyperparameters. Here I am using 10-fold cross-validation (cv) with 2 repeats, trying 2, 3, or 6 variables at each split, and using RMSE to compare model performance. Due to limited computing power, I only try a few combinations. The available methods for the train function can be listed by typing names(getModelInfo()).

Since this step will be computationally expensive, I will use parallel computation from the doParallel package.

ct = trainControl(method = "repeatedcv", number = 10, repeats = 2)
grid_rf = expand.grid(.mtry = c(2, 3, 6))

set.seed(1)
cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr.cvrf = train(quality~., data = train.norm,
                method = 'rf',
                metric = "RMSE",
                trControl = ct,
                tuneGrid = grid_rf)
stopCluster(cl)
save(tr.cvrf, file = "~/Downloads/wine_train_cvrf.RData")
load(file = "~/Downloads/wine_train_cvrf.RData"); tr.cvrf
## Random Forest 
## 
## 4405 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 3965, 3964, 3965, 3964, 3964, 3965, ... 
## Resampling results across tuning parameters:
## 
##   mtry  RMSE       Rsquared   MAE      
##   2     0.6104983  0.5430114  0.4448273
##   3     0.6093373  0.5416098  0.4417419
##   6     0.6096012  0.5383133  0.4403383
## 
## RMSE was used to select the optimal model using the smallest value.
## The final value used for the model was mtry = 3.
tr.cvrf.pred = predict(tr.cvrf, test.norm[,-p])
tr.cvrf.eval = eval(tr.cvrf.pred, test.norm$quality, plot = T, title = "Random Forest with CV: "); unlist(tr.cvrf.eval)

##      rmse       mae       cor 
## 0.5144424 0.3724882 0.7772617

A comparison of linear regression, quadratic regression, ANOVA, and random forest is shown in the table below. The random forest model performs much better than the other models:

knitr::kable(cbind(lm = unlist(tr.lm.eval),
                   quadratic = unlist(tr.qm.eval),
                   ANOVA = unlist(tr.cat.lm.eval), 
                   rf = unlist(tr.rf.eval),
                   rf.cv = unlist(tr.cvrf.eval)) %>% round(3),
             caption = "Comparing linear regression, quadratic regression, ANOVA and random forest")
        lm  quadratic  ANOVA     rf  rf.cv
rmse  0.71      0.687  0.681  0.513  0.516
mae   0.56      0.544  0.542  0.372  0.375
cor   0.49      0.535  0.548  0.778  0.775

8. Support Vector Machine

Next I will fit a support vector machine (SVM) model using the ksvm function from the kernlab package. I will use the default rbfdot radial basis (Gaussian) kernel. Other available kernel options are listed under ?ksvm.

tr.svm = ksvm(quality ~ ., 
              data = train.norm, 
              scaled = F,
              kernel = "rbfdot", 
              C = 1)
tr.svm.pred = predict(tr.svm, test.norm[,-p])
tr.svm.eval = eval(tr.svm.pred, test.norm$quality, plot = T, title = "SVM: "); unlist(tr.svm.eval)

##      rmse       mae       cor 
## 0.6379821 0.4767281 0.6185084

Similarly, I will use the train function to optimize model hyperparameters. This step will be computationally expensive.

cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr = trainControl(method = "repeatedcv", number = 10, repeats = 2)
set.seed(1)
tr.svmRadial = train(quality ~.,
                    data = train.norm,
                    method = "svmRadial",
                    trControl=tr,
                    preProcess = NULL,
                    tuneLength = 10)
stopCluster(cl)
save(tr.svmRadial, file="~/Downloads/wine_train_cv_svmRadial.RData")
load(file = "~/Downloads/wine_train_cv_svmRadial.RData"); tr.svmRadial
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 4405 samples
##   11 predictor
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 3965, 3964, 3965, 3964, 3964, 3965, ... 
## Resampling results across tuning parameters:
## 
##   C       RMSE       Rsquared   MAE      
##     0.25  0.7075534  0.3734247  0.5371903
##     0.50  0.7010562  0.3846698  0.5312537
##     1.00  0.6959988  0.3934084  0.5257879
##     2.00  0.6934331  0.3984556  0.5218951
##     4.00  0.6922973  0.4019772  0.5185848
##     8.00  0.6928832  0.4038856  0.5163878
##    16.00  0.7022211  0.3958122  0.5202775
##    32.00  0.7162806  0.3852693  0.5253449
##    64.00  0.7363789  0.3711688  0.5317107
##   128.00  0.7661621  0.3526886  0.5423699
## 
## Tuning parameter 'sigma' was held constant at a value of 0.08049664
## RMSE was used to select the optimal model using the smallest value.
## The final values used for the model were sigma = 0.08049664 and C = 4.
tr.svmRadial.pred = predict(tr.svmRadial, test.norm[,-p])
tr.svmRadial.eval = eval(tr.svmRadial.pred, test.norm$quality, plot = T, title = "SVM with CV: "); unlist(tr.svmRadial.eval)

##      rmse       mae       cor 
## 0.6285840 0.4672283 0.6370404

It seems that the SVM model with a Gaussian kernel does not perform as well as the random forest model.

 knitr::kable(cbind(lm = unlist(tr.lm.eval),
                   quadratic = unlist(tr.qm.eval),
                   ANOVA = unlist(tr.cat.lm.eval), 
                   rf = unlist(tr.rf.eval),
                   rf.cv = unlist(tr.cvrf.eval),
                   svm = unlist(tr.svm.eval),
                  svm.cv = unlist(tr.svmRadial.eval)) %>% round(3),
              caption = "")
        lm  quadratic  ANOVA     rf  rf.cv    svm  svm.cv
rmse  0.71      0.687  0.681  0.513  0.516  0.638   0.628
mae   0.56      0.544  0.542  0.372  0.375  0.477   0.467
cor   0.49      0.535  0.548  0.778  0.775  0.619   0.638

9. Regression Tree

Next I will fit a regression tree using the rpart function from the rpart package.

tr.rpart = rpart(quality~., data=train.norm)
rpart.plot(tr.rpart)  

tr.rpart.pred = predict(tr.rpart, test.norm[,-p])
tr.rpart.eval = eval(tr.rpart.pred, test.norm$quality, plot=T, title = "Regression Tree:"); unlist(tr.rpart.eval)

##      rmse       mae       cor 
## 0.7022890 0.5609858 0.5052963

10. Neural Network

I will train a simple neural network with one hidden layer of 2 neurons and the sigmoid (logistic) activation function, using the neuralnet function from the neuralnet package.

tr.nn = neuralnet(quality ~ fixed.acidity + volatile.acidity + citric.acid +
                      residual.sugar +  chlorides + free.sulfur.dioxide +
                      total.sulfur.dioxide + density + pH + sulphates + alcohol,
                    data=train.norm,
                    hidden = 2,
                    act.fct = "logistic")
save(tr.nn, file="~/Downloads/wine_train_nn.RData")
load(file="~/Downloads/wine_train_nn.RData")
plot(tr.nn) # for some reason, the plot doesn't show up here, so I am plotting it separately below.
NeuralNet Plot

tr.nn.pred = compute(tr.nn, test.norm[,-p])
tr.nn.pred = tr.nn.pred$net.result
tr.nn.eval = eval(tr.nn.pred, test.norm$quality, plot=T, title = "Neural Net: "); unlist(tr.nn.eval)

##      rmse       mae       cor 
## 0.6722937 0.5306325 0.5618459

11. Neural Network using keras

I will use the keras package (with a TensorFlow backend) to build a neural network with three hidden layers of 64, 32, and 16 neurons, followed by a single-neuron output layer. I will use the relu activation function and minimize the mean squared error (mse) loss. Similar to the SVM model, this model requires substantial effort to tune its hyperparameters. As the model performance below shows, the hyperparameters used here do not give a great fit; I will try other hyperparameters in the future to improve performance.

# see more instructions under: https://keras.rstudio.com
# devtools::install_github("rstudio/keras")
library(keras); #install_keras()
cl = makePSOCKcluster(4)
registerDoParallel(cl)
build_model = function() {
  model = keras_model_sequential() %>%
    layer_dense(units = 64, activation = "relu",
                input_shape = dim(train.norm[,-p])[2]) %>%
    layer_dense(units = 32, activation = "relu") %>%
    layer_dense(units = 16, activation = "relu") %>%
    layer_dense(units = 1)
  model %>% compile(
    loss = "mse",
    optimizer = optimizer_rmsprop(),
    metrics = list("mean_absolute_error")
  )
  model
}

model = build_model()
model %>% summary() # show the network structure
## ____________________________________________________________________________________________________
## Layer (type)                                 Output Shape                            Param #        
## ====================================================================================================
## dense_1 (Dense)                              (None, 64)                              768            
## ____________________________________________________________________________________________________
## dense_2 (Dense)                              (None, 32)                              2080           
## ____________________________________________________________________________________________________
## dense_3 (Dense)                              (None, 16)                              528            
## ____________________________________________________________________________________________________
## dense_4 (Dense)                              (None, 1)                               17             
## ====================================================================================================
## Total params: 3,393
## Trainable params: 3,393
## Non-trainable params: 0
## ____________________________________________________________________________________________________
#layer1 number of parameter = 11 * 64 + 64 = 768
#layer2 number of parameter = 64 * 32 + 32 = 2080
#layer3 number of parameter = 32 * 16 + 16 = 528
#layer4 number of parameter = 16 * 1  + 1  = 17

# Display training progress by printing a single dot for each completed epoch.
print_dot_callback = callback_lambda(
  on_epoch_end = function(epoch, logs) {
    if (epoch %% 100 == 0) cat("\n")
    cat(".")
  }
)    
epochs = 500

# Fit the model and store training stats
history = model %>% fit(
  train.norm[,-p] %>% as.matrix(),
  train.norm[,p],
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(print_dot_callback)
)
## 
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
## ....................................................................................................
# plot the training history
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(ylim = c(0, 1))

As we can see from the training history above, although the training MAE keeps decreasing, the validation MAE stops decreasing before epoch 200. So in the next step, I will add early stopping, which helps remedy overfitting.

early_stop = callback_early_stopping(monitor = "val_loss", patience = 50)

model = build_model()
history = model %>% fit(
  train.norm[,-p] %>% as.matrix(),
  train.norm[,p],
  epochs = epochs,
  validation_split = 0.2,
  verbose = 0,
  callbacks = list(early_stop, print_dot_callback)
)
## 
## ....................................................................................................
## .................................................................................
# plot the training history
plot(history, metrics = "mean_absolute_error", smooth = FALSE) +
  coord_cartesian(xlim = c(0, 300), ylim = c(0, 1))

stopCluster(cl)

I will then make predictions and evaluate model performance using the eval() function.

# evaluate model performance using test set
tr.nnkeras.pred = model %>% predict(test.norm[, -p] %>% as.matrix())
tr.nnkeras.pred = tr.nnkeras.pred[ , 1]

tr.nnkeras.eval = eval(tr.nnkeras.pred, test.norm$quality, plot = T, title = "Neural Net using keras: "); unlist(tr.nnkeras.eval)

##      rmse       mae       cor 
## 0.6693359 0.5229692 0.5889432

12. Categorical Y variable

Considering that the y variable is an integer with fewer than 10 distinct values, I will try treating it as a categorical variable and fit a random forest classification model. For simplicity, I will not normalize the datasets.

df = raw %>%
  mutate(quality = as.factor(quality))

set.seed(1) #split into train & test set 
idx = createDataPartition(df$quality, p = 0.9, list=F)
train = df[idx,]; table(train$quality)
## 
##    3    4    5    6    7    8    9 
##   18  147 1312 1979  792  158    5
test = df[-idx,]; table(test$quality)
## 
##   3   4   5   6   7   8   9 
##   2  16 145 219  88  17   0
# Visualizing the relationship between the Y variable (quality) and the other covariates
long = gather(train, key = "features", value = "value", 1:(p-1)) 
ggplot(long, aes(x=quality, y= value)) + 
  geom_boxplot(aes(fill = quality)) + 
  coord_flip() +
  facet_wrap(~ features, ncol = 3, scales = "free") +
  labs(title = "Visualizing relationship of the Y variable (quality) and other covariants")

# fit a random forest model
tr.rf.fac= randomForest(quality~., data=train, ntree=1000, mtry=sqrt(p))
tr.rf.fac
## 
## Call:
##  randomForest(formula = quality ~ ., data = train, ntree = 1000,      mtry = sqrt(p)) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 29.43%
## Confusion matrix:
##   3  4   5    6   7  8 9 class.error
## 3 0  0   7   11   0  0 0   1.0000000
## 4 0 33  75   38   1  0 0   0.7755102
## 5 0  9 930  366   7  0 0   0.2911585
## 6 0  3 234 1636 102  4 0   0.1733199
## 7 0  0  15  328 443  6 0   0.4406566
## 8 0  0   1   53  33 71 0   0.5506329
## 9 0  0   0    3   2  0 0   1.0000000
# evaluate model performance using the test set
tr.rf.fac.prob = predict(tr.rf.fac, test[,-p])
confM = confusionMatrix(tr.rf.fac.prob, test[,p])
# Visualize the confusion matrix table
long = confM$table %>% tbl_df()
ggplot(long, aes(Reference, Prediction)) + 
  geom_tile(aes(fill = log1p(n)), color="black") + 
  geom_text(aes(label = n)) + 
  scale_fill_gradient(low = "white", high = "steelblue") + 
  guides(fill = FALSE) + 
  labs(title = "Confusion Matrix")

From the confusion table above, although the overall accuracy is only around 70%, most of the misclassified observations fall within a fairly close range of the true quality level. Again, the model does not do well on extreme cases: of the wines with quality 8, only 6 out of (5 + 6 + 6) = 17 observations are classified correctly, and the 2 wines with quality 3 are both classified as quality 6. There are no wines with quality 9 in the test set because such wines are rare, but I could have assigned a few quality-9 observations to the test set to check the model's performance on them (see the sketch below).
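
As a minimal sketch of that last idea (hypothetical code, not run in this analysis), a couple of the five quality-9 wines could be reserved for the test set before fitting:

# hypothetical sketch: force 2 of the 5 quality-9 wines into the test set (not run here)
set.seed(2)
idx9   = which(df$quality == "9")                # row indices of the quality-9 wines
hold9  = sample(idx9, 2)                         # reserve two of them for testing
train9 = df[setdiff(as.vector(idx), hold9), ]    # training rows, minus the reserved wines
test9  = df[union(setdiff(1:nrow(df), as.vector(idx)), hold9), ]  # test rows plus the reserved wines
table(test9$quality)                             # the test set now contains quality-9 wines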


Summary of Session 1

Putting all the results together in the following table, the random forest model has the best performance: the best MAE is 0.37 and the best RMSE is 0.51. Considering that the wine quality score ranges from 0 to 10, this performance is acceptable. However, none of these models performs well on the extreme cases, i.e., wines with especially high or low quality. To be able to specifically classify wines with superior quality, I will train classification models in the next session.

 knitr::kable(cbind(lm = unlist(tr.lm.eval),
                    quadratic = unlist(tr.qm.eval),
                    ANOVA = unlist(tr.cat.lm.eval),
                    lm.interac = unlist(tr.lm.interract.step.eval),
                    rf = unlist(tr.rf.eval),
                    rf.cv = unlist(tr.cvrf.eval),
                    svm = unlist(tr.svm.eval),
                    svm.cv = unlist(tr.svmRadial.eval),
                    regression.tree = unlist(tr.rpart.eval),
                    neural.net = unlist(tr.nn.eval),
                    neural.net.keras = unlist(tr.nnkeras.eval)) ,
              caption = "Comparing all models from Session 1: ")
Comparing all models from Session 1:
      lm         quadratic  ANOVA      lm.interac  rf         rf.cv      svm        svm.cv     regression.tree  neural.net  neural.net.keras
rmse  0.7097731  0.6865258  0.6808479  0.7073700   0.5144320  0.5144424  0.6379821  0.6285840  0.7022890        0.6722937   0.6693359
mae   0.5602135  0.5438849  0.5421466  0.5441501   0.3727410  0.3724882  0.4767281  0.4672283  0.5609858        0.5306325   0.5229692
cor   0.4901282  0.5354802  0.5484151  0.5004769   0.7772215  0.7772617  0.6185084  0.6370404  0.5052963        0.5618459   0.5889432

Session 2. For the purpose of classification

1. Data processing and visualization

Suppose that my clients have a special requirement to classify wines with quality >=7, defined as excellent wines. In this situation, I will build a classification model to identify excellent wines.

I will create a new binary y variable called “excellent”. Since this label is imbalanced, i.e., only about 1/5 of the wines are excellent, I will use the createDataPartition function to partition the data while preserving the label proportions in the training and test sets. For simplicity, I will not normalize the datasets in this session.

df = raw %>% 
      mutate(excellent = ifelse(quality >= 7, "1", "0") %>% as.factor()) %>%
      select(-quality); dim(df)
## [1] 4898   12
table(df$excellent)
## 
##    0    1 
## 3838 1060
set.seed(1)
idx = createDataPartition(df$excellent, p = 0.9, list=F)

train.x = df[idx, -p] %>% as.matrix(); dim(train.x)
## [1] 4409   11
train.y = df[idx, p]; table(train.y)
## train.y
##    0    1 
## 3455  954
train = data.frame(train.x, excellent=train.y)

test.x = df[-idx, -p] %>% as.matrix(); dim(test.x)
## [1] 489  11
test.y = df[-idx, p]; table(test.y)
## test.y
##   0   1 
## 383 106
test = data.frame(test.x, excellent=test.y)

The following plots show the relationship between the covariates and the binary Y variable. As we can see, excellent wines appear to have smaller variation and fewer outliers in their covariates.

long = gather(train, key = "features", value = "value", 1:(p-1)) 
ggplot(long, aes(x=excellent, y= value)) + 
  geom_boxplot(aes(fill = excellent)) + 
  coord_flip() +
  facet_wrap(~ features, ncol = 3, scales = "free") +
  labs(title = "Visualizing relationship of the Y variable (excellent) and other covariants",
       subtitle = "Note: Wine quality >= 7 is defined to be excellent wine")

Then I will perform a basic principal component analysis (PCA) to visualize the separability of the two classes on the first 2 PCs. From the result below, although the 1st PC accounts for 92.6% of the variance and the first 2 PCs together account for > 99% of the variance, the two classes are not separable on them. This indicates that a model using a linear combination of the independent variables will probably not work well.

pca = prcomp(test[,-p]) # use test set here for easy visualization (less data point overlap)
summary(pca)
## Importance of components:
##                            PC1      PC2     PC3     PC4     PC5     PC6     PC7     PC8     PC9
## Standard deviation     44.8620 11.86605 4.38839 1.05198 0.76440 0.13914 0.11456 0.09686 0.08588
## Proportion of Variance  0.9256  0.06475 0.00886 0.00051 0.00027 0.00001 0.00001 0.00000 0.00000
## Cumulative Proportion   0.9256  0.99034 0.99920 0.99971 0.99998 0.99999 0.99999 1.00000 1.00000
##                           PC10      PC11
## Standard deviation     0.01824 0.0005052
## Proportion of Variance 0.00000 0.0000000
## Cumulative Proportion  1.00000 1.0000000
autoplot(pca, data = test, colour = 'excellent')
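
One caveat (my own note, not a result from the analysis above): prcomp() is run here on unscaled features, so PC1 is dominated by whichever variables have the largest raw variance (likely total.sulfur.dioxide). A scaled PCA could be checked the same way before drawing conclusions about linear separability:

# hypothetical check: PCA on standardized features
pca.scaled = prcomp(test[,-p], scale. = TRUE)
summary(pca.scaled)
autoplot(pca.scaled, data = test, colour = 'excellent')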


2. Logistic Regression

I will start with a basic logistic regression (glm).

tr.glm = glm(excellent ~. , data = train, family = binomial)
summary(tr.glm)
## 
## Call:
## glm(formula = excellent ~ ., family = binomial, data = train)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.0503  -0.6704  -0.4143  -0.1738   2.8420  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           6.058e+02  9.823e+01   6.167 6.95e-10 ***
## fixed.acidity         5.175e-01  9.407e-02   5.501 3.78e-08 ***
## volatile.acidity     -4.040e+00  5.166e-01  -7.820 5.28e-15 ***
## citric.acid          -6.788e-01  4.172e-01  -1.627  0.10368    
## residual.sugar        2.861e-01  3.723e-02   7.683 1.56e-14 ***
## chlorides            -1.275e+01  3.998e+00  -3.190  0.00142 ** 
## free.sulfur.dioxide   7.811e-03  3.261e-03   2.395  0.01660 *  
## total.sulfur.dioxide -1.025e-04  1.586e-03  -0.065  0.94847    
## density              -6.281e+02  9.956e+01  -6.308 2.82e-10 ***
## pH                    3.161e+00  4.470e-01   7.072 1.52e-12 ***
## sulphates             2.291e+00  3.647e-01   6.283 3.33e-10 ***
## alcohol               1.774e-01  1.193e-01   1.487  0.13700    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 4605.5  on 4408  degrees of freedom
## Residual deviance: 3734.5  on 4397  degrees of freedom
## AIC: 3758.5
## 
## Number of Fisher Scoring iterations: 6
tr.glm.prob = predict(tr.glm, data.frame(test.x), type="response")
tr.glm.eval = eval_class(tr.glm.prob, test.y, plot=T, title="Logistic Reg: ");tr.glm.eval

## $auc
## [1] 0.803
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 305  35
##          1  78  71
##                                          
##                Accuracy : 0.7689         
##                  95% CI : (0.729, 0.8056)
##     No Information Rate : 0.7832         
##     P-Value [Acc > NIR] : 0.7957         
##                                          
##                   Kappa : 0.4065         
##  Mcnemar's Test P-Value : 7.782e-05      
##                                          
##             Sensitivity : 0.6698         
##             Specificity : 0.7963         
##          Pos Pred Value : 0.4765         
##          Neg Pred Value : 0.8971         
##              Prevalence : 0.2168         
##          Detection Rate : 0.1452         
##    Detection Prevalence : 0.3047         
##       Balanced Accuracy : 0.7331         
##                                          
##        'Positive' Class : 1              
## 

As we can see from the model summary and plots above, the AUC is 0.803, which is not bad. I choose the cutoff that maximizes the kappa statistic. The resulting model has an overall accuracy of 0.77, sensitivity of 0.67, precision (PPV) of 0.48, and kappa of 0.41.

Note: I don’t necessarily need to use the cutoff that maximizes the kappa statistic; the choice also depends largely on the clients’ requirements. For example, suppose clients care more about precision (PPV) than sensitivity: compared with missing some excellent wines, labeling a non-excellent wine as excellent costs much more, because clients would go through the whole preparation process and still end up with a non-excellent wine. In that case, I would be more stringent, i.e., raise the cutoff and sacrifice some sensitivity for higher precision.
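
A minimal sketch of that idea, assuming the tr.glm.prob and test.y objects above and an illustrative precision target of 0.9 (the target and new object names are for demonstration only; in practice the cutoff should be tuned on training or validation data rather than the test set):

# scan cutoffs and keep the smallest one that reaches the precision target
cuts = seq(0.01, 0.99, by = 0.01)
ppv = sapply(cuts, function(cut) {
  pred = factor(ifelse(tr.glm.prob >= cut, 1, 0), levels = c(0, 1))
  confusionMatrix(pred, test.y, positive = "1")$byClass[["Pos Pred Value"]]
})
opt.cut.ppv = min(cuts[which(ppv >= 0.9)])  # precision-oriented cutoff
opt.cut.ppv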


3. LASSO with Cross Validation

Next, I will build a LASSO model with 10-fold cross-validation using the cv.glmnet function from the glmnet package. As we can see from the summary below, the LASSO model does not outperform the logistic regression model.

set.seed(1)
tr.cvlasso = cv.glmnet(train.x, train.y, 
                       family = "binomial",
                       type.measure = "auc")
coef(tr.cvlasso, s=tr.cvlasso$lambda.1se) # show model coefficients
## 12 x 1 sparse Matrix of class "dgCMatrix"
##                                  1
## (Intercept)          -6.381504e+00
## fixed.acidity         .           
## volatile.acidity     -3.399210e+00
## citric.acid          -2.901506e-01
## residual.sugar        3.631815e-02
## chlorides            -1.275182e+01
## free.sulfur.dioxide   6.831982e-03
## total.sulfur.dioxide -6.136448e-04
## density              -5.368455e+00
## pH                    7.389874e-01
## sulphates             1.049691e+00
## alcohol               8.067396e-01
tr.cvlasso.prob = predict(tr.cvlasso, 
                          test.x, 
                          type="response", 
                          s=tr.cvlasso$lambda.1se)
tr.cvlasso.eval = eval_class(tr.cvlasso.prob, test.y, plot = T, title = "LASSO with CV");tr.cvlasso.eval

## $auc
## [1] 0.795
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 316  40
##          1  67  66
##                                           
##                Accuracy : 0.7812          
##                  95% CI : (0.7419, 0.8171)
##     No Information Rate : 0.7832          
##     P-Value [Acc > NIR] : 0.56933         
##                                           
##                   Kappa : 0.4099          
##  Mcnemar's Test P-Value : 0.01195         
##                                           
##             Sensitivity : 0.6226          
##             Specificity : 0.8251          
##          Pos Pred Value : 0.4962          
##          Neg Pred Value : 0.8876          
##              Prevalence : 0.2168          
##          Detection Rate : 0.1350          
##    Detection Prevalence : 0.2720          
##       Balanced Accuracy : 0.7239          
##                                           
##        'Positive' Class : 1               
## 
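
The predictions above use lambda.1se, i.e., the most regularized model within one standard error of the best cross-validated AUC. As a quick additional check (not part of the original analysis), lambda.min could be scored the same way; it keeps more non-zero coefficients and sometimes performs slightly better:

tr.cvlasso.prob.min = predict(tr.cvlasso, test.x, type = "response",
                              s = tr.cvlasso$lambda.min)
eval_class(tr.cvlasso.prob.min, test.y)$auc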

4. Decision Tree

Next I will build decision tree models: first a single decision tree, then a boosted one. The boosted decision tree outperforms all my previous models, with a kappa of 0.59.

#decision tree
tr.dt = C5.0(train.x, train.y)
tr.dt
## 
## Call:
## C5.0.default(x = train.x, y = train.y)
## 
## Classification Tree
## Number of samples: 4409 
## Number of predictors: 11 
## 
## Tree size: 76 
## 
## Non-standard options: attempt to group attributes
tr.dt.pred = predict(tr.dt, test.x)
confMat = confusionMatrix(tr.dt.pred, test.y, positive="1")
tr.dt.eval = list(auc = NA, confusionMatrix = confMat); tr.dt.eval
## $auc
## [1] NA
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 338  51
##          1  45  55
##                                          
##                Accuracy : 0.8037         
##                  95% CI : (0.7657, 0.838)
##     No Information Rate : 0.7832         
##     P-Value [Acc > NIR] : 0.1483         
##                                          
##                   Kappa : 0.4098         
##  Mcnemar's Test P-Value : 0.6098         
##                                          
##             Sensitivity : 0.5189         
##             Specificity : 0.8825         
##          Pos Pred Value : 0.5500         
##          Neg Pred Value : 0.8689         
##              Prevalence : 0.2168         
##          Detection Rate : 0.1125         
##    Detection Prevalence : 0.2045         
##       Balanced Accuracy : 0.7007         
##                                          
##        'Positive' Class : 1              
## 
# using boosting
tr.dtboost = C5.0(train.x, train.y, trials = 10)
tr.dtboost
## 
## Call:
## C5.0.default(x = train.x, y = train.y, trials = 10)
## 
## Classification Tree
## Number of samples: 4409 
## Number of predictors: 11 
## 
## Number of boosting iterations: 10 
## Average tree size: 73.9 
## 
## Non-standard options: attempt to group attributes
tr.dtboost.pred = predict(tr.dtboost, test.x)
confMat = confusionMatrix(tr.dtboost.pred, test.y, positive="1")
tr.dtboost.eval = list(auc = NA, confusionMatrix = confMat); tr.dtboost.eval
## $auc
## [1] NA
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 365  44
##          1  18  62
##                                           
##                Accuracy : 0.8732          
##                  95% CI : (0.8404, 0.9014)
##     No Information Rate : 0.7832          
##     P-Value [Acc > NIR] : 2.091e-07       
##                                           
##                   Kappa : 0.5903          
##  Mcnemar's Test P-Value : 0.001498        
##                                           
##             Sensitivity : 0.5849          
##             Specificity : 0.9530          
##          Pos Pred Value : 0.7750          
##          Neg Pred Value : 0.8924          
##              Prevalence : 0.2168          
##          Detection Rate : 0.1268          
##    Detection Prevalence : 0.1636          
##       Balanced Accuracy : 0.7690          
##                                           
##        'Positive' Class : 1               
## 
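
The auc entries above are NA because predict() returned hard class labels. As a sketch (not run in the original analysis), C5.0 can also return class probabilities, which eval_class() could score to fill in an AUC; the same idea applies to the random forest models in the next section:

tr.dtboost.prob = predict(tr.dtboost, test.x, type = "prob")[, "1"]
eval_class(tr.dtboost.prob, test.y)  # adds an AUC and a kappa-optimized cutoff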

5. Random Forest

Similar to Session 1, I will build a random forest model and use the train function to choose the best hyper-parameters. The best kappa from this model reaches 0.64, with an overall accuracy of 0.89, which is an improvement over the previous models.

tr.rfclass = randomForest(excellent~., data=train, ntree=1000, mtry=sqrt(p))
tr.rfclass
## 
## Call:
##  randomForest(formula = excellent ~ ., data = train, ntree = 1000,      mtry = sqrt(p)) 
##                Type of random forest: classification
##                      Number of trees: 1000
## No. of variables tried at each split: 3
## 
##         OOB estimate of  error rate: 10.95%
## Confusion matrix:
##      0   1 class.error
## 0 3339 116  0.03357453
## 1  367 587  0.38469602
tr.rfclass.prob = predict(tr.rfclass, test.x)
confMat = confusionMatrix(tr.rfclass.prob, test.y, positive="1")
tr.rfclass.eval = list(auc = NA,
                       confusionMatrix = confMat); tr.rfclass.eval
## $auc
## [1] NA
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 367  41
##          1  16  65
##                                           
##                Accuracy : 0.8834          
##                  95% CI : (0.8516, 0.9105)
##     No Information Rate : 0.7832          
##     P-Value [Acc > NIR] : 6.216e-09       
##                                           
##                   Kappa : 0.6247          
##  Mcnemar's Test P-Value : 0.001478        
##                                           
##             Sensitivity : 0.6132          
##             Specificity : 0.9582          
##          Pos Pred Value : 0.8025          
##          Neg Pred Value : 0.8995          
##              Prevalence : 0.2168          
##          Detection Rate : 0.1329          
##    Detection Prevalence : 0.1656          
##       Balanced Accuracy : 0.7857          
##                                           
##        'Positive' Class : 1               
## 
ct = trainControl(method = "repeatedcv", number = 10, repeats = 2)
grid_rf = expand.grid(.mtry = c(2, 3, 6))
set.seed(1)
cl = makePSOCKcluster(4)
registerDoParallel(cl)
tr.cvrfclass = train(excellent~., data = train,
                method = 'rf',
                metric = "Kappa",
                trControl = ct,
                tuneGrid = grid_rf)
stopCluster(cl)
save(tr.cvrfclass, file = "~/Downloads/wine_train_cvrfclass.RData")
load(file = "~/Downloads/wine_train_cvrfclass.RData"); tr.cvrfclass
## Random Forest 
## 
## 4409 samples
##   11 predictor
##    2 classes: '0', '1' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold, repeated 2 times) 
## Summary of sample sizes: 3968, 3967, 3969, 3967, 3969, 3968, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.8827441  0.6146437
##   3     0.8842170  0.6216576
##   6     0.8835426  0.6216067
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 3.
tr.cvrfclass.pred = predict(tr.cvrfclass, test.x)
confMat = confusionMatrix(tr.cvrfclass.pred, test.y, positive = "1")
tr.cvrfclass.eval = list(auc = NA,
                         confusionMatrix = confMat); tr.cvrfclass.eval
## $auc
## [1] NA
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 369  40
##          1  14  66
##                                           
##                Accuracy : 0.8896          
##                  95% CI : (0.8584, 0.9159)
##     No Information Rate : 0.7832          
##     P-Value [Acc > NIR] : 5.991e-10       
##                                           
##                   Kappa : 0.6431          
##  Mcnemar's Test P-Value : 0.0006688       
##                                           
##             Sensitivity : 0.6226          
##             Specificity : 0.9634          
##          Pos Pred Value : 0.8250          
##          Neg Pred Value : 0.9022          
##              Prevalence : 0.2168          
##          Detection Rate : 0.1350          
##    Detection Prevalence : 0.1636          
##       Balanced Accuracy : 0.7930          
##                                           
##        'Positive' Class : 1               
## 

6. Support Vector Machine

The SVM model is built in a similar way as in Session 1. It does not perform as well as the random forest model; however, by choosing a different kernel and tuning other hyper-parameters, it could potentially reach better performance (a tuning sketch follows the output below).

tr.svmclass = ksvm(excellent ~ . , 
                data=train,
                kernel='rbfdot',
                prob.model = T,
                C=1)
tr.svmclass
## Support Vector Machine object of class "ksvm" 
## 
## SV type: C-svc  (classification) 
##  parameter : cost C = 1 
## 
## Gaussian Radial Basis kernel function. 
##  Hyperparameter : sigma =  0.0810442293731343 
## 
## Number of Support Vectors : 1876 
## 
## Objective Function Value : -1634.86 
## Training error : 0.160581 
## Probability model included.
tr.svm.pred = predict(tr.svmclass, test.x, type = "probabilities")[,2]
tr.svm.eval = eval_class(tr.svm.pred, test.y, plot = T, title = "SVM: "); tr.svm.eval

## $auc
## [1] 0.857
## 
## $confusionMatrix
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction   0   1
##          0 326  27
##          1  57  79
##                                           
##                Accuracy : 0.8282          
##                  95% CI : (0.7918, 0.8606)
##     No Information Rate : 0.7832          
##     P-Value [Acc > NIR] : 0.007917        
##                                           
##                   Kappa : 0.5411          
##  Mcnemar's Test P-Value : 0.001555        
##                                           
##             Sensitivity : 0.7453          
##             Specificity : 0.8512          
##          Pos Pred Value : 0.5809          
##          Neg Pred Value : 0.9235          
##              Prevalence : 0.2168          
##          Detection Rate : 0.1616          
##    Detection Prevalence : 0.2781          
##       Balanced Accuracy : 0.7982          
##                                           
##        'Positive' Class : 1               
## 
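
As a sketch of the tuning mentioned above (the grid values are illustrative starting points, not tuned results, and tr.cvsvmclass is a hypothetical object name), caret's svmRadial method could search over the cost C and kernel sigma using the same repeated cross-validation set-up as the random forest:

cl = makePSOCKcluster(4)
registerDoParallel(cl)
set.seed(1)
svm_grid = expand.grid(sigma = c(0.05, 0.08, 0.12), C = c(0.5, 1, 2, 4))
tr.cvsvmclass = train(excellent ~ ., data = train,
                      method = "svmRadial",
                      metric = "Kappa",
                      trControl = trainControl(method = "repeatedcv", number = 10, repeats = 2),
                      tuneGrid = svm_grid)
stopCluster(cl)
tr.cvsvmclass$bestTune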

Summary of Session 2

Finally, I will put all the classification model performance measures in the following table. Consistent with my conclusion in Session 1, the random forest model outperforms all the other models, with the highest accuracy of 0.89 and the highest kappa of 0.64.

glm = c(auc = tr.glm.eval$auc, 
        tr.glm.eval$confusionMatrix$overall, 
        tr.glm.eval$confusionMatrix$byClass)
cv.lasso = c(auc = tr.cvlasso.eval$auc, 
             tr.cvlasso.eval$confusionMatrix$overall, 
             tr.cvlasso.eval$confusionMatrix$byClass)
decision.tree = c(auc = tr.dt.eval$auc, 
                 tr.dt.eval$confusionMatrix$overall, 
                 tr.dt.eval$confusionMatrix$byClass)
decision.tree.boost = c(auc = tr.dtboost.eval$auc, 
                         tr.dtboost.eval$confusionMatrix$overall, 
                         tr.dtboost.eval$confusionMatrix$byClass)
rf = c(auc = tr.rfclass.eval$auc, 
       tr.rfclass.eval$confusionMatrix$overall, 
       tr.rfclass.eval$confusionMatrix$byClass)
cv.rf = c(auc = tr.cvrfclass.eval$auc, 
          tr.cvrfclass.eval$confusionMatrix$overall, 
          tr.cvrfclass.eval$confusionMatrix$byClass)
svm = c(auc = tr.svm.eval$auc, 
        tr.svm.eval$confusionMatrix$overall, 
        tr.svm.eval$confusionMatrix$byClass)

all = cbind(glm, cv.lasso, 
            decision.tree, decision.tree.boost,
            rf, cv.rf, svm) %>% data.frame()

knitr::kable(all %>% round(3),
             caption = "comparing all models")
comparing all models
                      glm    cv.lasso  decision.tree  decision.tree.boost  rf     cv.rf  svm
auc                   0.803  0.795     NA             NA                   NA     NA     0.857
Accuracy              0.769  0.781     0.804          0.873                0.883  0.890  0.828
Kappa                 0.407  0.410     0.410          0.590                0.625  0.643  0.541
AccuracyLower         0.729  0.742     0.766          0.840                0.852  0.858  0.792
AccuracyUpper         0.806  0.817     0.838          0.901                0.911  0.916  0.861
AccuracyNull          0.783  0.783     0.783          0.783                0.783  0.783  0.783
AccuracyPValue        0.796  0.569     0.148          0.000                0.000  0.000  0.008
McnemarPValue         0.000  0.012     0.610          0.001                0.001  0.001  0.002
Sensitivity           0.670  0.623     0.519          0.585                0.613  0.623  0.745
Specificity           0.796  0.825     0.883          0.953                0.958  0.963  0.851
Pos Pred Value        0.477  0.496     0.550          0.775                0.802  0.825  0.581
Neg Pred Value        0.897  0.888     0.869          0.892                0.900  0.902  0.924
Precision             0.477  0.496     0.550          0.775                0.802  0.825  0.581
Recall                0.670  0.623     0.519          0.585                0.613  0.623  0.745
F1                    0.557  0.552     0.534          0.667                0.695  0.710  0.653
Prevalence            0.217  0.217     0.217          0.217                0.217  0.217  0.217
Detection Rate        0.145  0.135     0.112          0.127                0.133  0.135  0.162
Detection Prevalence  0.305  0.272     0.204          0.164                0.166  0.164  0.278
Balanced Accuracy     0.733  0.724     0.701          0.769                0.786  0.793  0.798
all$evaluate = rownames(all)
long = gather(all, key="method", value = "measure", 1:7)
ggplot(long, aes(method, evaluate)) + 
  geom_tile(aes(fill = measure), colour = "white") +
  scale_fill_gradient(low = "white", high = "steelblue") +
  theme(axis.text.x = element_text(angle = 20, hjust = 1)) + 
  labs(title = "Visualizing model performance measures")+ 
  geom_text(aes(label = round(measure,2))) + 
  guides(fill = FALSE) 


Conclusion

After trying many models and optimizing their hyper-parameters, this study concludes that the random forest model performs the best for both prediction and classification. For predicting wine quality, the RF model gives the best MAE of 0.37 and RMSE of 0.51, which is not bad considering that the wine quality score can range from 0 to 10. For classifying high-quality wines, the RF model gives the best overall accuracy of 0.89, kappa of 0.64, sensitivity of 0.62, and precision of 0.82.


Limitations and Discussion

The random forest models built in this study perform the best; however, they are still far from perfect. To further improve model performance, I could try the following steps.

  1. Take a closer look at the misclassified observations from the confusion matrix (in the prediction case, the observations with large differences between predicted and true quality), and try to understand why these wines are classified wrongly (see the sketch after this list).

  2. Learn more about wine quality control to get a better sense of the prior knowledge in the wine-producing business, so that I might be able to properly transform some features or interpret the interactions between features.

  3. Spend more time tuning model hyper-parameters, and ask colleagues and my manager for further suggestions.

  4. If computing power is a limitation, I could ask my manager whether there is a way to get more, for example by using on-demand cloud computing, upgrading my computer, or using a computing cluster.

  5. Communicate with my clients to understand their goal more clearly: whether it is to predict wine quality or to pick out wines with superior quality. Also, try to understand whether they have more tolerance for Type I or Type II errors, so that I can set the model cutoff value accordingly.

  6. Ask clients if they have more data available, especially for the underrepresented classes, such as wines with high or low quality. Using more data, I could build a more complicated model, such as a neural network with more layers.

  7. Ask clients if they have other related features. The model could be improved by adding these features in.
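
For item 1, a minimal sketch of such an inspection, assuming the Session 2 objects (tr.cvrfclass.pred, test, test.y) are still in memory; the goal is simply to compare the feature distributions of the misclassified test wines with the rest:

# pull out the test wines misclassified by the tuned random forest
miscl = test[tr.cvrfclass.pred != test.y, ]
dim(miscl)
summary(miscl)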