1 Executive Summary

The research showed that high-quality wines tend to have lower volatile acidity level. It is not a surprise since volatile acidity indicates the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. Also, research showed that high-quality wines are stronger (in alcohol %) than low-quality wines, generally.

At the modeling stage among regression models, ridge regression showed the lowest MSE on the test data. Classification model based on LDA with all predictors showed 0.25 test error rate, which is not a bad result.

2 Introduction

In this work, we explore red wines from the dataset which was published by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. The article is available at [1],[2]. The data and some description can be found in [3],[4]:

The goal of our analysis is to understand which physicochemical features might affect the quality (taste) of the red wine and to create models to predict the quality of the red wine.

The dataset has 1597 observations of 12 variables.

Here is some description of our variables¹:

3 Predictor variables:

   1 - fixed acidity (tartaric acid - g / dm^3)
   2 - volatile acidity (acetic acid - g / dm^3)
   3 - citric acid (g / dm^3)
   4 - residual sugar (g / dm^3)
   5 - chlorides (sodium chloride - g / dm^3)
   6 - free sulfur dioxide (mg / dm^3)
   7 - total sulfur dioxide (mg / dm^3)
   8 - density (g / cm^3)
   9 - pH
   10 - sulphates (potassium sulphate - g / dm^3)
   11 - alcohol (% by volume)

4 Response variable:

   12 - quality (score between 0 and 10)

The quality variable is the variable that we are trying to predict.

5 Description of the variables:

1 - fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)

2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste

3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines

4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet

5 - chlorides: the amount of salt in the wine

6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine

7 - total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine

8 - density: the density of water is close to that of water depending on the percent alcohol and sugar content

9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale

10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, which acts as an antimicrobial and antioxidant

11 - alcohol: the percent alcohol content of the wine

12 - quality: median of at least 3 evaluations made by wine experts; each expert graded the wine quality between 0 (very bad) and 10 (very excellent)

6 Analysis and Methods

In this section we first explore the data to get more sense of how our variables are distributed and of how good wines differ from bad wines, then we build regression models to predict quality scores and classification model to predict “high” or “low” quality of a wine.

6.1 Data exploration

We begin with an exploration of distributions. Numerical summary of the data can be found in Appendix A, but it is better to inspect the distributions visually. One important take-away from the numerical summary is that we have no missing values (NA).

6.2 Quality scores distribution:

## 
##   3   4   5   6   7   8 
##  10  53 679 638 199  18

The quality ranges from 3 to 8. Most of the scores are 5 and 6 (medium quality). We have very little data for scores at the tails (unbalanced data), this might be a problem. One of the approaches to address this issue is to classify quality scores by “high” and “low” (e.g., \(>5\) - high-quality, \(\leq 5\) - low-quality), in this case, we will have balanced classes. For this purpose, we will build a classification model.

Here we can explore distributions of all the variables visually.

7 Histograms:

8 Boxplots:

Here we observe that:

fixed.acidity, volatile.acidity, free.sulfur.dioxide, total.sulfur.dioxide, sulphates, alcohol have a heavier right tail (right-skewed distributions)
density and pH are quite symmetric
residual.sugar level in the samples is mostly concentrated between 1 and 3 \(g/dm^3\)
chlorides are mostly concentrated between 0.5 and 1 \(g/dm^3\)
quality scores are mostly medium (5,6) with very little amount of records at both tails (3 and 8 scores)

It is also important to know how predictors are related with our response variable:

From the boxplots above based on the interquartile ranges and median changes with respect to quality score, we observe the following patterns:

high-quality wines tend to have smaller volatile.acidity level
high-quality wines tend to have larger citric.acid level
high-quality wines tend to have smaller density level
high-quality wines tend to have smaller pH level
high-quality wines tend to have larger sulphates level
high-quality wines tend to have larger alcohol level

Here we can examine relationships between predictors:

However, the information above can be easier summarized by the correlation plot (which reflects the measure of linear dependence):

From the correlation plot above the most correlated variables are:

fixed.acidity and pH (-0.68)
fixed.acidity and density (0.67)
fixed.acidity and citric.acid (0.67)
free.sulfur.dioxide and total.sulfur.dioxide (0.67)
citric.acid and pH (-0.54)
alcohol and density (-0.5)

The most correlated variables with quality are:

volatile.acidity (-0.39)
alcohol (0.48)

The last two points where also noticed on predictors vs. quality plot, that with quality score increase volatile acidity tend to decrease, and that with quality score increase alcohol level tend to also increase. We can test these hypotheses with ANOVA test, and then conducting post hoc for differences between group means if the result of ANOVA test was significant.

ANOVA tests in both cases gave significant results (p-values < 0.05).

In the post hoc test the differences between volatile acidity levels are almost all significant (except 8-6 and 8-7) and lower than zero, so we can say that high-quality wines tend to have lower volatile acidity levels. Appendix A.1.

In the post hoc test the differences between alcohol levels for high-quality wines vs. other groups are all significant (adjusted p-values < 0.05) and greater than zero, so we can say that high-quality wines tend to have higher alcohol level. Appendix A.2.

8.1 Modeling

First, we split the whole data into train and test in 80%/20% proportion. The train set we will use to train our models, the test set will be used for evaluation.

8.1.1 Regression

Since we have correlated predictors, it is not a good idea to fit a linear regression model with all predictors at once. Instead, we will try regularization and variable selection techniques.

8.1.1.1 Best Subset Selection

We will select models based on minimal \(Cp\) and \(BIC\) values. Code and output are in Appendix B.

The \(Cp\) model uses volatile.acidity, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, pH, sulphates, alcohol as predictors.

The test MSE for \(Cp\) model is 0.4085839.

The \(BIC\) model uses volatile.acidity, chlorides, total.sulfur.dioxide, pH, sulphates, alcohol as predictors.

The test MSE for \(BIC\) model is 0.4102632 (higher than in \(Cp\) model).

8.1.1.2 Ridge regression

Best lambda is selected based on cross-validation. Code and output are in Appendix C.

The test MSE for the ridge model is 0.4081041, lower than in previous models.

8.1.1.3 Lasso regression

Best lambda is selected based on cross-validation. Code and output are in Appendix D.

The test MSE for the lasso model is 0.4102508, not an improvement over previous models. Note that lasso dropped fixed.acidity, citric.acid and density.

8.1.1.4 Ordinal Logistic Regression

Since we have ordinal responses, it is reasonable to try Ordinal Logistic Regression model. Code and output are in Appendix E.

The test MSE for this model is 0.478125, significantly higher than in previous models. We can also view how model predicts vs. actual in confusion matrix.

8.1.2 Classification

First, we simply divide quality scores into “high” and “low” categories (\(> 5\) - high-quality, \(\leq 5\) - low-quality). We name this variable qualcat.

Here we can’t use logistic regression since we have perfect separation issue, so we decided to use LDA. Code and output are in Appendix F.

The error rate for LDA model with all predictors is 0.25. Confusion matrix is:

##       pred
##        high low
##   high  142  38
##   low    42  98

It is also interesting to compare error rates for regression models with this classification model. We can map regression responses into “high” and “low” categories (\(> 5\) - high, \(\leq 5\) - low).

The lowest error rate was obtained in ordinal logistic regression case (0.259375), it is slightly higher than with the LDA model. For other regression models error rate was above 0.41. Code and output are in Appendix G.

9 Conclusion and Summary

The analysis showed that high-quality wines tend to have lower volatile acidity level (Appendix A.1). It is not a surprise since volatile acidity indicates the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. Also, it showed that high-quality wines are stronger (in alcohol %) than low-quality wines (Appendix A.2).

By examining correlations and pairs plot, we observe that the quality is positively related with alcohol level (stronger wines are usually of higher quality), sulphates, citric.acid (which adds freshness and flavor to the wine), and negatively related with volatile.acidity (which at too high levels can lead to an unpleasant, vinegar taste). We also found relationship of quality with pH and density, but this is probably due to the causal relation of fixed.acidity on pH, and causal relation of alcohol on density. Surprisingly, it appears that residual.sugar level has almost no relationship with the quality.

The strongest correlation among variables is between fixed.acidity and pH (-0.68), between fixed.acidity and density (0.67), between fixed.acidity and citric.acid (0.67), between free.sulfur.dioxide and total.sulfur.dioxide (0.67).

The strongest correlation quality has with alcohol (0.48) and volatile.acidity (-0.39).

At the modeling stage among regression models, ridge regression showed the lowest MSE on the test data. For classification, we decided to use linear discriminant analysis (LDA), since unlike logistic regression it doesn’t suffer from collinear predictors and perfect separation (remember, we divided quality by \(>5\) and \(\leq 5\)). The model based on LDA with all predictors showed 0.25 test error rate, which is not a bad result. We also tried to map predictions of the regression models to “low” and “high” quality categories to compare test error rates with LDA, but almost all models resulted above 0.41 test error rate, only ordinal logistic model showed about 0.259 test error rate which is slightly higher than with LDA.

One of the directions of future research might be in adding white wines data and performing the similar analysis to figure out how red and white wines are different.

10 References

[1] https://repositorium.sdum.uminho.pt/bitstream/1822/10029/1/wine5.pdf

[2] https://www.sciencedirect.com/science/article/pii/S0167923609001377

[3] https://archive.ics.uci.edu/ml/datasets/wine+quality

[4] https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009

11 Appendix

library(corrplot)
library(leaps)
library(glmnet)
library(MASS)

11.1 A

setwd("C:/Users/sakukadi/Documents/wineAnalysis/Dataset")
df = read.csv("winequality.csv")
dim(df)

## [1] 1597   12

# numerical summary of the data
summary(df)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar    chlorides     
##  Min.   : 4.60   Min.   :0.120    Min.   :0.000   Min.   : 0.90   Min.   :0.0120  
##  1st Qu.: 7.10   1st Qu.:0.390    1st Qu.:0.090   1st Qu.: 1.90   1st Qu.:0.0700  
##  Median : 7.90   Median :0.520    Median :0.260   Median : 2.20   Median :0.0790  
##  Mean   : 8.32   Mean   :0.528    Mean   :0.271   Mean   : 2.54   Mean   :0.0875  
##  3rd Qu.: 9.20   3rd Qu.:0.640    3rd Qu.:0.420   3rd Qu.: 2.60   3rd Qu.:0.0900  
##  Max.   :15.90   Max.   :1.580    Max.   :1.000   Max.   :15.50   Max.   :0.6110  
##  free.sulfur.dioxide total.sulfur.dioxide    density            pH         sulphates    
##  Min.   : 1.0        Min.   :  6.0        Min.   :0.990   Min.   :2.74   Min.   :0.330  
##  1st Qu.: 7.0        1st Qu.: 22.0        1st Qu.:0.996   1st Qu.:3.21   1st Qu.:0.550  
##  Median :14.0        Median : 38.0        Median :0.997   Median :3.31   Median :0.620  
##  Mean   :15.8        Mean   : 46.4        Mean   :0.997   Mean   :3.31   Mean   :0.658  
##  3rd Qu.:21.0        3rd Qu.: 62.0        3rd Qu.:0.998   3rd Qu.:3.40   3rd Qu.:0.730  
##  Max.   :72.0        Max.   :289.0        Max.   :1.004   Max.   :4.01   Max.   :2.000  
##     alcohol        quality    
##  Min.   : 8.4   Min.   :3.00  
##  1st Qu.: 9.5   1st Qu.:5.00  
##  Median :10.2   Median :6.00  
##  Mean   :10.4   Mean   :5.64  
##  3rd Qu.:11.1   3rd Qu.:6.00  
##  Max.   :14.9   Max.   :8.00

table(df$quality)

## 
##   3   4   5   6   7   8 
##  10  53 679 638 199  18

11.2 A.1

aov1 = aov(volatile.acidity ~ factor(quality), data=df)
summary(aov1)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## factor(quality)    5    8.2   1.642    60.7 <2e-16 ***
## Residuals       1591   43.0   0.027                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(aov1)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = volatile.acidity ~ factor(quality), data = df)
## 
## $`factor(quality)`
##         diff      lwr      upr  p adj
## 4-3 -0.19054 -0.35227 -0.02881 0.0103
## 5-3 -0.30761 -0.45704 -0.15819 0.0000
## 6-3 -0.38702 -0.53652 -0.23752 0.0000
## 7-3 -0.48058 -0.63260 -0.32856 0.0000
## 8-3 -0.46117 -0.64618 -0.27615 0.0000
## 5-4 -0.11708 -0.18398 -0.05017 0.0000
## 6-4 -0.19648 -0.26354 -0.12942 0.0000
## 7-4 -0.29004 -0.36255 -0.21753 0.0000
## 8-4 -0.27063 -0.39860 -0.14266 0.0000
## 6-5 -0.07940 -0.10527 -0.05354 0.0000
## 7-5 -0.17297 -0.21078 -0.13515 0.0000
## 8-5 -0.15355 -0.26557 -0.04153 0.0013
## 7-6 -0.09356 -0.13165 -0.05548 0.0000
## 8-6 -0.07415 -0.18627  0.03796 0.4105
## 8-7  0.01941 -0.09605  0.13487 0.9969

11.3 A.2

aov2 = aov(alcohol ~ factor(quality), data=df)
summary(aov2)

##                   Df Sum Sq Mean Sq F value Pr(>F)    
## factor(quality)    5    483    96.5     115 <2e-16 ***
## Residuals       1591   1331     0.8                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

TukeyHSD(aov2)

##   Tukey multiple comparisons of means
##     95% family-wise confidence level
## 
## Fit: aov(formula = alcohol ~ factor(quality), data = df)
## 
## $`factor(quality)`
##         diff       lwr      upr  p adj
## 4-3  0.31009 -0.589478 1.209667 0.9233
## 5-3 -0.05412 -0.885265 0.777032 1.0000
## 6-3  0.67452 -0.157017 1.506055 0.1887
## 7-3  1.51091  0.665341 2.356485 0.0000
## 8-3  2.13944  1.110370 3.168518 0.0000
## 5-4 -0.36421 -0.736334 0.007912 0.0591
## 6-4  0.36442 -0.008563 0.737413 0.0599
## 7-4  1.20082  0.797508 1.604129 0.0000
## 8-4  1.82935  1.117549 2.541151 0.0000
## 6-5  0.72864  0.584772 0.872499 0.0000
## 7-5  1.56503  1.354705 1.775354 0.0000
## 8-5  2.19356  1.570473 2.816648 0.0000
## 7-6  0.83639  0.624543 1.048244 0.0000
## 8-6  1.46493  0.841321 2.088529 0.0000
## 8-7  0.62853 -0.013669 1.270732 0.0592

11.4 B

# split to train and test, 80%/20%
set.seed(344421)
train_ind = sample.int(n=nrow(df), size=floor(0.8*nrow(df)), replace=F)
train = df[train_ind,]
test = df[-train_ind,]

# best subset selection
best.subset.regr = regsubsets(quality ~. , data=train)
best.subset.summary = summary(best.subset.regr)
best.subset.summary

## Subset selection object
## Call: regsubsets.formula(quality ~ ., data = train)
## 11 Variables  (and intercept)
##                      Forced in Forced out
## fixed.acidity            FALSE      FALSE
## volatile.acidity         FALSE      FALSE
## citric.acid              FALSE      FALSE
## residual.sugar           FALSE      FALSE
## chlorides                FALSE      FALSE
## free.sulfur.dioxide      FALSE      FALSE
## total.sulfur.dioxide     FALSE      FALSE
## density                  FALSE      FALSE
## pH                       FALSE      FALSE
## sulphates                FALSE      FALSE
## alcohol                  FALSE      FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
##          fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1  ( 1 ) " "           " "              " "         " "            " "      
## 2  ( 1 ) " "           "*"              " "         " "            " "      
## 3  ( 1 ) " "           "*"              " "         " "            " "      
## 4  ( 1 ) " "           "*"              " "         " "            "*"      
## 5  ( 1 ) " "           "*"              " "         " "            "*"      
## 6  ( 1 ) " "           "*"              " "         " "            "*"      
## 7  ( 1 ) " "           "*"              " "         " "            "*"      
## 8  ( 1 ) " "           "*"              " "         "*"            "*"      
##          free.sulfur.dioxide total.sulfur.dioxide density pH  sulphates alcohol
## 1  ( 1 ) " "                 " "                  " "     " " " "       "*"    
## 2  ( 1 ) " "                 " "                  " "     " " " "       "*"    
## 3  ( 1 ) " "                 " "                  " "     " " "*"       "*"    
## 4  ( 1 ) " "                 " "                  " "     " " "*"       "*"    
## 5  ( 1 ) " "                 "*"                  " "     " " "*"       "*"    
## 6  ( 1 ) " "                 "*"                  " "     "*" "*"       "*"    
## 7  ( 1 ) "*"                 "*"                  " "     "*" "*"       "*"    
## 8  ( 1 ) "*"                 "*"                  " "     "*" "*"       "*"

par(mfrow=c(1,2))
plot(best.subset.summary$cp, xlab="Number of Variables", ylab="Cp", type="l")
plot(best.subset.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")

# best model based on Cp
which.min(best.subset.summary$cp)

## [1] 7

coef(best.subset.regr, which.min(best.subset.summary$cp))

##          (Intercept)     volatile.acidity            chlorides  free.sulfur.dioxide 
##             4.252836            -0.987651            -2.212041             0.004899 
## total.sulfur.dioxide                   pH            sulphates              alcohol 
##            -0.003381            -0.464992             0.780754             0.308388

fit.cp = lm(quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
            total.sulfur.dioxide + pH + sulphates + alcohol, data=train)
summary(fit.cp)

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide + 
##     total.sulfur.dioxide + pH + sulphates + alcohol, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.6546 -0.3870 -0.0457  0.4595  1.9619 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.252836   0.456424    9.32  < 2e-16 ***
## volatile.acidity     -0.987651   0.112866   -8.75  < 2e-16 ***
## chlorides            -2.212041   0.429634   -5.15  3.0e-07 ***
## free.sulfur.dioxide   0.004899   0.002359    2.08  0.03804 *  
## total.sulfur.dioxide -0.003381   0.000745   -4.54  6.2e-06 ***
## pH                   -0.464992   0.133396   -3.49  0.00051 ***
## sulphates             0.780754   0.124004    6.30  4.2e-10 ***
## alcohol               0.308388   0.019032   16.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.651 on 1269 degrees of freedom
## Multiple R-squared:  0.363,  Adjusted R-squared:  0.359 
## F-statistic:  103 on 7 and 1269 DF,  p-value: <2e-16

cp.pred = predict(fit.cp, test)
# test MSE for Cp model
mean((test$quality-cp.pred)^2)

## [1] 0.4086

# best model based on BIC
which.min(best.subset.summary$bic)

## [1] 6

coef(best.subset.regr, which.min(best.subset.summary$bic))

##          (Intercept)     volatile.acidity            chlorides total.sulfur.dioxide 
##             4.124048            -1.011060            -2.191147            -0.002351 
##                   pH            sulphates              alcohol 
##            -0.416662             0.789503             0.308665

fit.bic = lm(quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
              pH + sulphates + alcohol, data=train)
summary(fit.bic)

## 
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide + 
##     pH + sulphates + alcohol, data = train)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5750 -0.3703 -0.0471  0.4646  1.9430 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)           4.124048   0.452780    9.11  < 2e-16 ***
## volatile.acidity     -1.011060   0.112448   -8.99  < 2e-16 ***
## chlorides            -2.191147   0.430076   -5.09  4.0e-07 ***
## total.sulfur.dioxide -0.002351   0.000557   -4.22  2.6e-05 ***
## pH                   -0.416662   0.131522   -3.17   0.0016 ** 
## sulphates             0.789503   0.124094    6.36  2.8e-10 ***
## alcohol               0.308665   0.019056   16.20  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.652 on 1270 degrees of freedom
## Multiple R-squared:  0.36,   Adjusted R-squared:  0.357 
## F-statistic:  119 on 6 and 1270 DF,  p-value: <2e-16

bic.pred = predict(fit.bic, test)
# test MSE 
mean((test$quality-bic.pred)^2)

## [1] 0.4103

11.5 C

# prepare the data for modeling
x.train = model.matrix(quality ~. , train)[,-1]
y.train = train$quality
x.test = model.matrix(quality ~. , test)[,-1]
y.test = test$quality

# performing ridge regression
set.seed(25135)
cv.ridge = cv.glmnet(x.train, y.train, alpha=0)
par(mfrow=c(1,1))
plot(cv.ridge)

bestlam = cv.ridge$lambda.min
bestlam

## [1] 0.04388

ridge.best = glmnet(x.train, y.train, alpha=0, lambda=bestlam)
coef(ridge.best)

## 12 x 1 sparse Matrix of class "dgCMatrix"
##                              s0
## (Intercept)           32.752820
## fixed.acidity          0.028953
## volatile.acidity      -0.970898
## citric.acid           -0.039837
## residual.sugar         0.025009
## chlorides             -2.027835
## free.sulfur.dioxide    0.003452
## total.sulfur.dioxide  -0.003061
## density              -29.130570
## pH                    -0.272972
## sulphates              0.788221
## alcohol                0.268435

ridge.pred = predict(ridge.best, x.test)
# test MSE for ridge model
mean((y.test-ridge.pred)^2)

## [1] 0.4081

11.6 D

# performing lasso regression
set.seed(56847)
cv.lasso = cv.glmnet(x.train, y.train, alpha=1)
par(mfrow=c(1,1))
plot(cv.lasso)

bestlam = cv.lasso$lambda.min
bestlam

## [1] 0.009675

lasso.best = glmnet(x.train, y.train, alpha=1, lambda=bestlam)
coef(lasso.best)

## 12 x 1 sparse Matrix of class "dgCMatrix"
##                             s0
## (Intercept)           3.925186
## fixed.acidity         .       
## volatile.acidity     -1.001092
## citric.acid           .       
## residual.sugar        0.007181
## chlorides            -1.877945
## free.sulfur.dioxide   0.001853
## total.sulfur.dioxide -0.002542
## density               .       
## pH                   -0.341526
## sulphates             0.712932
## alcohol               0.301817

lasso.pred = predict(lasso.best, x.test)
# test MSE for lasso model
mean((y.test-lasso.pred)^2)

## [1] 0.4103

11.7 E

# ordinal logistic regression
polr.fit = polr(factor(quality) ~., data=train, Hess=T, method="logistic")
summary(polr.fit)

## Call:
## polr(formula = factor(quality) ~ ., data = train, Hess = T, method = "logistic")
## 
## Coefficients:
##                         Value Std. Error t value
## fixed.acidity          0.1259    0.05754    2.19
## volatile.acidity      -3.2131    0.44435   -7.23
## citric.acid           -0.6095    0.51869   -1.18
## residual.sugar         0.1047    0.04083    2.57
## chlorides             -6.0527    1.46154   -4.14
## free.sulfur.dioxide    0.0132    0.00747    1.77
## total.sulfur.dioxide  -0.0107    0.00255   -4.18
## density              -82.9997    1.09813  -75.58
## pH                    -0.7265    0.56131   -1.29
## sulphates              2.5590    0.39688    6.45
## alcohol                0.8739    0.06748   12.95
## 
## Intercepts:
##     Value   Std. Error t value
## 3|4 -81.669   1.126    -72.546
## 4|5 -79.720   1.124    -70.948
## 5|6 -76.027   1.128    -67.401
## 6|7 -73.239   1.142    -64.136
## 7|8 -70.060   1.179    -59.443
## 
## Residual Deviance: 2465.76 
## AIC: 2497.76

polr.pred = predict(polr.fit, x.test)
# test MSE 
mean((y.test-as.numeric(as.character(polr.pred)))^2)

## [1] 0.4781

# confusion matrix
table(y.test, polr.pred)

##       polr.pred
## y.test  3  4  5  6  7  8
##      3  0  0  2  0  0  0
##      4  0  0  6  3  0  0
##      5  0  1 89 39  0  0
##      6  0  0 40 93 10  0
##      7  0  0  1 22  9  0
##      8  0  0  0  2  3  0

11.8 F

# we divide simply: > 5 - high, <=5 - low
df$qualcat = factor(ifelse(df$quality > 5,"high","low"))
table(df$qualcat)

## 
## high  low 
##  855  742

train = df[train_ind,]
test = df[-train_ind,]

# performing LDA
lda.fit = lda(qualcat ~ .-quality, data=train)
lda.fit

## Call:
## lda(qualcat ~ . - quality, data = train)
## 
## Prior probabilities of groups:
##   high    low 
## 0.5286 0.4714 
## 
## Group means:
##      fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## high         8.530           0.4729       0.304          2.593   0.08242
## low          8.171           0.5892       0.240          2.558   0.09546
##      free.sulfur.dioxide total.sulfur.dioxide density    pH sulphates alcohol
## high               15.22                39.93  0.9966 3.309    0.6901  10.824
## low                16.67                55.55  0.9971 3.306    0.6257   9.887
## 
## Coefficients of linear discriminants:
##                           LD1
## fixed.acidity        -0.09139
## volatile.acidity      2.44733
## citric.acid           0.62783
## residual.sugar       -0.04719
## chlorides             3.93835
## free.sulfur.dioxide  -0.01857
## total.sulfur.dioxide  0.01335
## density              18.96496
## pH                    0.23089
## sulphates            -1.82943
## alcohol              -0.71062

pred = predict(lda.fit, test)$class
# confusion matrix
table(test$qualcat, pred)

##       pred
##        high low
##   high  142  38
##   low    42  98

# error rate
mean(test$qualcat != pred)

## [1] 0.25

11.9 G

# best subset Cp model
pred = ifelse(cp.pred > 5,"high","low")
mean(test$qualcat != pred)

## [1] 0.4188

# best subset BIC model
pred = ifelse(bic.pred > 5,"high","low")
mean(test$qualcat != pred)

## [1] 0.4219

# ridge model
pred = ifelse(ridge.pred > 5,"high","low")
mean(test$qualcat != pred)

## [1] 0.425

# lasso model
pred = ifelse(lasso.pred > 5,"high","low")
mean(test$qualcat != pred)

## [1] 0.425

# ordinal logistic regression model
pred = ifelse(as.numeric(as.character(polr.pred)) > 5,"high","low")
mean(test$qualcat != pred)

## [1] 0.2594

Variable descriptions were taken from: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009 ↩

WINE ANALYSIS

Sanjiv Kukadia

`04/25/2018`