1 Executive Summary
The research showed that high-quality wines tend to have lower volatile acidity level. It is not a surprise since volatile acidity indicates the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. Also, research showed that high-quality wines are stronger (in alcohol %) than low-quality wines, generally.
At the modeling stage among regression models, ridge regression showed the lowest MSE on the test data. Classification model based on LDA with all predictors showed 0.25 test error rate, which is not a bad result.
2 Introduction
In this work, we explore red wines from the dataset which was published by P. Cortez, A. Cerdeira, F. Almeida, T. Matos and J. Reis. Modeling wine preferences by data mining from physicochemical properties. In Decision Support Systems, Elsevier, 47(4):547-553. ISSN: 0167-9236. The article is available at [1],[2]. The data and some description can be found in [3],[4]:
The goal of our analysis is to understand which physicochemical features might affect the quality (taste) of the red wine and to create models to predict the quality of the red wine.
The dataset has 1597 observations of 12 variables.
Here is some description of our variables1:
3 Predictor variables:
1 - fixed acidity (tartaric acid - g / dm^3)
2 - volatile acidity (acetic acid - g / dm^3)
3 - citric acid (g / dm^3)
4 - residual sugar (g / dm^3)
5 - chlorides (sodium chloride - g / dm^3)
6 - free sulfur dioxide (mg / dm^3)
7 - total sulfur dioxide (mg / dm^3)
8 - density (g / cm^3)
9 - pH
10 - sulphates (potassium sulphate - g / dm^3)
11 - alcohol (% by volume)
4 Response variable:
12 - quality (score between 0 and 10)
The quality
variable is the variable that we are trying to predict.
5 Description of the variables:
1 - fixed acidity: most acids involved with wine or fixed or nonvolatile (do not evaporate readily)
2 - volatile acidity: the amount of acetic acid in wine, which at too high of levels can lead to an unpleasant, vinegar taste
3 - citric acid: found in small quantities, citric acid can add ‘freshness’ and flavor to wines
4 - residual sugar: the amount of sugar remaining after fermentation stops, it’s rare to find wines with less than 1 gram/liter and wines with greater than 45 grams/liter are considered sweet
5 - chlorides: the amount of salt in the wine
6 - free sulfur dioxide: the free form of SO2 exists in equilibrium between molecular SO2 (as a dissolved gas) and bisulfite ion; it prevents microbial growth and the oxidation of wine
7 - total sulfur dioxide: amount of free and bound forms of S02; in low concentrations, SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm, SO2 becomes evident in the nose and taste of wine
8 - density: the density of water is close to that of water depending on the percent alcohol and sugar content
9 - pH: describes how acidic or basic a wine is on a scale from 0 (very acidic) to 14 (very basic); most wines are between 3-4 on the pH scale
10 - sulphates: a wine additive which can contribute to sulfur dioxide gas (S02) levels, which acts as an antimicrobial and antioxidant
11 - alcohol: the percent alcohol content of the wine
12 - quality: median of at least 3 evaluations made by wine experts; each expert graded the wine quality between 0 (very bad) and 10 (very excellent)
6 Analysis and Methods
In this section we first explore the data to get more sense of how our variables are distributed and of how good wines differ from bad wines, then we build regression models to predict quality scores and classification model to predict “high” or “low” quality of a wine.
6.1 Data exploration
We begin with an exploration of distributions. Numerical summary of the data can be found in Appendix A, but it is better to inspect the distributions visually. One important take-away from the numerical summary is that we have no missing values (NA).
6.2 Quality scores distribution:
##
## 3 4 5 6 7 8
## 10 53 679 638 199 18
The quality ranges from 3 to 8. Most of the scores are 5 and 6 (medium quality). We have very little data for scores at the tails (unbalanced data), this might be a problem. One of the approaches to address this issue is to classify quality scores by “high” and “low” (e.g., \(>5\) - high-quality, \(\leq 5\) - low-quality), in this case, we will have balanced classes. For this purpose, we will build a classification model.
Here we can explore distributions of all the variables visually.
7 Histograms:
8 Boxplots:
Here we observe that:
fixed.acidity
,volatile.acidity
,free.sulfur.dioxide
,total.sulfur.dioxide
,sulphates
,alcohol
have a heavier right tail (right-skewed distributions)density
andpH
are quite symmetricresidual.sugar
level in the samples is mostly concentrated between 1 and 3 \(g/dm^3\)chlorides
are mostly concentrated between 0.5 and 1 \(g/dm^3\)quality
scores are mostly medium (5,6) with very little amount of records at both tails (3 and 8 scores)
It is also important to know how predictors are related with our response variable:
From the boxplots above based on the interquartile ranges and median changes with respect to quality score, we observe the following patterns:
- high-quality wines tend to have smaller
volatile.acidity
level - high-quality wines tend to have larger
citric.acid
level - high-quality wines tend to have smaller
density
level - high-quality wines tend to have smaller
pH
level - high-quality wines tend to have larger
sulphates
level - high-quality wines tend to have larger
alcohol
level
Here we can examine relationships between predictors:
However, the information above can be easier summarized by the correlation plot (which reflects the measure of linear dependence):
From the correlation plot above the most correlated variables are:
fixed.acidity
andpH
(-0.68)fixed.acidity
anddensity
(0.67)fixed.acidity
andcitric.acid
(0.67)free.sulfur.dioxide
andtotal.sulfur.dioxide
(0.67)citric.acid
andpH
(-0.54)alcohol
anddensity
(-0.5)
The most correlated variables with quality
are:
volatile.acidity
(-0.39)alcohol
(0.48)
The last two points where also noticed on predictors vs. quality plot, that with quality score increase volatile acidity tend to decrease, and that with quality score increase alcohol level tend to also increase. We can test these hypotheses with ANOVA test, and then conducting post hoc for differences between group means if the result of ANOVA test was significant.
ANOVA tests in both cases gave significant results (p-values < 0.05).
In the post hoc test the differences between volatile acidity levels are almost all significant (except 8-6 and 8-7) and lower than zero, so we can say that high-quality wines tend to have lower volatile acidity levels. Appendix A.1.
In the post hoc test the differences between alcohol levels for high-quality wines vs. other groups are all significant (adjusted p-values < 0.05) and greater than zero, so we can say that high-quality wines tend to have higher alcohol level. Appendix A.2.
8.1 Modeling
First, we split the whole data into train and test in 80%/20% proportion. The train set we will use to train our models, the test set will be used for evaluation.
8.1.1 Regression
Since we have correlated predictors, it is not a good idea to fit a linear regression model with all predictors at once. Instead, we will try regularization and variable selection techniques.
8.1.1.1 Best Subset Selection
We will select models based on minimal \(Cp\) and \(BIC\) values. Code and output are in Appendix B.
The \(Cp\) model uses volatile.acidity
, chlorides
, free.sulfur.dioxide
, total.sulfur.dioxide
, pH
, sulphates
, alcohol
as predictors.
The test MSE for \(Cp\) model is 0.4085839.
The \(BIC\) model uses volatile.acidity
, chlorides
, total.sulfur.dioxide
, pH
, sulphates
, alcohol
as predictors.
The test MSE for \(BIC\) model is 0.4102632 (higher than in \(Cp\) model).
8.1.1.2 Ridge regression
Best lambda is selected based on cross-validation. Code and output are in Appendix C.
The test MSE for the ridge model is 0.4081041, lower than in previous models.
8.1.1.3 Lasso regression
Best lambda is selected based on cross-validation. Code and output are in Appendix D.
The test MSE for the lasso model is 0.4102508, not an improvement over previous models. Note that lasso dropped fixed.acidity
, citric.acid
and density
.
8.1.1.4 Ordinal Logistic Regression
Since we have ordinal responses, it is reasonable to try Ordinal Logistic Regression model. Code and output are in Appendix E.
The test MSE for this model is 0.478125, significantly higher than in previous models. We can also view how model predicts vs. actual in confusion matrix.
8.1.2 Classification
First, we simply divide quality scores into “high” and “low” categories (\(> 5\) - high-quality, \(\leq 5\) - low-quality). We name this variable qualcat
.
Here we can’t use logistic regression since we have perfect separation issue, so we decided to use LDA. Code and output are in Appendix F.
The error rate for LDA model with all predictors is 0.25. Confusion matrix is:
## pred
## high low
## high 142 38
## low 42 98
It is also interesting to compare error rates for regression models with this classification model. We can map regression responses into “high” and “low” categories (\(> 5\) - high, \(\leq 5\) - low).
The lowest error rate was obtained in ordinal logistic regression case (0.259375), it is slightly higher than with the LDA model. For other regression models error rate was above 0.41. Code and output are in Appendix G.
9 Conclusion and Summary
The analysis showed that high-quality wines tend to have lower volatile acidity level (Appendix A.1). It is not a surprise since volatile acidity indicates the amount of acetic acid in wine, which at too high levels can lead to an unpleasant, vinegar taste. Also, it showed that high-quality wines are stronger (in alcohol %) than low-quality wines (Appendix A.2).
By examining correlations and pairs plot, we observe that the quality
is positively related with alcohol
level (stronger wines are usually of higher quality), sulphates
, citric.acid
(which adds freshness and flavor to the wine), and negatively related with volatile.acidity
(which at too high levels can lead to an unpleasant, vinegar taste). We also found relationship of quality
with pH
and density
, but this is probably due to the causal relation of fixed.acidity
on pH
, and causal relation of alcohol
on density
. Surprisingly, it appears that residual.sugar
level has almost no relationship with the quality
.
The strongest correlation among variables is between fixed.acidity
and pH
(-0.68), between fixed.acidity
and density
(0.67), between fixed.acidity
and citric.acid
(0.67), between free.sulfur.dioxide
and total.sulfur.dioxide
(0.67).
The strongest correlation quality
has with alcohol
(0.48) and volatile.acidity
(-0.39).
At the modeling stage among regression models, ridge regression showed the lowest MSE on the test data. For classification, we decided to use linear discriminant analysis (LDA), since unlike logistic regression it doesn’t suffer from collinear predictors and perfect separation (remember, we divided quality by \(>5\) and \(\leq 5\)). The model based on LDA with all predictors showed 0.25 test error rate, which is not a bad result. We also tried to map predictions of the regression models to “low” and “high” quality categories to compare test error rates with LDA, but almost all models resulted above 0.41 test error rate, only ordinal logistic model showed about 0.259 test error rate which is slightly higher than with LDA.
One of the directions of future research might be in adding white wines data and performing the similar analysis to figure out how red and white wines are different.
10 References
[1] https://repositorium.sdum.uminho.pt/bitstream/1822/10029/1/wine5.pdf
[2] https://www.sciencedirect.com/science/article/pii/S0167923609001377
[3] https://archive.ics.uci.edu/ml/datasets/wine+quality
[4] https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009
11 Appendix
library(corrplot)
library(leaps)
library(glmnet)
library(MASS)
11.1 A
setwd("C:/Users/sakukadi/Documents/wineAnalysis/Dataset")
df = read.csv("winequality.csv")
dim(df)
## [1] 1597 12
# numerical summary of the data
summary(df)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## Min. : 4.60 Min. :0.120 Min. :0.000 Min. : 0.90 Min. :0.0120
## 1st Qu.: 7.10 1st Qu.:0.390 1st Qu.:0.090 1st Qu.: 1.90 1st Qu.:0.0700
## Median : 7.90 Median :0.520 Median :0.260 Median : 2.20 Median :0.0790
## Mean : 8.32 Mean :0.528 Mean :0.271 Mean : 2.54 Mean :0.0875
## 3rd Qu.: 9.20 3rd Qu.:0.640 3rd Qu.:0.420 3rd Qu.: 2.60 3rd Qu.:0.0900
## Max. :15.90 Max. :1.580 Max. :1.000 Max. :15.50 Max. :0.6110
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates
## Min. : 1.0 Min. : 6.0 Min. :0.990 Min. :2.74 Min. :0.330
## 1st Qu.: 7.0 1st Qu.: 22.0 1st Qu.:0.996 1st Qu.:3.21 1st Qu.:0.550
## Median :14.0 Median : 38.0 Median :0.997 Median :3.31 Median :0.620
## Mean :15.8 Mean : 46.4 Mean :0.997 Mean :3.31 Mean :0.658
## 3rd Qu.:21.0 3rd Qu.: 62.0 3rd Qu.:0.998 3rd Qu.:3.40 3rd Qu.:0.730
## Max. :72.0 Max. :289.0 Max. :1.004 Max. :4.01 Max. :2.000
## alcohol quality
## Min. : 8.4 Min. :3.00
## 1st Qu.: 9.5 1st Qu.:5.00
## Median :10.2 Median :6.00
## Mean :10.4 Mean :5.64
## 3rd Qu.:11.1 3rd Qu.:6.00
## Max. :14.9 Max. :8.00
table(df$quality)
##
## 3 4 5 6 7 8
## 10 53 679 638 199 18
11.2 A.1
aov1 = aov(volatile.acidity ~ factor(quality), data=df)
summary(aov1)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(quality) 5 8.2 1.642 60.7 <2e-16 ***
## Residuals 1591 43.0 0.027
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov1)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = volatile.acidity ~ factor(quality), data = df)
##
## $`factor(quality)`
## diff lwr upr p adj
## 4-3 -0.19054 -0.35227 -0.02881 0.0103
## 5-3 -0.30761 -0.45704 -0.15819 0.0000
## 6-3 -0.38702 -0.53652 -0.23752 0.0000
## 7-3 -0.48058 -0.63260 -0.32856 0.0000
## 8-3 -0.46117 -0.64618 -0.27615 0.0000
## 5-4 -0.11708 -0.18398 -0.05017 0.0000
## 6-4 -0.19648 -0.26354 -0.12942 0.0000
## 7-4 -0.29004 -0.36255 -0.21753 0.0000
## 8-4 -0.27063 -0.39860 -0.14266 0.0000
## 6-5 -0.07940 -0.10527 -0.05354 0.0000
## 7-5 -0.17297 -0.21078 -0.13515 0.0000
## 8-5 -0.15355 -0.26557 -0.04153 0.0013
## 7-6 -0.09356 -0.13165 -0.05548 0.0000
## 8-6 -0.07415 -0.18627 0.03796 0.4105
## 8-7 0.01941 -0.09605 0.13487 0.9969
11.3 A.2
aov2 = aov(alcohol ~ factor(quality), data=df)
summary(aov2)
## Df Sum Sq Mean Sq F value Pr(>F)
## factor(quality) 5 483 96.5 115 <2e-16 ***
## Residuals 1591 1331 0.8
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
TukeyHSD(aov2)
## Tukey multiple comparisons of means
## 95% family-wise confidence level
##
## Fit: aov(formula = alcohol ~ factor(quality), data = df)
##
## $`factor(quality)`
## diff lwr upr p adj
## 4-3 0.31009 -0.589478 1.209667 0.9233
## 5-3 -0.05412 -0.885265 0.777032 1.0000
## 6-3 0.67452 -0.157017 1.506055 0.1887
## 7-3 1.51091 0.665341 2.356485 0.0000
## 8-3 2.13944 1.110370 3.168518 0.0000
## 5-4 -0.36421 -0.736334 0.007912 0.0591
## 6-4 0.36442 -0.008563 0.737413 0.0599
## 7-4 1.20082 0.797508 1.604129 0.0000
## 8-4 1.82935 1.117549 2.541151 0.0000
## 6-5 0.72864 0.584772 0.872499 0.0000
## 7-5 1.56503 1.354705 1.775354 0.0000
## 8-5 2.19356 1.570473 2.816648 0.0000
## 7-6 0.83639 0.624543 1.048244 0.0000
## 8-6 1.46493 0.841321 2.088529 0.0000
## 8-7 0.62853 -0.013669 1.270732 0.0592
11.4 B
# split to train and test, 80%/20%
set.seed(344421)
train_ind = sample.int(n=nrow(df), size=floor(0.8*nrow(df)), replace=F)
train = df[train_ind,]
test = df[-train_ind,]
# best subset selection
best.subset.regr = regsubsets(quality ~. , data=train)
best.subset.summary = summary(best.subset.regr)
best.subset.summary
## Subset selection object
## Call: regsubsets.formula(quality ~ ., data = train)
## 11 Variables (and intercept)
## Forced in Forced out
## fixed.acidity FALSE FALSE
## volatile.acidity FALSE FALSE
## citric.acid FALSE FALSE
## residual.sugar FALSE FALSE
## chlorides FALSE FALSE
## free.sulfur.dioxide FALSE FALSE
## total.sulfur.dioxide FALSE FALSE
## density FALSE FALSE
## pH FALSE FALSE
## sulphates FALSE FALSE
## alcohol FALSE FALSE
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " "*" " " " " " "
## 3 ( 1 ) " " "*" " " " " " "
## 4 ( 1 ) " " "*" " " " " "*"
## 5 ( 1 ) " " "*" " " " " "*"
## 6 ( 1 ) " " "*" " " " " "*"
## 7 ( 1 ) " " "*" " " " " "*"
## 8 ( 1 ) " " "*" " " "*" "*"
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 ( 1 ) " " " " " " " " " " "*"
## 2 ( 1 ) " " " " " " " " " " "*"
## 3 ( 1 ) " " " " " " " " "*" "*"
## 4 ( 1 ) " " " " " " " " "*" "*"
## 5 ( 1 ) " " "*" " " " " "*" "*"
## 6 ( 1 ) " " "*" " " "*" "*" "*"
## 7 ( 1 ) "*" "*" " " "*" "*" "*"
## 8 ( 1 ) "*" "*" " " "*" "*" "*"
par(mfrow=c(1,2))
plot(best.subset.summary$cp, xlab="Number of Variables", ylab="Cp", type="l")
plot(best.subset.summary$bic, xlab="Number of Variables", ylab="BIC", type="l")
# best model based on Cp
which.min(best.subset.summary$cp)
## [1] 7
coef(best.subset.regr, which.min(best.subset.summary$cp))
## (Intercept) volatile.acidity chlorides free.sulfur.dioxide
## 4.252836 -0.987651 -2.212041 0.004899
## total.sulfur.dioxide pH sulphates alcohol
## -0.003381 -0.464992 0.780754 0.308388
fit.cp = lm(quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
total.sulfur.dioxide + pH + sulphates + alcohol, data=train)
summary(fit.cp)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + free.sulfur.dioxide +
## total.sulfur.dioxide + pH + sulphates + alcohol, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.6546 -0.3870 -0.0457 0.4595 1.9619
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.252836 0.456424 9.32 < 2e-16 ***
## volatile.acidity -0.987651 0.112866 -8.75 < 2e-16 ***
## chlorides -2.212041 0.429634 -5.15 3.0e-07 ***
## free.sulfur.dioxide 0.004899 0.002359 2.08 0.03804 *
## total.sulfur.dioxide -0.003381 0.000745 -4.54 6.2e-06 ***
## pH -0.464992 0.133396 -3.49 0.00051 ***
## sulphates 0.780754 0.124004 6.30 4.2e-10 ***
## alcohol 0.308388 0.019032 16.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.651 on 1269 degrees of freedom
## Multiple R-squared: 0.363, Adjusted R-squared: 0.359
## F-statistic: 103 on 7 and 1269 DF, p-value: <2e-16
cp.pred = predict(fit.cp, test)
# test MSE for Cp model
mean((test$quality-cp.pred)^2)
## [1] 0.4086
# best model based on BIC
which.min(best.subset.summary$bic)
## [1] 6
coef(best.subset.regr, which.min(best.subset.summary$bic))
## (Intercept) volatile.acidity chlorides total.sulfur.dioxide
## 4.124048 -1.011060 -2.191147 -0.002351
## pH sulphates alcohol
## -0.416662 0.789503 0.308665
fit.bic = lm(quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
pH + sulphates + alcohol, data=train)
summary(fit.bic)
##
## Call:
## lm(formula = quality ~ volatile.acidity + chlorides + total.sulfur.dioxide +
## pH + sulphates + alcohol, data = train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5750 -0.3703 -0.0471 0.4646 1.9430
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.124048 0.452780 9.11 < 2e-16 ***
## volatile.acidity -1.011060 0.112448 -8.99 < 2e-16 ***
## chlorides -2.191147 0.430076 -5.09 4.0e-07 ***
## total.sulfur.dioxide -0.002351 0.000557 -4.22 2.6e-05 ***
## pH -0.416662 0.131522 -3.17 0.0016 **
## sulphates 0.789503 0.124094 6.36 2.8e-10 ***
## alcohol 0.308665 0.019056 16.20 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.652 on 1270 degrees of freedom
## Multiple R-squared: 0.36, Adjusted R-squared: 0.357
## F-statistic: 119 on 6 and 1270 DF, p-value: <2e-16
bic.pred = predict(fit.bic, test)
# test MSE
mean((test$quality-bic.pred)^2)
## [1] 0.4103
11.5 C
# prepare the data for modeling
x.train = model.matrix(quality ~. , train)[,-1]
y.train = train$quality
x.test = model.matrix(quality ~. , test)[,-1]
y.test = test$quality
# performing ridge regression
set.seed(25135)
cv.ridge = cv.glmnet(x.train, y.train, alpha=0)
par(mfrow=c(1,1))
plot(cv.ridge)
bestlam = cv.ridge$lambda.min
bestlam
## [1] 0.04388
ridge.best = glmnet(x.train, y.train, alpha=0, lambda=bestlam)
coef(ridge.best)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 32.752820
## fixed.acidity 0.028953
## volatile.acidity -0.970898
## citric.acid -0.039837
## residual.sugar 0.025009
## chlorides -2.027835
## free.sulfur.dioxide 0.003452
## total.sulfur.dioxide -0.003061
## density -29.130570
## pH -0.272972
## sulphates 0.788221
## alcohol 0.268435
ridge.pred = predict(ridge.best, x.test)
# test MSE for ridge model
mean((y.test-ridge.pred)^2)
## [1] 0.4081
11.6 D
# performing lasso regression
set.seed(56847)
cv.lasso = cv.glmnet(x.train, y.train, alpha=1)
par(mfrow=c(1,1))
plot(cv.lasso)
bestlam = cv.lasso$lambda.min
bestlam
## [1] 0.009675
lasso.best = glmnet(x.train, y.train, alpha=1, lambda=bestlam)
coef(lasso.best)
## 12 x 1 sparse Matrix of class "dgCMatrix"
## s0
## (Intercept) 3.925186
## fixed.acidity .
## volatile.acidity -1.001092
## citric.acid .
## residual.sugar 0.007181
## chlorides -1.877945
## free.sulfur.dioxide 0.001853
## total.sulfur.dioxide -0.002542
## density .
## pH -0.341526
## sulphates 0.712932
## alcohol 0.301817
lasso.pred = predict(lasso.best, x.test)
# test MSE for lasso model
mean((y.test-lasso.pred)^2)
## [1] 0.4103
11.7 E
# ordinal logistic regression
polr.fit = polr(factor(quality) ~., data=train, Hess=T, method="logistic")
summary(polr.fit)
## Call:
## polr(formula = factor(quality) ~ ., data = train, Hess = T, method = "logistic")
##
## Coefficients:
## Value Std. Error t value
## fixed.acidity 0.1259 0.05754 2.19
## volatile.acidity -3.2131 0.44435 -7.23
## citric.acid -0.6095 0.51869 -1.18
## residual.sugar 0.1047 0.04083 2.57
## chlorides -6.0527 1.46154 -4.14
## free.sulfur.dioxide 0.0132 0.00747 1.77
## total.sulfur.dioxide -0.0107 0.00255 -4.18
## density -82.9997 1.09813 -75.58
## pH -0.7265 0.56131 -1.29
## sulphates 2.5590 0.39688 6.45
## alcohol 0.8739 0.06748 12.95
##
## Intercepts:
## Value Std. Error t value
## 3|4 -81.669 1.126 -72.546
## 4|5 -79.720 1.124 -70.948
## 5|6 -76.027 1.128 -67.401
## 6|7 -73.239 1.142 -64.136
## 7|8 -70.060 1.179 -59.443
##
## Residual Deviance: 2465.76
## AIC: 2497.76
polr.pred = predict(polr.fit, x.test)
# test MSE
mean((y.test-as.numeric(as.character(polr.pred)))^2)
## [1] 0.4781
# confusion matrix
table(y.test, polr.pred)
## polr.pred
## y.test 3 4 5 6 7 8
## 3 0 0 2 0 0 0
## 4 0 0 6 3 0 0
## 5 0 1 89 39 0 0
## 6 0 0 40 93 10 0
## 7 0 0 1 22 9 0
## 8 0 0 0 2 3 0
11.8 F
# we divide simply: > 5 - high, <=5 - low
df$qualcat = factor(ifelse(df$quality > 5,"high","low"))
table(df$qualcat)
##
## high low
## 855 742
train = df[train_ind,]
test = df[-train_ind,]
# performing LDA
lda.fit = lda(qualcat ~ .-quality, data=train)
lda.fit
## Call:
## lda(qualcat ~ . - quality, data = train)
##
## Prior probabilities of groups:
## high low
## 0.5286 0.4714
##
## Group means:
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## high 8.530 0.4729 0.304 2.593 0.08242
## low 8.171 0.5892 0.240 2.558 0.09546
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## high 15.22 39.93 0.9966 3.309 0.6901 10.824
## low 16.67 55.55 0.9971 3.306 0.6257 9.887
##
## Coefficients of linear discriminants:
## LD1
## fixed.acidity -0.09139
## volatile.acidity 2.44733
## citric.acid 0.62783
## residual.sugar -0.04719
## chlorides 3.93835
## free.sulfur.dioxide -0.01857
## total.sulfur.dioxide 0.01335
## density 18.96496
## pH 0.23089
## sulphates -1.82943
## alcohol -0.71062
pred = predict(lda.fit, test)$class
# confusion matrix
table(test$qualcat, pred)
## pred
## high low
## high 142 38
## low 42 98
# error rate
mean(test$qualcat != pred)
## [1] 0.25
11.9 G
# best subset Cp model
pred = ifelse(cp.pred > 5,"high","low")
mean(test$qualcat != pred)
## [1] 0.4188
# best subset BIC model
pred = ifelse(bic.pred > 5,"high","low")
mean(test$qualcat != pred)
## [1] 0.4219
# ridge model
pred = ifelse(ridge.pred > 5,"high","low")
mean(test$qualcat != pred)
## [1] 0.425
# lasso model
pred = ifelse(lasso.pred > 5,"high","low")
mean(test$qualcat != pred)
## [1] 0.425
# ordinal logistic regression model
pred = ifelse(as.numeric(as.character(polr.pred)) > 5,"high","low")
mean(test$qualcat != pred)
## [1] 0.2594
Variable descriptions were taken from: https://www.kaggle.com/uciml/red-wine-quality-cortez-et-al-2009↩