The winequality.csv data set includes data on both red and white wines; for this analysis, you’ll only be focusing on the red wines. The response variable of interest is quality. 1) Subset the data to only include observations on red wines. Describe the distribution of the response. Based on the nature of the response variable, what type of regression model would you suggest starting with (and why)?
library(ggplot2)
library(MASS)
wine <- read.csv('wine.csv', stringsAsFactors = TRUE)
red_wine <- subset(wine, type == "red")
red_wine <- subset(red_wine, select = -type)
attach(red_wine)
str(red_wine)
## 'data.frame': 1599 obs. of 12 variables:
## $ fixed.acidity : num 7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
## $ volatile.acidity : num 0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
## $ citric.acid : num 0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
## $ residual.sugar : num 1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
## $ chlorides : num 0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
## $ free.sulfur.dioxide : num 11 25 15 17 11 13 15 15 9 17 ...
## $ total.sulfur.dioxide: num 34 67 54 60 34 40 59 21 18 102 ...
## $ density : num 0.998 0.997 0.997 0.998 0.998 ...
## $ pH : num 3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
## $ sulphates : num 0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
## $ alcohol : num 9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
## $ quality : int 5 5 5 6 5 5 5 7 7 5 ...
head(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1 7.4 0.70 0.00 1.9 0.076
## 2 7.8 0.88 0.00 2.6 0.098
## 3 7.8 0.76 0.04 2.3 0.092
## 4 11.2 0.28 0.56 1.9 0.075
## 5 7.4 0.70 0.00 1.9 0.076
## 6 7.4 0.66 0.00 1.8 0.075
## free.sulfur.dioxide total.sulfur.dioxide density pH sulphates alcohol
## 1 11 34 0.9978 3.51 0.56 9.4
## 2 25 67 0.9968 3.20 0.68 9.8
## 3 15 54 0.9970 3.26 0.65 9.8
## 4 17 60 0.9980 3.16 0.58 9.8
## 5 11 34 0.9978 3.51 0.56 9.4
## 6 13 40 0.9978 3.51 0.56 9.4
## quality
## 1 5
## 2 5
## 3 5
## 4 6
## 5 5
## 6 5
hist(red_wine$quality)
The quality ratings ranges from 3 to 8 with most wines having a quality rating of 5 or 6. The distribution is not symmetric. Given the nature of quality response is ordinal. It is recommended to start with ordinal logistic regression.
summary(red_wine)
## fixed.acidity volatile.acidity citric.acid residual.sugar
## Min. : 4.60 Min. :0.1200 Min. :0.000 Min. : 0.900
## 1st Qu.: 7.10 1st Qu.:0.3900 1st Qu.:0.090 1st Qu.: 1.900
## Median : 7.90 Median :0.5200 Median :0.260 Median : 2.200
## Mean : 8.32 Mean :0.5278 Mean :0.271 Mean : 2.539
## 3rd Qu.: 9.20 3rd Qu.:0.6400 3rd Qu.:0.420 3rd Qu.: 2.600
## Max. :15.90 Max. :1.5800 Max. :1.000 Max. :15.500
## chlorides free.sulfur.dioxide total.sulfur.dioxide density
## Min. :0.01200 Min. : 1.00 Min. : 6.00 Min. :0.9901
## 1st Qu.:0.07000 1st Qu.: 7.00 1st Qu.: 22.00 1st Qu.:0.9956
## Median :0.07900 Median :14.00 Median : 38.00 Median :0.9968
## Mean :0.08747 Mean :15.87 Mean : 46.47 Mean :0.9967
## 3rd Qu.:0.09000 3rd Qu.:21.00 3rd Qu.: 62.00 3rd Qu.:0.9978
## Max. :0.61100 Max. :72.00 Max. :289.00 Max. :1.0037
## pH sulphates alcohol quality
## Min. :2.740 Min. :0.3300 Min. : 8.40 Min. :3.000
## 1st Qu.:3.210 1st Qu.:0.5500 1st Qu.: 9.50 1st Qu.:5.000
## Median :3.310 Median :0.6200 Median :10.20 Median :6.000
## Mean :3.311 Mean :0.6581 Mean :10.42 Mean :5.636
## 3rd Qu.:3.400 3rd Qu.:0.7300 3rd Qu.:11.10 3rd Qu.:6.000
## Max. :4.010 Max. :2.0000 Max. :14.90 Max. :8.000
numeric_columns <- sapply(red_wine, is.numeric)
cor_matrix <- cor(red_wine[, numeric_columns])
corrplot::corrplot(cor_matrix, method = "color")
Alcohol has positive correlation of 0.6 with quality. This suggests that red wines with higher alcohol content tend to have higher quality ratings.
Volatile acidity has a negative correlation of 0.6 with quality. This suggests that red wines with higher volatile acidity tend to have lower quality ratings.
Sulphates has a moderate positive correlation of 0.4 with quality. This suggests that red wines with higher sulphate content might have highe quality ratings.
Density appears to have a slight negative correlation of 0.2 with quality. This suggests that red wines with lower density might have higher quality ratings.
Citric acid has a moderate positive of 0.4 with quality. this suggests that red wines with higher citric acid tend to have higher quality ratings.
red_wine$quality <- ifelse(red_wine$quality >= 7, 1, 0)
str(red_wine$quality)
## num [1:1599] 0 0 0 0 0 0 0 1 1 0 ...
red_wine$quality <- as.factor(red_wine$quality)
model <- glm(quality ~ ., data = red_wine, family = binomial)
summary(model)
##
## Call:
## glm(formula = quality ~ ., family = binomial, data = red_wine)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.428e+02 1.081e+02 2.247 0.024660 *
## fixed.acidity 2.750e-01 1.253e-01 2.195 0.028183 *
## volatile.acidity -2.581e+00 7.843e-01 -3.291 0.000999 ***
## citric.acid 5.678e-01 8.385e-01 0.677 0.498313
## residual.sugar 2.395e-01 7.373e-02 3.248 0.001163 **
## chlorides -8.816e+00 3.365e+00 -2.620 0.008788 **
## free.sulfur.dioxide 1.082e-02 1.223e-02 0.884 0.376469
## total.sulfur.dioxide -1.653e-02 4.894e-03 -3.378 0.000731 ***
## density -2.578e+02 1.104e+02 -2.335 0.019536 *
## pH 2.242e-01 9.984e-01 0.225 0.822327
## sulphates 3.750e+00 5.416e-01 6.924 4.39e-12 ***
## alcohol 7.533e-01 1.316e-01 5.724 1.04e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1269.92 on 1598 degrees of freedom
## Residual deviance: 870.86 on 1587 degrees of freedom
## AIC: 894.86
##
## Number of Fisher Scoring iterations: 6
probabilities <- predict(model, type = "response")
predictions <- ifelse(probabilities > 0.5, 1, 0)
table(predictions, quality)
## quality
## predictions 3 4 5 6 7 8
## 0 10 52 679 598 132 10
## 1 0 1 2 40 67 8
predicted_prob <- probabilities
rms::val.prob(predicted_prob, y = red_wine$quality=='1')
## Dxy C (ROC) R2 D D:Chi-sq
## 7.643767e-01 8.821884e-01 4.029905e-01 2.489409e-01 3.990564e+02
## D:p U U:Chi-sq U:p Q
## NA -1.250782e-03 -1.250555e-12 1.000000e+00 2.501916e-01
## Brier Intercept Slope Emax E90
## 8.422009e-02 2.021147e-10 1.000000e+00 9.259981e-02 1.636751e-02
## Eavg S:z S:p
## 8.536283e-03 3.265852e-01 7.439817e-01
temp <- data.frame("y" = red_wine$quality, "p" = predicted_prob)
cal <- glm(y ~ splines::ns(p, df = 6), data = temp,
family = binomial(link = "logit"))
xx <- seq(min(predicted_prob), max(predicted_prob), length = 200)
yy <- predict(cal, newdata = data.frame(p = xx), type = "response")
xx <- xx[!is.na(yy)]
yy <- yy[!is.na(yy)]
plot(xx, yy, xlim = c(0, 1), ylim = c(0, 1), type = "l", col = 2,
xlab = "Predicted probability",
ylab = "Calibrated probability")
abline(0, 1, lty = 2)
rug(predicted_prob, col = adjustcolor(1, alpha.f = 0.2))
Based on the confusion matrix, the accuracy of the model is (1339 + 75)
/ (1339 + 75 + 142 + 43) = 0.884.
The calibration curve appears to be reasonably well-calibrated. However, there seems to be some deviation from perfect calibration, particularly for higher predicted probabilities. This suggests the model might be slightly overestimating the probability of of a wine having a quality score of 7 or higher.
coef_alcohol <- coef(model)["alcohol"]
odds_ratio <- exp(coef_alcohol)
odds_ratio
## alcohol
## 2.124081
newdata <- data.frame(alcohol = seq(min(red_wine$alcohol), max(red_wine$alcohol), length.out = 100))
for (var in setdiff(names(red_wine), c("alcohol", "quality_binary"))) {
newdata[[var]] <- mean(red_wine[[var]], na.rm = TRUE)
}
## Warning in mean.default(red_wine[[var]], na.rm = TRUE): argument is not numeric
## or logical: returning NA
newdata$prob <- predict(model, newdata = newdata, type = "response")
ggplot(newdata, aes(x = alcohol, y = prob)) +
geom_line() +
labs(x = "Alcohol", y = "Probability of Quality >= 7", title = "Effect of Alcohol on Wine Quality")
The predicted probabilities of having a wine quality rating of 7 or higher tend to increase with increasing alcohol content.
The relationship between alcohol content and the predicted probabilities of wine quality being 7 or higher is not linear and looking like sigmoid. This is possible because the relationship is modeled by logistic regression which could capture more complex, non-linear relationship.
Using a binomial logistic regression model is not recommended in this case because the quality ratings ranges from 3 to 8 and is not a binary response. By aggregating into 2 groups of response, the model would not be able to capture the information from the other ratings. A multinomimal logistic regression would be able to better handle this specific dataset.
wine2 <- read.csv('wine.csv')
red_wine2 <- subset(wine2, type == "red")
red_wine2 <- subset(red_wine2, select = -type)
red_wine2$quality <- as.factor(red_wine2$quality)
model_ord <- polr(quality ~ ., data = red_wine2)
summary(model_ord)
##
## Re-fitting to get Hessian
## Call:
## polr(formula = quality ~ ., data = red_wine2)
##
## Coefficients:
## Value Std. Error t value
## fixed.acidity 0.10240 0.051209 2.000
## volatile.acidity -3.41794 0.400103 -8.543
## citric.acid -0.80494 0.462371 -1.741
## residual.sugar 0.07617 0.038210 1.993
## chlorides -5.17121 1.354371 -3.818
## free.sulfur.dioxide 0.01392 0.006767 2.057
## total.sulfur.dioxide -0.01119 0.002360 -4.744
## density -48.92546 0.974499 -50.206
## pH -0.98472 0.496900 -1.982
## sulphates 2.86724 0.358017 8.009
## alcohol 0.85611 0.059355 14.424
##
## Intercepts:
## Value Std. Error t value
## 3|4 -48.8787 0.9979 -48.9791
## 4|5 -46.9597 0.9959 -47.1537
## 5|6 -43.2452 0.9988 -43.2964
## 6|7 -40.3898 1.0111 -39.9450
## 7|8 -37.3837 1.0409 -35.9135
##
## Residual Deviance: 3074.928
## AIC: 3106.928
for (var in names(red_wine2)[-which(names(red_wine2) == "quality")]) {
newdata <- data.frame(matrix(ncol = length(names(red_wine2)), nrow = 100))
names(newdata) <- names(red_wine2)
newdata[[var]] <- seq(min(red_wine2[[var]], na.rm = TRUE), max(red_wine2[[var]], na.rm = TRUE), length.out = 100)
for (other_var in names(newdata)[-which(names(newdata) %in% c(var, "quality"))]) {
newdata[[other_var]] <- mean(red_wine2[[other_var]], na.rm = TRUE)
}
predicted_probs <- predict(model_ord, newdata = newdata, type = "probs")
predicted_probs_quality_ge_7 <- rowSums(predicted_probs[, which(levels(red_wine2$quality) >= 7)])
plot(newdata[[var]], predicted_probs_quality_ge_7, type = "l",
xlab = var, ylab = "Probability of Quality >= 7",
main = paste("Effect of", var, "on Probability of Quality >= 7"), ylim = c(0, 1))
}
We got error trying to use orm from rms package, so we decided to use
polr from Mass package from lecture 5 R file.
Top three predictors solely in terms of their effect on the predicted probability that quality >= 7 are sulphate content, alcohol content and volatile acidity.
Based on your fitted model from 6), provide estimates for the following quantities: • Pr(quality==7|x−0) • Pr(quality>=7|x0) • Pr(quality==9|x0) • Pr(quality>=9|x0)
# Define the new observation
x0 <- data.frame(
fixed.acidity = 7.3000,
volatile.acidity = 0.6500,
citric.acid = 0.0000,
residual.sugar = 1.2000,
chlorides = 0.0650,
free.sulfur.dioxide = 15.0000,
total.sulfur.dioxide = 21.0000,
density = 0.9946,
pH = 3.3900,
sulphates = 0.4700,
alcohol = 10.0000
)
# Predict the probabilities
probabilities <- predict(model_ord, x0, type = "probs")
# Print the probabilities
print(probabilities)
## 3 4 5 6 7 8
## 0.006414165 0.035719884 0.601381876 0.325598623 0.029310854 0.001574598
• Pr(quality==7|x0) = 0.0293 • Pr(quality>=7|x0) = 0.0309 • Pr(quality==9|x0) = 0 • Pr(quality>=9|x0) = 0
The highest rating for quality is 8 so there’s no probabilities or quality = 9 or higher.
Using the model trained on red wine to predict white wine quality might not be feasible because of the potential differences in data characteristics. In practice, it is recommended to use a model trained on white wine data to predict white wine quality ratings or use a combined model with wine type as a preditor.