Homework 3

The winequality.csv data set includes data on both red and white wines; for this analysis, you’ll only be focusing on the red wines. The response variable of interest is quality. 1) Subset the data to only include observations on red wines. Describe the distribution of the response. Based on the nature of the response variable, what type of regression model would you suggest starting with (and why)?

library(ggplot2)
library(MASS)

wine <- read.csv('wine.csv', stringsAsFactors = TRUE)

red_wine <- subset(wine, type == "red")
red_wine <- subset(red_wine, select = -type)


attach(red_wine)
str(red_wine)

## 'data.frame':    1599 obs. of  12 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...

head(red_wine)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality
## 1       5
## 2       5
## 3       5
## 4       6
## 5       5
## 6       5

hist(red_wine$quality)

The quality ratings ranges from 3 to 8 with most wines having a quality rating of 5 or 6. The distribution is not symmetric. Given the nature of quality response is ordinal. It is recommended to start with ordinal logistic regression.

Explore the data using summary statistics and graphics. Do any of the variables appear to have an association with the overal quality score? If so, which? Describe the nature of these associations.

summary(red_wine)

##  fixed.acidity   volatile.acidity  citric.acid    residual.sugar  
##  Min.   : 4.60   Min.   :0.1200   Min.   :0.000   Min.   : 0.900  
##  1st Qu.: 7.10   1st Qu.:0.3900   1st Qu.:0.090   1st Qu.: 1.900  
##  Median : 7.90   Median :0.5200   Median :0.260   Median : 2.200  
##  Mean   : 8.32   Mean   :0.5278   Mean   :0.271   Mean   : 2.539  
##  3rd Qu.: 9.20   3rd Qu.:0.6400   3rd Qu.:0.420   3rd Qu.: 2.600  
##  Max.   :15.90   Max.   :1.5800   Max.   :1.000   Max.   :15.500  
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide    density      
##  Min.   :0.01200   Min.   : 1.00       Min.   :  6.00       Min.   :0.9901  
##  1st Qu.:0.07000   1st Qu.: 7.00       1st Qu.: 22.00       1st Qu.:0.9956  
##  Median :0.07900   Median :14.00       Median : 38.00       Median :0.9968  
##  Mean   :0.08747   Mean   :15.87       Mean   : 46.47       Mean   :0.9967  
##  3rd Qu.:0.09000   3rd Qu.:21.00       3rd Qu.: 62.00       3rd Qu.:0.9978  
##  Max.   :0.61100   Max.   :72.00       Max.   :289.00       Max.   :1.0037  
##        pH          sulphates         alcohol         quality     
##  Min.   :2.740   Min.   :0.3300   Min.   : 8.40   Min.   :3.000  
##  1st Qu.:3.210   1st Qu.:0.5500   1st Qu.: 9.50   1st Qu.:5.000  
##  Median :3.310   Median :0.6200   Median :10.20   Median :6.000  
##  Mean   :3.311   Mean   :0.6581   Mean   :10.42   Mean   :5.636  
##  3rd Qu.:3.400   3rd Qu.:0.7300   3rd Qu.:11.10   3rd Qu.:6.000  
##  Max.   :4.010   Max.   :2.0000   Max.   :14.90   Max.   :8.000

numeric_columns <- sapply(red_wine, is.numeric)

cor_matrix <- cor(red_wine[, numeric_columns])

corrplot::corrplot(cor_matrix, method = "color")

Alcohol has positive correlation of 0.6 with quality. This suggests that red wines with higher alcohol content tend to have higher quality ratings.

Volatile acidity has a negative correlation of 0.6 with quality. This suggests that red wines with higher volatile acidity tend to have lower quality ratings.

Sulphates has a moderate positive correlation of 0.4 with quality. This suggests that red wines with higher sulphate content might have highe quality ratings.

Density appears to have a slight negative correlation of 0.2 with quality. This suggests that red wines with lower density might have higher quality ratings.

Citric acid has a moderate positive of 0.4 with quality. this suggests that red wines with higher citric acid tend to have higher quality ratings.

Construct a binary response according to the following rule (1, quality >= 7 0, quality < 7) Fit a logistic regression to the data using all possible main effects (i.e., include each variable, but no interaction effects). Assess the performance of this model. Does the model seem well-calibrated? Discuss and provide a plot of a calibration curve.

red_wine$quality <- ifelse(red_wine$quality >= 7, 1, 0)
str(red_wine$quality)

##  num [1:1599] 0 0 0 0 0 0 0 1 1 0 ...

red_wine$quality <- as.factor(red_wine$quality)


model <- glm(quality ~ ., data = red_wine, family = binomial)

summary(model)

## 
## Call:
## glm(formula = quality ~ ., family = binomial, data = red_wine)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.428e+02  1.081e+02   2.247 0.024660 *  
## fixed.acidity         2.750e-01  1.253e-01   2.195 0.028183 *  
## volatile.acidity     -2.581e+00  7.843e-01  -3.291 0.000999 ***
## citric.acid           5.678e-01  8.385e-01   0.677 0.498313    
## residual.sugar        2.395e-01  7.373e-02   3.248 0.001163 ** 
## chlorides            -8.816e+00  3.365e+00  -2.620 0.008788 ** 
## free.sulfur.dioxide   1.082e-02  1.223e-02   0.884 0.376469    
## total.sulfur.dioxide -1.653e-02  4.894e-03  -3.378 0.000731 ***
## density              -2.578e+02  1.104e+02  -2.335 0.019536 *  
## pH                    2.242e-01  9.984e-01   0.225 0.822327    
## sulphates             3.750e+00  5.416e-01   6.924 4.39e-12 ***
## alcohol               7.533e-01  1.316e-01   5.724 1.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  870.86  on 1587  degrees of freedom
## AIC: 894.86
## 
## Number of Fisher Scoring iterations: 6

probabilities <- predict(model, type = "response")

predictions <- ifelse(probabilities > 0.5, 1, 0)

table(predictions, quality)

##            quality
## predictions   3   4   5   6   7   8
##           0  10  52 679 598 132  10
##           1   0   1   2  40  67   8

predicted_prob <- probabilities


rms::val.prob(predicted_prob, y = red_wine$quality=='1')

##           Dxy       C (ROC)            R2             D      D:Chi-sq 
##  7.643767e-01  8.821884e-01  4.029905e-01  2.489409e-01  3.990564e+02 
##           D:p             U      U:Chi-sq           U:p             Q 
##            NA -1.250782e-03 -1.250555e-12  1.000000e+00  2.501916e-01 
##         Brier     Intercept         Slope          Emax           E90 
##  8.422009e-02  2.021147e-10  1.000000e+00  9.259981e-02  1.636751e-02 
##          Eavg           S:z           S:p 
##  8.536283e-03  3.265852e-01  7.439817e-01

temp <- data.frame("y" = red_wine$quality, "p" = predicted_prob)
cal <- glm(y ~ splines::ns(p, df = 6), data = temp,
           family = binomial(link = "logit"))


xx <- seq(min(predicted_prob), max(predicted_prob), length = 200)
yy <- predict(cal, newdata = data.frame(p = xx), type = "response")
xx <- xx[!is.na(yy)]
yy <- yy[!is.na(yy)]
plot(xx, yy, xlim = c(0, 1), ylim = c(0, 1), type = "l", col = 2,
     xlab = "Predicted probability",
     ylab = "Calibrated probability")
abline(0, 1, lty = 2)
rug(predicted_prob, col = adjustcolor(1, alpha.f = 0.2))

Based on the confusion matrix, the accuracy of the model is (1339 + 75) / (1339 + 75 + 142 + 43) = 0.884.

The calibration curve appears to be reasonably well-calibrated. However, there seems to be some deviation from perfect calibration, particularly for higher predicted probabilities. This suggests the model might be slightly overestimating the probability of of a wine having a quality score of 7 or higher.

Interpret the effect of the predictor alcohol on the odds that quality >= 7. Construct an effect plot visualizing the effect of alcohol on the probability that quality >= 7 and describe the relationship. Does this plot look linear or nonlinear? If nonlinear, discuss how this is possible.

coef_alcohol <- coef(model)["alcohol"]

odds_ratio <- exp(coef_alcohol)

odds_ratio

##  alcohol 
## 2.124081

newdata <- data.frame(alcohol = seq(min(red_wine$alcohol), max(red_wine$alcohol), length.out = 100))

for (var in setdiff(names(red_wine), c("alcohol", "quality_binary"))) {
  newdata[[var]] <- mean(red_wine[[var]], na.rm = TRUE)
}

## Warning in mean.default(red_wine[[var]], na.rm = TRUE): argument is not numeric
## or logical: returning NA

newdata$prob <- predict(model, newdata = newdata, type = "response")

ggplot(newdata, aes(x = alcohol, y = prob)) +
  geom_line() +
  labs(x = "Alcohol", y = "Probability of Quality >= 7", title = "Effect of Alcohol on Wine Quality")

The predicted probabilities of having a wine quality rating of 7 or higher tend to increase with increasing alcohol content.

The relationship between alcohol content and the predicted probabilities of wine quality being 7 or higher is not linear and looking like sigmoid. This is possible because the relationship is modeled by logistic regression which could capture more complex, non-linear relationship.

Discuss reasons why the modeling approach used in 3) is ill-advised for modeling these data.

Using a binomial logistic regression model is not recommended in this case because the quality ratings ranges from 3 to 8 and is not a binary response. By aggregating into 2 groups of response, the model would not be able to capture the information from the other ratings. A multinomimal logistic regression would be able to better handle this specific dataset.

Fit an ordinal regression model to the data using the original response (i.e., quality) using the orm() function from R package rms. Construct an effect plot for each predictor showing the effect on the predicted probability that quality >= 7. From these, try to determine the top three predictors solely in terms of their effect on the predicted probability that quality >= 7.

wine2 <- read.csv('wine.csv')

red_wine2 <- subset(wine2, type == "red")
red_wine2 <- subset(red_wine2, select = -type)

red_wine2$quality <- as.factor(red_wine2$quality)

model_ord <- polr(quality ~ ., data = red_wine2)

summary(model_ord)

## 
## Re-fitting to get Hessian

## Call:
## polr(formula = quality ~ ., data = red_wine2)
## 
## Coefficients:
##                          Value Std. Error t value
## fixed.acidity          0.10240   0.051209   2.000
## volatile.acidity      -3.41794   0.400103  -8.543
## citric.acid           -0.80494   0.462371  -1.741
## residual.sugar         0.07617   0.038210   1.993
## chlorides             -5.17121   1.354371  -3.818
## free.sulfur.dioxide    0.01392   0.006767   2.057
## total.sulfur.dioxide  -0.01119   0.002360  -4.744
## density              -48.92546   0.974499 -50.206
## pH                    -0.98472   0.496900  -1.982
## sulphates              2.86724   0.358017   8.009
## alcohol                0.85611   0.059355  14.424
## 
## Intercepts:
##     Value    Std. Error t value 
## 3|4 -48.8787   0.9979   -48.9791
## 4|5 -46.9597   0.9959   -47.1537
## 5|6 -43.2452   0.9988   -43.2964
## 6|7 -40.3898   1.0111   -39.9450
## 7|8 -37.3837   1.0409   -35.9135
## 
## Residual Deviance: 3074.928 
## AIC: 3106.928

for (var in names(red_wine2)[-which(names(red_wine2) == "quality")]) {
 
  newdata <- data.frame(matrix(ncol = length(names(red_wine2)), nrow = 100))
  names(newdata) <- names(red_wine2)
  newdata[[var]] <- seq(min(red_wine2[[var]], na.rm = TRUE), max(red_wine2[[var]], na.rm = TRUE), length.out = 100)
  
 
  for (other_var in names(newdata)[-which(names(newdata) %in% c(var, "quality"))]) {
    newdata[[other_var]] <- mean(red_wine2[[other_var]], na.rm = TRUE)
  }
  
 
  predicted_probs <- predict(model_ord, newdata = newdata, type = "probs")
  
  
  predicted_probs_quality_ge_7 <- rowSums(predicted_probs[, which(levels(red_wine2$quality) >= 7)])
  
  
  plot(newdata[[var]], predicted_probs_quality_ge_7, type = "l", 
     xlab = var, ylab = "Probability of Quality >= 7",
     main = paste("Effect of", var, "on Probability of Quality >= 7"), ylim = c(0, 1))
}

We got error trying to use orm from rms package, so we decided to use polr from Mass package from lecture 5 R file.

Top three predictors solely in terms of their effect on the predicted probability that quality >= 7 are sulphate content, alcohol content and volatile acidity.

Consider an observation x0 (i.e., a single red wine) with the following characteristics: ## ## fixed.acidity 0.9946 ## volatile.acidity 3.3900 ## citric.acid 0.4700 ## residual.sugar 10.0000 ## chlorides 7.3000 ## free.sulfur.dioxide 15.0000 ## total.sulfur.dioxide 21.0000 ## density 0.6500 ## pH 0.0000 ## sulphates 1.2000 ## alcohol 0.0650

Based on your fitted model from 6), provide estimates for the following quantities: • Pr(quality==7|x−0) • Pr(quality>=7|x0) • Pr(quality==9|x0) • Pr(quality>=9|x0)

# Define the new observation
x0 <- data.frame(
  fixed.acidity = 7.3000,
  volatile.acidity = 0.6500,
  citric.acid = 0.0000,
  residual.sugar = 1.2000,
  chlorides = 0.0650,
  free.sulfur.dioxide = 15.0000,
  total.sulfur.dioxide = 21.0000,
  density = 0.9946,
  pH = 3.3900,
  sulphates = 0.4700,
  alcohol = 10.0000
)

# Predict the probabilities
probabilities <- predict(model_ord, x0, type = "probs")

# Print the probabilities
print(probabilities)

##           3           4           5           6           7           8 
## 0.006414165 0.035719884 0.601381876 0.325598623 0.029310854 0.001574598

• Pr(quality==7|x0) = 0.0293 • Pr(quality>=7|x0) = 0.0309 • Pr(quality==9|x0) = 0 • Pr(quality>=9|x0) = 0

The highest rating for quality is 8 so there’s no probabilities or quality = 9 or higher.

You’re asked to use your model from part 6) to provide predictions for the white wines included in the original data. Discuss whether or not you think this is a reasonable request and why. What would you do in practice (e.g., what if this was your boss asking)?

Using the model trained on red wine to predict white wine quality might not be feasible because of the potential differences in data characteristics. In practice, it is recommended to use a model trained on white wine data to predict white wine quality ratings or use a combined model with wine type as a preditor.

Homework 3

2024-02-17