Homework 3

Context

This homework project dives into the Wine Quality dataset from the UCI Machine Learning Repository. The dataset allows for the modelling of wine quality based on physicochemical properties. Information on both red and white wines is available in this dataset, for this homework, only red wine will be analyzed. The response variables focused on for this homework will be quality.

Data Dictionary

Numeric Variables

fixed.acidity: Numeric variable representing the fixed acidity of the wine.
volatile.acidity: Numeric variable indicating the volatile acidity of the wine.
citric.acid: Numeric variable denoting the citric acid content in the wine.
residual.sugar: Numeric variable representing the residual sugar content in the wine.
chlorides: Numeric variable indicating the chloride content in the wine.
free.sulfur.dioxide: Numeric variable specifying the free sulfur dioxide content in the wine.
total.sulfur.dioxide: Numeric variable representing the total sulfur dioxide content in the wine.
density: Numeric variable indicating the density of the wine.
pH: Numeric variable representing the pH level of the wine.
sulphates: Numeric variable denoting the sulphate content in the wine.
alcohol: Numeric variable indicating the alcohol content in the wine.

Character Variable

type: Character variable specifying the type of wine (e.g., “red” or “white”).

Response Variable

quality: Integer variable representing the quality score of the wine.

Libraries Used

# Set a default CRAN mirror
options(repos = c(CRAN = "https://cloud.r-project.org"))

# Install the "corrplot" package
install.packages("ggplot2")

## 
## The downloaded binary packages are in
##  /var/folders/fk/51mk7gc95yv17dsrhy1rlgz00000gn/T//RtmpFAnnjT/downloaded_packages

install.packages("ROCR")

## 
## The downloaded binary packages are in
##  /var/folders/fk/51mk7gc95yv17dsrhy1rlgz00000gn/T//RtmpFAnnjT/downloaded_packages

install.packages("rms")

## 
## The downloaded binary packages are in
##  /var/folders/fk/51mk7gc95yv17dsrhy1rlgz00000gn/T//RtmpFAnnjT/downloaded_packages

install.packages("pdp")

## 
## The downloaded binary packages are in
##  /var/folders/fk/51mk7gc95yv17dsrhy1rlgz00000gn/T//RtmpFAnnjT/downloaded_packages

install.packages("corrgram")

## 
## The downloaded binary packages are in
##  /var/folders/fk/51mk7gc95yv17dsrhy1rlgz00000gn/T//RtmpFAnnjT/downloaded_packages

# Load the required libraries
library(ggplot2)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(ROCR)
library(rms)

## Loading required package: Hmisc

## 
## Attaching package: 'Hmisc'

## The following objects are masked from 'package:dplyr':
## 
##     src, summarize

## The following objects are masked from 'package:base':
## 
##     format.pval, units

library(pdp)
library(corrgram)

Question 1

wine<- read.csv("/Users/keerthichereddy/Documents/Sem2/StatisticalModelling/wine.csv", stringsAsFactors = TRUE)
str(wine)

## 'data.frame':    6497 obs. of  13 variables:
##  $ fixed.acidity       : num  7.4 7.8 7.8 11.2 7.4 7.4 7.9 7.3 7.8 7.5 ...
##  $ volatile.acidity    : num  0.7 0.88 0.76 0.28 0.7 0.66 0.6 0.65 0.58 0.5 ...
##  $ citric.acid         : num  0 0 0.04 0.56 0 0 0.06 0 0.02 0.36 ...
##  $ residual.sugar      : num  1.9 2.6 2.3 1.9 1.9 1.8 1.6 1.2 2 6.1 ...
##  $ chlorides           : num  0.076 0.098 0.092 0.075 0.076 0.075 0.069 0.065 0.073 0.071 ...
##  $ free.sulfur.dioxide : num  11 25 15 17 11 13 15 15 9 17 ...
##  $ total.sulfur.dioxide: num  34 67 54 60 34 40 59 21 18 102 ...
##  $ density             : num  0.998 0.997 0.997 0.998 0.998 ...
##  $ pH                  : num  3.51 3.2 3.26 3.16 3.51 3.51 3.3 3.39 3.36 3.35 ...
##  $ sulphates           : num  0.56 0.68 0.65 0.58 0.56 0.56 0.46 0.47 0.57 0.8 ...
##  $ alcohol             : num  9.4 9.8 9.8 9.8 9.4 9.4 9.4 10 9.5 10.5 ...
##  $ quality             : int  5 5 5 6 5 5 5 7 7 5 ...
##  $ type                : Factor w/ 2 levels "red","white": 1 1 1 1 1 1 1 1 1 1 ...

library(ggplot2)

ggplot(red_wine, aes(x = quality)) +
  geom_histogram(binwidth = 1, fill = "blue", color = "black", alpha = 0.7) +
  labs(title = "Distribution of Wine Quality")

When looking only at the observations of red wine, the distribution of quality appears to be normally distributed which would make linear regression a strong choice.

Question 2

red_wine <- subset(red_wine, select = -type)
plot(red_wine)

First, type was removed from the dataset as it is now filtered to be “red” for every record. Several of the variables appear to have a relationship with the response variable, quality. Each relationship is outlined below

Fixed.acidity appears to have a slight positive correlation
Volatile.acidity appears to have a slight negative correlation
Citric.acidity appears to have no correlation
Residual.sugar appears to have no correlation
Chlorides appears to have no correlation
Free.sulfur.dioxide appears to have no correlation
Total.sulfur.dioxide appears to have no correlation
Density appears to have no correlation
pH appears to have no correlation
Sulphates appears to have no correlation
Acohol appears to have a positive correlation

Question 3:

Construct a binary response according to the following rule Y =(1, quality >= 7 0, quality < 7 Fit a logistic regression to the data using all possible main effects (i.e., include each variable, but no interaction effects). Assess the performance of this model. Does the model seem well-calibrated? Discuss and provide a plot of a calibration curve.

quality_binary<- as.factor(red_wine$quality)
red_wine<- red_wine%>%mutate(quality = as.factor(case_when(
    quality >=7 ~ 1,
    TRUE ~ 0
  ))) 

#Fit a logistic regression to the data 
lm <- glm(quality ~ ., family = binomial(link = "logit"), data = red_wine)
summary(lm)

## 
## Call:
## glm(formula = quality ~ ., family = binomial(link = "logit"), 
##     data = red_wine)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           2.428e+02  1.081e+02   2.247 0.024660 *  
## fixed.acidity         2.750e-01  1.253e-01   2.195 0.028183 *  
## volatile.acidity     -2.581e+00  7.843e-01  -3.291 0.000999 ***
## citric.acid           5.678e-01  8.385e-01   0.677 0.498313    
## residual.sugar        2.395e-01  7.373e-02   3.248 0.001163 ** 
## chlorides            -8.816e+00  3.365e+00  -2.620 0.008788 ** 
## free.sulfur.dioxide   1.082e-02  1.223e-02   0.884 0.376469    
## total.sulfur.dioxide -1.653e-02  4.894e-03  -3.378 0.000731 ***
## density              -2.578e+02  1.104e+02  -2.335 0.019536 *  
## pH                    2.242e-01  9.984e-01   0.225 0.822327    
## sulphates             3.750e+00  5.416e-01   6.924 4.39e-12 ***
## alcohol               7.533e-01  1.316e-01   5.724 1.04e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1269.92  on 1598  degrees of freedom
## Residual deviance:  870.86  on 1587  degrees of freedom
## AIC: 894.86
## 
## Number of Fisher Scoring iterations: 6

pred_lm<- predict(lm, type="response")

pred <- prediction(pred_lm, red_wine$quality)
perf <- performance(pred, "tpr", "fpr")
plot(perf, colorize=TRUE)

# AUC Value
AUC <- as.numeric(performance(pred, "auc")@y.values)
AUC

## [1] 0.8821717

The calibration curve indicates a reasonably well-calibrated model,overall performance remains satisfactory, as evidenced by an AUC exceeding 0.7 in the full logistic model.

Question 4:

Interpret the effect of the predictor alcohol on the odds that quality >= 7. Construct an effect plot visualizing the effect of alcohol on the probability that quality >= 7 and describe the relationship.Does this plot look linear or nonlinear? If nonlinear, discuss how this is possible.

boxplot(red_wine$alcohol~red_wine$quality)

boxplot(red_wine$alcohol~quality_binary)

Upon analyzing the data using box plots, it’s evident that the relationship between alcohol content and the predicted probabilities of wine quality being 7 or higher is nonlinear.This complexity may arise from interactions between predictors or varying effects of predictors across different levels. This highlights the need for deeper exploration into potential interactions and contextual variations.

Question 5:

Discuss reasons why the modeling approach used in 3) is ill-advised for modeling these data.

Logical regression is well suited for binary responses but in our wine dataset we have variety of values in response variable “quality”. Logistic regression model will not be able to predict the values appropriately.

Question 6:

Fit an ordinal regression model to the data using the original response (i.e., quality) using the orm() function from R package rms. Construct an effect plot for each predictor showing the effect on the predicted probability that quality >= 7. From these, try to determine the top three predictors solely in terms of their effect on the predicted probability that quality >= 7

wine<- read.csv("/Users/keerthichereddy/Documents/Sem2/StatisticalModelling/wine.csv", stringsAsFactors = TRUE)

red_wine_final <- subset(wine, type == "red")
red_wine_final$quality<- as.factor(red_wine_final$quality)
head(red_wine_final)

##   fixed.acidity volatile.acidity citric.acid residual.sugar chlorides
## 1           7.4             0.70        0.00            1.9     0.076
## 2           7.8             0.88        0.00            2.6     0.098
## 3           7.8             0.76        0.04            2.3     0.092
## 4          11.2             0.28        0.56            1.9     0.075
## 5           7.4             0.70        0.00            1.9     0.076
## 6           7.4             0.66        0.00            1.8     0.075
##   free.sulfur.dioxide total.sulfur.dioxide density   pH sulphates alcohol
## 1                  11                   34  0.9978 3.51      0.56     9.4
## 2                  25                   67  0.9968 3.20      0.68     9.8
## 3                  15                   54  0.9970 3.26      0.65     9.8
## 4                  17                   60  0.9980 3.16      0.58     9.8
## 5                  11                   34  0.9978 3.51      0.56     9.4
## 6                  13                   40  0.9978 3.51      0.56     9.4
##   quality type
## 1       5  red
## 2       5  red
## 3       5  red
## 4       6  red
## 5       5  red
## 6       5  red

library(rms)


red_wine_final <- red_wine_final[,-c(13,14)]
ord_model <- orm(quality ~ pH + chlorides + fixed.acidity + volatile.acidity + citric.acid + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + density + sulphates + alcohol, data = red_wine_final)
ord_model

## Logistic (Proportional Odds) Ordinal Regression Model
## 
## orm(formula = quality ~ pH + chlorides + fixed.acidity + volatile.acidity + 
##     citric.acid + residual.sugar + free.sulfur.dioxide + total.sulfur.dioxide + 
##     density + sulphates + alcohol, data = red_wine_final)
## 
## 
## Frequencies of Responses
## 
##   3   4   5   6   7   8 
##  10  53 681 638 199  18 
## 
##                        Model Likelihood               Discrimination    Rank Discrim.    
##                              Ratio Test                      Indexes          Indexes    
## Obs          1599    LR chi2     713.68    R2                  0.397    rho     0.602    
## Distinct Y      6    d.f.            11    R2(11,1599)         0.356                     
## Median Y        4    Pr(> chi2) <0.0001    R2(11,1370.8)       0.401                     
## max |deriv| 0.003    Score chi2  680.97    |Pr(Y>=median)-0.5| 0.244                     
##                      Pr(> chi2) <0.0001                                                  
## 
##                      Coef     S.E.    Wald Z Pr(>|Z|)
## y>=4                  75.7197 66.9588  1.13  0.2581  
## y>=5                  73.8023 66.9612  1.10  0.2704  
## y>=6                  70.0889 66.9674  1.05  0.2953  
## y>=7                  67.2306 66.9597  1.00  0.3154  
## y>=8                  64.2216 66.9558  0.96  0.3375  
## pH                    -0.8485  0.6009 -1.41  0.1579  
## chlorides             -5.1429  1.3595 -3.78  0.0002  
## fixed.acidity          0.1282  0.0823  1.56  0.1194  
## volatile.acidity      -3.3959  0.4031 -8.43  <0.0001 
## citric.acid           -0.8022  0.4622 -1.74  0.0826  
## residual.sugar         0.0878  0.0480  1.83  0.0672  
## free.sulfur.dioxide    0.0137  0.0068  2.01  0.0444  
## total.sulfur.dioxide  -0.0111  0.0024 -4.70  <0.0001 
## density              -76.3267 68.3661 -1.12  0.2642  
## sulphates              2.9017  0.3675  7.90  <0.0001 
## alcohol                0.8310  0.0852  9.75  <0.0001

pfun.orm <- function(object, newdata) {
  colMeans(predict(object, newdata = newdata, type = "fitted"))
}
pd.pH <- partial(ord_model, pred.var = "pH", pred.fun = pfun.orm)
ggplot(pd.pH, aes(x = pH, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for pH") +
  xlab("pH") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("pH" = quantile(red_wine_final$pH, prob = 1:9/10)),
           aes(x = pH), inherit.aes = FALSE)

## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

pd.alcohol <- partial(ord_model, pred.var = "alcohol", pred.fun = pfun.orm)
ggplot(pd.alcohol, aes(x = alcohol, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Alcohol") +
  xlab("alcohol") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("alcohol" = quantile(red_wine_final$alcohol, prob = 1:9/10)),
           aes(x = alcohol), inherit.aes = FALSE)

pd.chlorides <- partial(ord_model, pred.var = "chlorides", pred.fun = pfun.orm)
ggplot(pd.chlorides, aes(x = chlorides, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Chlorides") +
  xlab("chlorides") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("chlorides" = quantile(red_wine_final$chlorides, prob = 1:9/10)),
           aes(x = chlorides), inherit.aes = FALSE)

pd.fixed.acidity <- partial(ord_model, pred.var = "fixed.acidity", pred.fun = pfun.orm)
ggplot(pd.fixed.acidity, aes(x = fixed.acidity, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Fixed Acidity") +
  xlab("Fixed Acidity") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("fixed.acidity" = quantile(red_wine_final$fixed.acidity, prob = 1:9/10)),
           aes(x = fixed.acidity), inherit.aes = FALSE)

pd.volatile.acidity <- partial(ord_model, pred.var = "volatile.acidity", pred.fun = pfun.orm)
ggplot(pd.volatile.acidity, aes(x = volatile.acidity, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Volatile Acidity") +
  xlab("Volatile Acidity") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("volatile.acidity" = quantile(red_wine_final$volatile.acidity, prob = 1:9/10)),
           aes(x = volatile.acidity), inherit.aes = FALSE)

pd.citric.acid <- partial(ord_model, pred.var = "citric.acid", pred.fun = pfun.orm)
ggplot(pd.citric.acid, aes(x = citric.acid, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Citric Acid") +
  xlab("Citric Acid") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("citric.acid" = quantile(red_wine_final$citric.acid, prob = 1:9/10)),
           aes(x = citric.acid), inherit.aes = FALSE)

pd.residual.sugar <- partial(ord_model, pred.var = "residual.sugar", pred.fun = pfun.orm)
ggplot(pd.residual.sugar, aes(x = residual.sugar, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Residual Sugar") +
  xlab("Residual Sugar") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("residual.sugar" = quantile(red_wine_final$residual.sugar, prob = 1:9/10)),
           aes(x = residual.sugar), inherit.aes = FALSE)

pd.free.sulfur.dioxide <- partial(ord_model, pred.var = "free.sulfur.dioxide", pred.fun = pfun.orm)
ggplot(pd.free.sulfur.dioxide, aes(x = free.sulfur.dioxide, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Free Sulfur Dioxide") +
  xlab("Free Sulfur Dioxide") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("free.sulfur.dioxide" = quantile(red_wine_final$free.sulfur.dioxide, prob = 1:9/10)),
           aes(x = free.sulfur.dioxide), inherit.aes = FALSE)

pd.total.sulfur.dioxide <- partial(ord_model, pred.var = "total.sulfur.dioxide", pred.fun = pfun.orm)
ggplot(pd.total.sulfur.dioxide, aes(x = total.sulfur.dioxide, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Total Sulfur Dioxide") +
  xlab("Total Sulfur Dioxide") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("total.sulfur.dioxide" = quantile(red_wine_final$total.sulfur.dioxide, prob = 1:9/10)),
           aes(x = total.sulfur.dioxide), inherit.aes = FALSE)

pd.density <- partial(ord_model, pred.var = "density", pred.fun = pfun.orm)
ggplot(pd.density, aes(x = density, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Density") +
  xlab("Density") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("density" = quantile(red_wine_final$density, prob = 1:9/10)),
           aes(x = density), inherit.aes = FALSE)

pd.sulphates <- partial(ord_model, pred.var = "sulphates", pred.fun = pfun.orm)
ggplot(pd.sulphates, aes(x = sulphates, y = yhat, linetype = yhat.id, color = yhat.id)) +
  geom_line(size = 2) +
  ggtitle("Effect Plot for Sulphates") +
  xlab("Sulphates") +
  ylab("Partial dependence") +
  geom_rug(data = data.frame("sulphates" = quantile(red_wine_final$sulphates, prob = 1:9/10)),
           aes(x = sulphates), inherit.aes = FALSE)

The top 3 predictors based on the plots above are sulphates,volatile acidity, and alcohol.

7.Consider an observation x0 (i.e., a single red wine) with the following characteristics: ## ## fixed.acidity 0.9946 ## volatile.acidity 3.3900 ## citric.acid 0.4700 ## residual.sugar 10.0000 ## chlorides 7.3000 ## free.sulfur.dioxide 15.0000 ## total.sulfur.dioxide 21.0000 ## density 0.6500 ## pH 0.0000 ## sulphates 1.2000 ## alcohol 0.0650 Based on your fitted model from 6), provide estimates for the following quantities: • Pr(quality==7|x−0) • Pr(quality>=7|x0) • Pr(quality==9|x0) • Pr(quality>=9|x0)

# Create a data frame for the new observation
new_data <- data.frame(
  fixed.acidity = 7.3000,
  volatile.acidity = 0.6500,
  citric.acid = 0.0000,
  residual.sugar = 1.2000,
  chlorides = 0.0650,
  free.sulfur.dioxide = 15.0000,
  total.sulfur.dioxide = 21.0000,
  density = 0.9946,
  pH = 3.3900,
  sulphates = 0.4700,
  alcohol = 10.0000
)

# Use the predict function to get probabilities for each category
probabilities <- predict(ord_model, newdata = new_data, type = "fitted.ind")

# Extract the probability for wine quality equal to 7
probability_quality_7 <- probabilities["quality=7"]

# Print the result
cat("Probability of wine quality equal to 7:", probability_quality_7, "\n")

## Probability of wine quality equal to 7: 0.03018793

# Use the predict function to get cumulative probabilities for each category
cumulative_probabilities <- predict(ord_model, newdata = new_data, type = "fitted")

# Extract the cumulative probability for wine quality greater than or equal to 7
cumulative_probability_7_or_higher <- cumulative_probabilities["y>=7"]

# Print the result
cat("Cumulative Probability of wine quality greater than or equal to 7:", cumulative_probability_7_or_higher, "\n")

## Cumulative Probability of wine quality greater than or equal to 7: 0.03180613

cat("Probability of wine quality equal to 9:", 0, "\n")

## Probability of wine quality equal to 9: 0

cat("Cumulative Probability of wine quality greater than or equal to 9:", 0, "\n")

## Cumulative Probability of wine quality greater than or equal to 9: 0

The results show that the probabilities of getting a wine quality = 9 or wine quality >= 9 are both zero because our model is trained on a data set that doesn’t have such high quality observations.

You’re asked to use your model from part 6) to provide predictions for the white wines included in the original data. Discuss whether or not you think this is a reasonable request and why. What would you do in practice (e.g., what if this was your boss asking)?

Applying a model trained exclusively on red wine data to predict white wine quality may not be reasonable due to the differences in physicochemical properties and quality determinants between red and white wines. These differences could lead to inaccurate predictions. Its necessary to build a separate model specifically for white wines or a combined model that includes wine type as a predictor to account for the variations between red and white wines.