1. Data Preparation and Initial Analysis

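# Load the packages used throughout (inferred from the functions called below)
library(ggplot2)   # ggplot, geom_bar, geom_smooth
library(dplyr)     # %>%, select_if
library(corrplot)  # corrplot
library(pROC)      # roc, auc
library(rms)       # datadist, orm, Predict

# Load the dataset
wine_data <- read.csv("C:/Users/shaik/OneDrive/Desktop/Stat/winequality-1.csv")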

# Subset to red wines
red_wines <- subset(wine_data, type == "red")
red_wines$type <- NULL  # Remove the 'type' column

# Convert quality to an ordered factor
red_wines$quality <- factor(red_wines$quality, ordered = TRUE)

Analysis for Question 1: The response variable (quality) is an ordered categorical variable with values ranging from 3 to 8. Based on the nature of this response:

  1. It has a natural ordering (3 is worse than 4, etc.)
  2. The intervals between categories may not be equal
  3. There are multiple ordered categories

Therefore, an ordinal regression model would be most appropriate, as it:

  • Preserves the ordered nature of the response
  • Doesn’t assume equal spacing between categories
  • Can handle multiple response levels
  • Makes fewer assumptions than treating quality as continuous
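As a point of reference, the ordinal model fitted later with `orm` can be written in cumulative-logit (proportional-odds) form. This is a sketch that assumes the default logistic family; the symbols α_j (one intercept per cut point) and β (one shared slope vector) are notation introduced here, not taken from the original analysis.

```latex
\operatorname{logit} P(Y \ge j \mid x) = \alpha_j + x^{\top}\beta, \qquad j = 4, \dots, 8
\\[4pt]
P(Y = j \mid x) = P(Y \ge j \mid x) - P(Y \ge j + 1 \mid x), \qquad P(Y \ge 3 \mid x) = 1,\quad P(Y \ge 9 \mid x) = 0
```

With six quality levels (3 to 8) there are five intercepts but a single set of slopes, which is how the model uses the ordering without assuming equal spacing between categories.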

2. Exploratory Data Analysis

2.1 Distribution of Wine Quality

ggplot(red_wines, aes(x = quality)) +
  geom_bar(fill = "maroon", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Wine Quality",
       x = "Quality Score", y = "Count")

Analysis for Question 2: The distribution of wine quality shows:

  • Most wines are rated 5 or 6
  • Very few wines receive extreme ratings (3 or 8)
  • The distribution is unimodal and roughly bell-shaped, with only mild asymmetry (quality is a discrete score, so "normal" applies only loosely)
  • Quality scores of 7 or higher are relatively rare, which is why they are treated as “premium” wines in the binary analysis below
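The counts behind a bar chart like this can be tabulated directly. The sketch below uses a small, made-up vector of scores purely for illustration (it is not the wine data); with the real data, `red_wines$quality` would take the place of the hypothetical `quality` vector.

```r
# Hypothetical quality scores for illustration only (NOT the real wine data)
quality <- factor(c(5, 6, 5, 7, 6, 5, 4, 6, 8, 5), levels = 3:8, ordered = TRUE)

# Counts and proportions per quality level (levels with no wines show as 0)
counts <- table(quality)
props  <- prop.table(counts)
print(rbind(count = counts, proportion = round(props, 2)))

# Share of wines rated 5 or 6
share_5_6 <- sum(props[c("5", "6")])
print(share_5_6)
```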

2.2 Correlation Matrix

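# quality is an ordered factor at this point; convert it back to numeric so it stays in the matrix
corr_matrix <- cor(red_wines %>% mutate(quality = as.numeric(as.character(quality))) %>% select_if(is.numeric))
corrplot(corr_matrix, method = "color", type = "lower", tl.cex = 0.7)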

Key Variable Associations with Quality:

  1. Alcohol shows the strongest positive correlation with quality
  2. Volatile acidity has a notable negative correlation
  3. Sulphates show a moderate positive correlation
  4. Total sulfur dioxide shows a weak negative correlation
  5. Other variables show weaker correlations with quality

3. Logistic Regression & Calibration Analysis

# Create binary response
red_wines$high_quality <- ifelse(as.numeric(as.character(red_wines$quality)) >= 7, 1, 0)

# Fit logistic regression
logit_model <- glm(high_quality ~ . - quality, data = red_wines, family = binomial)

# Get predictions
pred_probs <- predict(logit_model, type = "response")

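# ROC Curve (assuming roc() and auc() come from the pROC package)
roc_obj <- roc(red_wines$high_quality, pred_probs)
auc_value <- auc(roc_obj)
print(auc_value)  # report the AUC discussed in the analysis

# Plot ROC curve
plot(roc_obj, main = "ROC Curve",
     col = "maroon",
     lwd = 2)
# pROC puts specificity on a reversed x-axis, so the chance line is sensitivity = 1 - specificity
abline(a = 1, b = -1, lty = 2)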

# Calibration Curve
calibration_data <- data.frame(
  predicted = pred_probs,
  actual = red_wines$high_quality
)

ggplot(calibration_data, aes(x = predicted, y = actual)) +
  geom_smooth(method = "loess", se = FALSE, color = "maroon") +
  geom_abline(linetype = "dashed") +
  theme_minimal() +
  labs(title = "Calibration Curve",
       x = "Predicted Probability",
       y = "Actual Probability")

Analysis for Question 3: Model Performance Assessment:

  1. ROC Curve Analysis:
    • The curve shows good separation from the diagonal line
    • AUC value indicates moderate-to-good discriminative ability
  2. Calibration Analysis:
    • The calibration curve shows some deviation from the ideal 45-degree line
    • The model tends to slightly overestimate probabilities in the mid-range
    • Calibration is better at lower probabilities but becomes less reliable at higher probabilities
    • Overall, the model would benefit from recalibration for more accurate probability estimates
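A binned calibration table is a common numerical companion to a smoothed calibration curve. The sketch below uses simulated probabilities and outcomes as stand-ins for `pred_probs` and `red_wines$high_quality`, so the numbers it prints say nothing about the wine model; the ten equal-width bins and the Brier score are choices made for this illustration.

```r
set.seed(1)
n <- 2000

# Simulated stand-ins (assumption): predicted probabilities and 0/1 outcomes
# drawn so that the predictions are well calibrated by construction
pred   <- runif(n)
actual <- rbinom(n, size = 1, prob = pred)

# Group predictions into ten equal-width bins
bins <- cut(pred, breaks = seq(0, 1, by = 0.1), include.lowest = TRUE)

# Compare the mean predicted probability with the observed event rate per bin
calib <- data.frame(
  mean_predicted = as.vector(tapply(pred, bins, mean)),
  observed_rate  = as.vector(tapply(actual, bins, mean)),
  n              = as.vector(table(bins)),
  row.names      = levels(bins)
)
print(round(calib, 3))

# Brier score: mean squared difference between prediction and outcome (lower is better)
brier <- mean((pred - actual)^2)
print(brier)
```

For a well-calibrated model the two columns track each other closely; systematic gaps in particular bins show where predicted probabilities are too high or too low.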

4. Alcohol Effect Analysis and Effect Plots

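# Fit ordinal regression model
library(rms)
red_wines$quality <- as.ordered(red_wines$quality)
dd <- datadist(red_wines)
options(datadist = "dd")
# List the predictors explicitly: red_wines now also contains high_quality,
# which is derived from quality and must not be used as a predictor
ord_model <- orm(quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_wines)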

# Effect plots for predictors
effect_plot <- Predict(ord_model)
plot(effect_plot)

# Additional analysis for alcohol effect
alcohol_effect <- coef(logit_model)["alcohol"]
odds_ratio <- exp(alcohol_effect)

Analysis for Question 4: Interpretation of Alcohol Effect:

  1. Odds Ratio Interpretation:
    • For each one-unit increase in alcohol percentage, the odds of a wine being high quality (≥7) increase by approximately 112.4%, holding the other predictors fixed
    • This is a substantial effect, confirming alcohol content as a key predictor of wine quality
  2. Effect Plot Analysis:
    • The relationship between alcohol and quality probability is nonlinear
    • The nonlinearity arises from the logistic link function, which transforms the linear predictor to probabilities
    • The steepest increase in probability occurs in the middle range of alcohol content (11-13%)
    • The effect becomes less pronounced at very low and very high alcohol levels, creating an S-shaped curve
    • This nonlinearity is expected and appropriate, as it reflects the bounded nature of probabilities (0-1)
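The arithmetic linking a logistic coefficient to the percentage change in odds, and the reason the same odds ratio produces different probability changes, can be sketched as follows. The coefficient here is an assumption: it is back-solved from the approximately 112.4% figure reported above rather than read from the fitted model, and the baseline probabilities are arbitrary illustrative values.

```r
# Assumed coefficient (back-solved from the reported ~112.4% increase in odds)
beta_alcohol <- log(2.124)

# Odds ratio and percentage change in odds for a one-unit increase in alcohol
odds_ratio <- exp(beta_alcohol)
pct_change <- (odds_ratio - 1) * 100
print(c(odds_ratio = odds_ratio, pct_change = pct_change))

# A two-unit increase multiplies the odds by the odds ratio squared
or_two_units <- exp(2 * beta_alcohol)

# The same shift on the log-odds scale moves the probability by different
# amounts depending on the baseline (illustrative baselines)
p0 <- c(0.05, 0.50, 0.90)
p1 <- plogis(qlogis(p0) + beta_alcohol)
print(round(data.frame(baseline = p0, after = p1, change = p1 - p0), 3))
```

The probability change is largest for the mid-range baseline and small near 0 and 1, which is the S-shape described in the effect-plot discussion.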

5. Limitations of Logistic Regression

Analysis for Question 5: The logistic regression approach is problematic for these data because:

  1. Information Loss:
    • Converting the ordinal quality scale (3-8) to binary (≥7 or <7) discards valuable information about quality differences
    • Cannot distinguish between wines rated 3 vs 6, or 7 vs 8
  2. Threshold Arbitrariness:
    • The choice of 7 as the threshold is somewhat arbitrary
    • Different thresholds could lead to different conclusions
    • Important patterns near the threshold may be masked
  3. Ordinal Nature Violation:
    • Ignores the inherent ordering of wine quality scores
    • Treats all wines below 7 as equally “not premium”
    • Fails to capture the progressive nature of wine quality
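The information loss is easy to see by applying the same threshold used earlier (quality ≥ 7) to the six possible scores. This is a minimal illustration on the score scale itself, not on the wine data.

```r
# The six possible quality scores and their binary recoding (threshold of 7)
quality      <- 3:8
high_quality <- as.integer(quality >= 7)
print(data.frame(quality, high_quality))

# Six distinct levels collapse to two
n_levels_before <- length(unique(quality))
n_levels_after  <- length(unique(high_quality))
print(c(before = n_levels_before, after = n_levels_after))
```

Scores 3 through 6 all map to 0 and scores 7 and 8 both map to 1, so the binary model cannot tell them apart.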

6. Ordinal Regression Analysis

6.1 Top 3 Predictors for Quality >= 7

Analysis: Based on the effect plots, the three most important predictors are:

  1. Alcohol (strongest positive effect)
    • Shows the largest range in predicted probabilities
    • Most consistent positive relationship with quality
    • Effect is particularly strong between 10% and 13%
  2. Sulphates (moderate positive effect)
    • Second most influential predictor
    • Shows a clear positive relationship with quality
    • Effect plateaus at higher concentrations
  3. Volatile Acidity (strong negative effect)
    • Higher levels consistently decrease quality probability
    • Effect is particularly pronounced above 0.5 g/dm³
    • Important for quality control in wine production

7. Prediction Analysis for Given Wine Sample

# Load necessary library
library(rms)
ord_model <- orm(quality ~ fixed.acidity + volatile.acidity + citric.acid + residual.sugar + chlorides + free.sulfur.dioxide + total.sulfur.dioxide + density + pH + sulphates + alcohol, data = red_wines)

# Define the new observation x0
x0 <- data.frame(
  fixed.acidity = 7.3000,
  volatile.acidity = 0.6500,
  citric.acid = 0.0000,
  residual.sugar = 1.2000,
  chlorides = 0.0650,
  free.sulfur.dioxide = 15.0000,
  total.sulfur.dioxide = 21.0000,
  density = 0.9946,
  pH = 3.3900,
  sulphates = 0.4700,
  alcohol = 10.0000
)

# Predict probabilities for each quality level
pred_probs <- predict(ord_model, newdata = x0, type = "fitted.ind")

# Access probabilities using the correct names
P_quality_7 <- pred_probs["quality=7"]  # P(quality = 7 | x0)
P_quality_ge_7 <- sum(pred_probs[c("quality=7", "quality=8")])  # P(quality >= 7 | x0)
P_quality_9 <- 0  # P(quality = 9 | x0) is 0 because quality levels are 3-8
P_quality_le_9 <- 1  # P(quality <= 9 | x0) is 1 because quality levels are 3-8

# Print the results
cat("P(quality = 7 | x0):", P_quality_7, "\n")
## P(quality = 7 | x0): 0.03018776
cat("P(quality >= 7 | x0):", P_quality_ge_7, "\n")
## P(quality >= 7 | x0): 0.03180595
cat("P(quality = 9 | x0):", P_quality_9, "\n")
## P(quality = 9 | x0): 0
cat("P(quality <= 9 | x0):", P_quality_le_9, "\n")
## P(quality <= 9 | x0): 1

Analysis for Question 7: For the given wine sample:

  • The probability of exactly quality 7 is low (about 3.0%)
  • The combined probability of high quality (≥7) is also low (about 3.2%)
  • P(quality = 9) is 0, since 9 is outside the observed quality range (3-8)
  • P(quality ≤ 9) is 1, as all possible quality scores are ≤ 9

These predictions align with the wine’s characteristics:

  • Moderate alcohol content (10%)
  • Relatively high volatile acidity (0.65)
  • Low sulphates (0.47)

These values suggest a wine of moderate quality, explaining the low probabilities of high quality scores.
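The remaining probabilities follow from the two printed values by simple arithmetic. The sketch below only reuses the numbers shown in the output above; the variable names are introduced here for illustration.

```r
# Values copied from the printed output above
p_eq_7 <- 0.03018776   # P(quality = 7 | x0)
p_ge_7 <- 0.03180595   # P(quality >= 7 | x0)

# Quality 8 is the only level above 7, so P(quality = 8) is the difference
p_eq_8 <- p_ge_7 - p_eq_7

# Everything else is the complement: P(quality <= 6 | x0)
p_le_6 <- 1 - p_ge_7

print(c(p_eq_8 = p_eq_8, p_le_6 = p_le_6))
```

Almost all of the predicted probability mass for this wine therefore sits on quality 6 or lower.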

8. Model Application to White Wines

Analysis for Question 8: Applying the red wine model to white wines would be problematic for several reasons:

  1. Chemical Composition Differences:
    • Red and white wines have fundamentally different chemical profiles
    • Tannin content varies significantly
    • Different fermentation processes affect chemical relationships
  2. Different Quality Drivers:
    • Factors that indicate quality in red wines may not apply to white wines
    • Optimal levels of compounds (e.g., acids, sulfites) differ between types
    • Consumer expectations and evaluation criteria vary
  3. Practical Recommendation: If this were a real-world business request, I would:
    • Explain the technical limitations of applying the red wine model to white wines
    • Propose developing a separate model for white wines
    • Suggest collecting white wine-specific expert ratings
    • Recommend a pilot study to validate the approach
    • Offer to create a white wine model using the same methodology