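# Load the packages used throughout the analysis
library(dplyr)    # %>%, select_if
library(ggplot2)  # plots
library(corrplot) # correlation plot
library(pROC)     # roc, auc
library(rms)      # datadist, orm, Predict
# Load the dataset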
wine_data <- read.csv("C:/Users/shaik/OneDrive/Desktop/Stat/winequality-1.csv")
# Subset to red wines
red_wines <- subset(wine_data, type == "red")
red_wines$type <- NULL # Remove the 'type' column
# Convert quality to an ordered factor
red_wines$quality <- factor(red_wines$quality, ordered = TRUE)

Analysis for Question 1: The response variable (quality) is an ordered categorical variable with values ranging from 3 to 8. Based on the nature of this response:

Therefore, an ordinal regression model would be most appropriate as it:
- Preserves the ordered nature of the response
- Doesn’t assume equal spacing between categories
- Can handle multiple response levels
- Makes fewer assumptions than treating quality as continuous
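To make the "no equal spacing" point concrete, here is a minimal base-R sketch of how a proportional-odds model turns one intercept per cut-point into category probabilities. The intercepts and the linear predictor are invented for illustration; they are not estimates from the wine data.

```r
# Hypothetical intercepts, one per cut-point: logit P(quality >= j) = alpha_j + lp.
# They must decrease with j, but the gaps between them are free to differ.
alphas <- c(">=4" = 5.0, ">=5" = 3.5, ">=6" = 0.2, ">=7" = -2.4, ">=8" = -5.1)
lp <- 0.3  # hypothetical linear predictor for one wine

# Exceedance probabilities P(quality >= j) for j = 4, ..., 8
p_ge <- plogis(alphas + lp)

# Category probabilities are differences of adjacent exceedance probabilities
p_cat <- -diff(c(1, p_ge, 0))
names(p_cat) <- 3:8

print(round(p_cat, 4))
```

Because each cut-point has its own intercept, the model never needs to assume that the step from 5 to 6 means the same as the step from 7 to 8.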
ggplot(red_wines, aes(x = quality)) +
  geom_bar(fill = "maroon", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Distribution of Wine Quality",
       x = "Quality Score", y = "Count")

Analysis for Question 2: The distribution of wine quality shows:
- Most wines are rated 5 or 6
- Very few wines receive extreme ratings (3 or 8)
- The distribution is unimodal and roughly bell-shaped, with a slightly longer tail toward the higher scores (a slight right skew)
- Quality scores of 7 or higher are relatively rare, making them “premium” wines
# quality is an ordered factor at this point, so convert it back to numeric;
# otherwise select_if(is.numeric) drops it from the correlation matrix
corr_matrix <- cor(red_wines %>% mutate(quality = as.numeric(as.character(quality))) %>% select_if(is.numeric))
corrplot(corr_matrix, method = "color", type = "lower", tl.cex = 0.7)

Key Variable Associations with Quality:
1. Alcohol shows the strongest positive correlation with quality
2. Volatile acidity has a notable negative correlation
3. Sulphates show a moderate positive correlation
4. Total sulfur dioxide shows a weak negative correlation
5. Other variables show weaker correlations with quality
# Create binary response
red_wines$high_quality <- ifelse(as.numeric(as.character(red_wines$quality)) >= 7, 1, 0)
# Fit logistic regression
logit_model <- glm(high_quality ~ . - quality, data = red_wines, family = binomial)
# Get predictions
pred_probs <- predict(logit_model, type = "response")
# ROC Curve
roc_obj <- roc(red_wines$high_quality, pred_probs)
auc_value <- auc(roc_obj)
# Plot ROC curve
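plot(roc_obj, main = "ROC Curve",
     col = "maroon",
     lwd = 2)
# pROC plots specificity on a reversed x-axis, so the chance line is
# sensitivity = 1 - specificity
abline(a = 1, b = -1, lty = 2)

# Calibration Curve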
calibration_data <- data.frame(
predicted = pred_probs,
actual = red_wines$high_quality
)
ggplot(calibration_data, aes(x = predicted, y = actual)) +
  geom_smooth(method = "loess", se = FALSE, color = "maroon") +
  geom_abline(linetype = "dashed") +
  theme_minimal() +
  labs(title = "Calibration Curve",
       x = "Predicted Probability",
       y = "Actual Probability")

Analysis for Question 3: Model Performance Assessment:
1. ROC Curve Analysis:
   - The curve shows good separation from the diagonal line
   - AUC value indicates moderate-to-good discriminative ability
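As a reminder of what the AUC measures, the sketch below computes it directly from its pairwise definition: the probability that a randomly chosen high-quality wine receives a higher predicted probability than a randomly chosen other wine. The labels and scores are made-up toy values, and `auc_by_pairs` is a hypothetical helper, not part of pROC.

```r
# Hypothetical toy data: 1 = high quality, 0 = not; scores are invented
actual    <- c(0, 0, 1, 0, 1, 1, 0, 1)
predicted <- c(0.10, 0.30, 0.35, 0.40, 0.60, 0.70, 0.20, 0.90)

# AUC = P(score of a random positive > score of a random negative),
# with ties counted as one half
auc_by_pairs <- function(actual, predicted) {
  pos <- predicted[actual == 1]
  neg <- predicted[actual == 0]
  wins <- outer(pos, neg, function(p, n) (p > n) + 0.5 * (p == n))
  mean(wins)
}

toy_auc <- auc_by_pairs(actual, predicted)
print(toy_auc)
```

A value of 0.5 corresponds to the diagonal chance line and 1 to perfect ranking, which is why the distance of the ROC curve from the diagonal and the AUC tell the same story.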
# Fit ordinal regression model
# Drop the binary high_quality indicator created above so that it is not
# used as a predictor of quality
ord_data <- subset(red_wines, select = -high_quality)
ord_data$quality <- as.ordered(ord_data$quality)
dd <- datadist(ord_data)
options(datadist = "dd")
ord_model <- orm(quality ~ ., data = ord_data)
# Effect plots for predictors
effect_plot <- Predict(ord_model)
plot(effect_plot)

Analysis for Question 4: Interpretation of Alcohol Effect:
1. Odds Ratio Interpretation:
   - For each one percentage-point increase in alcohol content, the odds of a wine being high quality (≥7) increase by approximately 112.4%, i.e. an odds ratio of about 2.12
   - This is a substantial effect, consistent with alcohol content being a key predictor of wine quality
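The link between a percent change in odds, an odds ratio, and a coefficient on the log-odds scale can be sketched as follows. Only the 112.4% figure comes from the analysis above; the coefficient is backed out from it rather than read from model output, and `odds_ratio_to_pct` is a hypothetical helper.

```r
# Percent change in odds implied by an odds ratio
odds_ratio_to_pct <- function(or) (or - 1) * 100

reported_pct <- 112.4                   # figure quoted in the analysis
implied_or   <- 1 + reported_pct / 100  # odds ratio per one-unit increase
implied_beta <- log(implied_or)         # coefficient on the log-odds scale

# Odds ratios multiply, so a two-unit increase squares the odds ratio
or_two_units  <- exp(2 * implied_beta)
pct_two_units <- odds_ratio_to_pct(or_two_units)

cat("Odds ratio per unit:", implied_or, "\n")
cat("Odds ratio per two units:", or_two_units, "\n")
```

Note that the effect is multiplicative on the odds, not additive on the probability, so the change in predicted probability depends on where the wine starts.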
Analysis for Question 5: The logistic regression approach is problematic for these data because:
Analysis: Based on the effect plots, the three most important predictors are:
# Load necessary library
library(rms)
ord_model <- orm(quality ~ fixed.acidity + volatile.acidity + citric.acid +
                   residual.sugar + chlorides + free.sulfur.dioxide +
                   total.sulfur.dioxide + density + pH + sulphates + alcohol,
                 data = red_wines)
# Define the new observation x0
x0 <- data.frame(
fixed.acidity = 7.3000,
volatile.acidity = 0.6500,
citric.acid = 0.0000,
residual.sugar = 1.2000,
chlorides = 0.0650,
free.sulfur.dioxide = 15.0000,
total.sulfur.dioxide = 21.0000,
density = 0.9946,
pH = 3.3900,
sulphates = 0.4700,
alcohol = 10.0000
)
# Predict probabilities for each quality level
pred_probs <- predict(ord_model, newdata = x0, type = "fitted.ind")
# Access probabilities using the correct names
P_quality_7 <- pred_probs["quality=7"] # P(quality = 7 | x0)
P_quality_ge_7 <- sum(pred_probs[c("quality=7", "quality=8")]) # P(quality >= 7 | x0)
P_quality_9 <- 0 # P(quality = 9 | x0) is 0 because quality levels are 3-8
P_quality_le_9 <- 1 # P(quality <= 9 | x0) is 1 because quality levels are 3-8
# Print the results
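cat("P(quality = 7 | x0):", P_quality_7, "\n")
cat("P(quality >= 7 | x0):", P_quality_ge_7, "\n")
cat("P(quality = 9 | x0):", P_quality_9, "\n")
cat("P(quality <= 9 | x0):", P_quality_le_9, "\n")

## P(quality = 7 | x0): 0.03018776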
## P(quality >= 7 | x0): 0.03180595
## P(quality = 9 | x0): 0
## P(quality <= 9 | x0): 1
Analysis for Question 7: For the given wine sample:
- The probability of exactly quality 7 is low (about 3.0%)
- The combined probability of high quality (≥7) is also low (about 3.2%)
- P(quality = 9) is 0 as expected, since 9 isn’t in the possible quality range
- P(quality ≤ 9) is 1, as all possible quality scores are ≤ 9
- These predictions align with the wine’s characteristics:
  * Moderate alcohol content (10%)
  * Relatively high volatile acidity (0.65)
  * Low sulphates (0.47)

These values suggest a wine of moderate quality, explaining the low probabilities of high quality scores.
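The two reported probabilities also pin down the remaining pieces by simple arithmetic. This is a small sketch that uses only the printed values above, with no model refit.

```r
# Values copied from the printed output above
p_eq_7 <- 0.03018776  # P(quality = 7 | x0)
p_ge_7 <- 0.03180595  # P(quality >= 7 | x0)

# Levels 7 and 8 are the only ones at or above 7, so the rest is P(quality = 8)
p_eq_8 <- p_ge_7 - p_eq_7

# The complement is the probability of a rating of 6 or lower
p_le_6 <- 1 - p_ge_7

cat("P(quality = 8 | x0):", p_eq_8, "\n")
cat("P(quality <= 6 | x0):", p_le_6, "\n")
```

Almost all of the probability mass sits at 6 or below, which matches the reading of this sample as a wine of moderate quality.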
Analysis for Question 8: Applying the red wine model to white wines would be problematic for several reasons: