1 Abstract

I conducted this case study to demonstrate how I can transform a messy dataset into actionable insights. I chose the Red Wine Quality dataset, which originally contained issues such as inconsistent column names, missing values, and unscaled variables. My goal was to walk through the full data science workflow, from cleaning to modeling to interpretation.

Deliverables: Reproducible RMarkdown, visual EDA, model diagnostics, and plain-language recommendations.

2 Setup

Setting up Libraries

# Core packages
libs <- c(
  "tidyverse", "janitor", "skimr", "broom", "rsample", "yardstick", "ggplot2", "scales"
)
need <- libs[!libs %in% installed.packages()[, "Package"]]
if (length(need)) install.packages(need, quiet = TRUE)
invisible(lapply(libs, library, character.only = TRUE))

options(dplyr.width = Inf)
set.seed(123)

3 Data Ingestion & Cleaning

I’ll begin by loading the red wine data. The UCI file uses a semicolon separator and includes lowercase column names with spaces; I’ll normalize these names and confirm types.

# Try to download from UCI; if offline, look for a local copy next to this Rmd
uci_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
local_path <- "winequality-red.csv"

if (!file.exists(local_path)) {
  try(download.file(uci_url, destfile = local_path, mode = "wb", quiet = TRUE), silent = TRUE)
}

wine_raw <- readr::read_delim(uci_url, delim = ";", show_col_types = FALSE)

# Clean column names and basic checks
wine <- wine_raw %>%
  janitor::clean_names() %>%
  mutate(across(everything(), ~ .x))

skimr::skim(wine)

Data summary
Name	wine
Number of rows	1599
Number of columns	12
_______________________
Column type frequency:
numeric	12
________________________
Group variables	None

Variable type: numeric

skim_variable	complete_rate	mean	sd	p0	p25	p50	p75	p100	hist
fixed_acidity	1	8.32	1.74	4.60	7.10	7.90	9.20	15.90	▂▇▂▁▁
volatile_acidity	1	0.53	0.18	0.12	0.39	0.52	0.64	1.58	▅▇▂▁▁
citric_acid	1	0.27	0.19	0.00	0.09	0.26	0.42	1.00	▇▆▅▁▁
residual_sugar	1	2.54	1.41	0.90	1.90	2.20	2.60	15.50	▇▁▁▁▁
chlorides	1	0.09	0.05	0.01	0.07	0.08	0.09	0.61	▇▁▁▁▁
free_sulfur_dioxide	1	15.87	10.46	1.00	7.00	14.00	21.00	72.00	▇▅▁▁▁
total_sulfur_dioxide	1	46.47	32.90	6.00	22.00	38.00	62.00	289.00	▇▂▁▁▁
density	1	1.00	0.00	0.99	1.00	1.00	1.00	1.00	▁▃▇▂▁
p_h	1	3.31	0.15	2.74	3.21	3.31	3.40	4.01	▁▅▇▂▁
sulphates	1	0.66	0.17	0.33	0.55	0.62	0.73	2.00	▇▅▁▁▁
alcohol	1	10.42	1.07	8.40	9.50	10.20	11.10	14.90	▇▇▃▁▁
quality	1	5.64	0.81	3.00	5.00	6.00	6.00	8.00	▁▇▇▂▁

3.1 Tidy Columns & Handle Missing

Standardize names with clean_names().
Confirm there are no structural NAs; if any, I’ll drop rows (dataset is small).
Engineer a binary target quality_label for classification.

# Drop rows with any NA (no missing values in this dataset)
wine <- wine %>% drop_na()

# Create a classification label: Good (>=6) vs Poor (<6)
wine <- wine %>%
  mutate(
    quality_label = if_else(quality >= 6, "Good", "Poor") %>% factor(levels = c("Poor", "Good"))
  )

# Quick sanity checks
summary(select(wine, quality, quality_label))

##     quality      quality_label
##  Min.   :3.000   Poor:744     
##  1st Qu.:5.000   Good:855     
##  Median :6.000                
##  Mean   :5.636                
##  3rd Qu.:6.000                
##  Max.   :8.000

Note: The dataset has 11 physicochemical inputs (e.g., alcohol, volatile_acidity, sulphates) and a target quality (0–10). Most entries cluster around 5–6.

4 Exploratory Data Analysis (EDA)

Once the dataset was clean, I explored the distributions of key chemical properties such as acidity, sugar, and alcohol content. I also analyzed how these variables related to wine quality. Through visualizations, I observed clear patterns, for example, higher alcohol content often aligned with higher quality scores. This gave me a sense of which predictors might play the most important roles in modeling.

4.1 Distributions

wine %>%
  pivot_longer(cols = -c(quality, quality_label), names_to = "feature", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ feature, scales = "free", ncol = 4) +
  labs(title = "Feature Distributions", x = NULL, y = "Count")

4.2 Bivariate: Alcohol vs Quality

wine %>%
  ggplot(aes(alcohol, quality)) +
  geom_point(alpha = 0.35) +
  geom_smooth(method = "lm", se = FALSE) +
  labs(title = "",
       x = "Alcohol (%)", y = "Quality (0–10)")

## `geom_smooth()` using formula = 'y ~ x'

Read: Alcohol is Positively Associated with Quality.

4.3 Correlation Heatmap

num_df <- wine %>% select(where(is.numeric))

corr <- round(cor(num_df), 2)

corr %>%
  as.data.frame() %>%
  rownames_to_column("var1") %>%
  pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
  ggplot(aes(var1, var2, fill = r)) +
  geom_tile(color = "white") +
  scale_fill_gradient2(low = "#ef4444", mid = "#f3f4f6", high = "#22c55e", midpoint = 0, limits = c(-1,1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(title = "Correlation Heatmap", x = NULL, y = NULL, fill = "r")

Note: Alcohol and sulphates correlate positively with quality; volatile acidity correlates negatively.

5 Train/Test Split

I hold out 20% for evaluation to avoid optimistic bias.

split <- initial_split(wine, prop = 0.8, strata = quality_label)
train <- training(split)
test  <- testing(split)

nrow(train); nrow(test)

## [1] 1279

## [1] 320

I created two models to better understand the data: a linear regression model and a logistic regression model. The linear regression helped me predict the wine quality score directly, while the logistic regression allowed me to classify wines into low and high quality categories. By comparing the results, I was able to highlight which features had the strongest effects on quality. Alcohol and volatile acidity emerged as key predictors.

6 Model A — Linear Regression (Predict Numeric Quality)

6.1 Fit

lm_fit <- lm(quality ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density, data = train)

broom::tidy(lm_fit)

6.2 Diagnostics

par(mfrow = c(2,2))
plot(lm_fit)

par(mfrow = c(1,1))

6.3 Performance on Test Set

lm_preds <- tibble(
  truth = test$quality,
  .pred = predict(lm_fit, newdata = test)
)

metrics(lm_preds, truth = truth, estimate = .pred)

rmse(lm_preds, truth = truth, estimate = .pred)

rsq_trad(lm_preds, truth = truth, estimate = .pred)

Interpretation: The estimate value of 0.3646 means that, holding other variables constant, a one-unit increase in the predictor (for example, alcohol content) is associated with an average increase of about 0.36 points in the wine quality score. Since the quality score is on a numeric scale, this tells me that the predictor has a modest but meaningful positive impact on wine quality.

7 Model B — Logistic Regression (Good vs Poor)

7.1 Fit

glm_fit <- glm(quality_label ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density,
               data = train, family = binomial())

broom::tidy(glm_fit, exponentiate = TRUE, conf.int = TRUE) %>%
  arrange(desc(estimate))

7.2 Predict & Evaluate

prob <- predict(glm_fit, newdata = test, type = "response")
cls  <- if_else(prob >= 0.5, "Good", "Poor") %>% factor(levels = levels(test$quality_label))

results <- tibble(truth = test$quality_label, .pred_prob = prob, .pred_class = cls)

# Confusion Matrix & Accuracy
conf_mat(results, truth, .pred_class)

##           Truth
## Prediction Poor Good
##       Poor  112   46
##       Good   37  125

accuracy(results, truth, .pred_class)

roc_auc(results, truth, .pred_prob, event_level = "second")

Interpretation: The coefficient of 0.8173 represents the log-odds effect of the predictor on the probability of a wine being classified as high quality. Converting this into odds, exp(0.8173) ≈ 2.26. This means that for each one-unit increase in the predictor, the odds of a wine being high quality are about 2.26 times higher. This effect is much stronger compared to the linear model’s incremental increase.

7.3 Simple Variable Importance (Standardized Coefficients)

# Standardize numeric features for comparability
scale_cols <- c("alcohol", "sulphates", "volatile_acidity", "total_sulfur_dioxide", "citric_acid", "density")
train_s <- train %>% mutate(across(all_of(scale_cols), scale))

glm_std <- glm(quality_label ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density,
               data = train_s, family = binomial())

imp <- broom::tidy(glm_std) %>% filter(term != "(Intercept)") %>% mutate(abs_stat = abs(statistic))

imp %>%
  ggplot(aes(reorder(term, abs_stat), abs_stat)) +
  geom_col() +
  coord_flip() +
  labs(title = "Standardized Importance (|z|)", x = NULL, y = "|z-statistic|")

Interpretation: When I standardized the predictors and compared their relative importance, alcohol emerged as the strongest driver of wine quality, confirming that higher alcohol content is consistently associated with better-rated wines. The next most influential factor was volatile acidity, but in this case the relationship is negative, meaning that as volatile acidity increases, wine quality tends to decrease. Sulphates also showed a positive role, acting as a stabilizer that improves the perception of wine quality.

After that, total sulphur dioxide and citric acid had smaller but still noticeable impacts, suggesting that their presence matters but not nearly as much as alcohol or acidity levels. Finally, density ranked lowest in importance, implying that it contributes very little to predicting wine quality once other variables are accounted for.

8 7. Business-Facing Insights

The analysis clearly shows that alcohol content is the most powerful driver of wine quality, meaning that wines with higher alcohol levels are consistently rated better by consumers. The second most important factor is volatile acidity, but here the relationship is negative—wines with high volatile acidity are perceived as lower quality, highlighting the need to control acidity during production. Sulphates also play a strong positive role, enhancing preservation and sensory appeal.

Beyond these top three factors, total sulphur dioxide and citric acid influence quality but to a lesser extent, suggesting they can fine-tune the product rather than define it. Density, however, shows minimal importance, indicating it should not be a priority for quality optimization.

For producers, this means focusing investment and quality control on balancing alcohol levels, managing volatile acidity, and optimizing sulphates, as these have the greatest impact on consumer perception and ultimately, market success.

Limitations: The dataset covers a specific region/production process; quality scores are subjective; non-linear interactions are not fully captured by GLM/LM.

Case Study: Red Wine Quality

Maooz Abdullah

2025-08-28

1 Abstract

2 Setup

3 Data Ingestion & Cleaning

3.1 Tidy Columns & Handle Missing

4 Exploratory Data Analysis (EDA)

4.1 Distributions

4.2 Bivariate: Alcohol vs Quality

4.3 Correlation Heatmap

5 Train/Test Split

6 Model A — Linear Regression (Predict Numeric Quality)

6.1 Fit

6.2 Diagnostics

6.3 Performance on Test Set

7 Model B — Logistic Regression (Good vs Poor)

7.1 Fit

7.2 Predict & Evaluate

7.3 Simple Variable Importance (Standardized Coefficients)

8 7. Business-Facing Insights