I conducted this case study to demonstrate how I can transform a messy dataset into actionable insights. I chose the Red Wine Quality dataset, which originally contained issues such as inconsistent column names, missing values, and unscaled variables. My goal was to walk through the full data science workflow, from cleaning to modeling to interpretation.
Deliverables: Reproducible RMarkdown, visual EDA, model diagnostics, and plain-language recommendations.
Setting up Libraries
# Core packages
libs <- c(
"tidyverse", "janitor", "skimr", "broom", "rsample", "yardstick", "ggplot2", "scales"
)
need <- libs[!libs %in% installed.packages()[, "Package"]]
if (length(need)) install.packages(need, quiet = TRUE)
invisible(lapply(libs, library, character.only = TRUE))
options(dplyr.width = Inf)
set.seed(123)I’ll begin by loading the red wine data. The UCI file uses a semicolon separator and includes lowercase column names with spaces; I’ll normalize these names and confirm types.
# Try to download from UCI; if offline, look for a local copy next to this Rmd
uci_url <- "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-red.csv"
local_path <- "winequality-red.csv"
if (!file.exists(local_path)) {
try(download.file(uci_url, destfile = local_path, mode = "wb", quiet = TRUE), silent = TRUE)
}
wine_raw <- readr::read_delim(uci_url, delim = ";", show_col_types = FALSE)
# Clean column names and basic checks
wine <- wine_raw %>%
janitor::clean_names() %>%
mutate(across(everything(), ~ .x))
skimr::skim(wine)| Name | wine |
| Number of rows | 1599 |
| Number of columns | 12 |
| _______________________ | |
| Column type frequency: | |
| numeric | 12 |
| ________________________ | |
| Group variables | None |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 0 | 1 | 8.32 | 1.74 | 4.60 | 7.10 | 7.90 | 9.20 | 15.90 | ▂▇▂▁▁ |
| volatile_acidity | 0 | 1 | 0.53 | 0.18 | 0.12 | 0.39 | 0.52 | 0.64 | 1.58 | ▅▇▂▁▁ |
| citric_acid | 0 | 1 | 0.27 | 0.19 | 0.00 | 0.09 | 0.26 | 0.42 | 1.00 | ▇▆▅▁▁ |
| residual_sugar | 0 | 1 | 2.54 | 1.41 | 0.90 | 1.90 | 2.20 | 2.60 | 15.50 | ▇▁▁▁▁ |
| chlorides | 0 | 1 | 0.09 | 0.05 | 0.01 | 0.07 | 0.08 | 0.09 | 0.61 | ▇▁▁▁▁ |
| free_sulfur_dioxide | 0 | 1 | 15.87 | 10.46 | 1.00 | 7.00 | 14.00 | 21.00 | 72.00 | ▇▅▁▁▁ |
| total_sulfur_dioxide | 0 | 1 | 46.47 | 32.90 | 6.00 | 22.00 | 38.00 | 62.00 | 289.00 | ▇▂▁▁▁ |
| density | 0 | 1 | 1.00 | 0.00 | 0.99 | 1.00 | 1.00 | 1.00 | 1.00 | ▁▃▇▂▁ |
| p_h | 0 | 1 | 3.31 | 0.15 | 2.74 | 3.21 | 3.31 | 3.40 | 4.01 | ▁▅▇▂▁ |
| sulphates | 0 | 1 | 0.66 | 0.17 | 0.33 | 0.55 | 0.62 | 0.73 | 2.00 | ▇▅▁▁▁ |
| alcohol | 0 | 1 | 10.42 | 1.07 | 8.40 | 9.50 | 10.20 | 11.10 | 14.90 | ▇▇▃▁▁ |
| quality | 0 | 1 | 5.64 | 0.81 | 3.00 | 5.00 | 6.00 | 6.00 | 8.00 | ▁▇▇▂▁ |
clean_names().quality_label for
classification.# Drop rows with any NA (no missing values in this dataset)
wine <- wine %>% drop_na()
# Create a classification label: Good (>=6) vs Poor (<6)
wine <- wine %>%
mutate(
quality_label = if_else(quality >= 6, "Good", "Poor") %>% factor(levels = c("Poor", "Good"))
)
# Quick sanity checks
summary(select(wine, quality, quality_label))## quality quality_label
## Min. :3.000 Poor:744
## 1st Qu.:5.000 Good:855
## Median :6.000
## Mean :5.636
## 3rd Qu.:6.000
## Max. :8.000
Note: The dataset has 11 physicochemical inputs
(e.g., alcohol, volatile_acidity, sulphates)
and a target quality (0–10). Most entries cluster around
5–6.
Once the dataset was clean, I explored the distributions of key chemical properties such as acidity, sugar, and alcohol content. I also analyzed how these variables related to wine quality. Through visualizations, I observed clear patterns, for example, higher alcohol content often aligned with higher quality scores. This gave me a sense of which predictors might play the most important roles in modeling.
wine %>%
pivot_longer(cols = -c(quality, quality_label), names_to = "feature", values_to = "value") %>%
ggplot(aes(value)) +
geom_histogram(bins = 30) +
facet_wrap(~ feature, scales = "free", ncol = 4) +
labs(title = "Feature Distributions", x = NULL, y = "Count")wine %>%
ggplot(aes(alcohol, quality)) +
geom_point(alpha = 0.35) +
geom_smooth(method = "lm", se = FALSE) +
labs(title = "",
x = "Alcohol (%)", y = "Quality (0–10)")## `geom_smooth()` using formula = 'y ~ x'
Read: Alcohol is Positively Associated with Quality.
num_df <- wine %>% select(where(is.numeric))
corr <- round(cor(num_df), 2)
corr %>%
as.data.frame() %>%
rownames_to_column("var1") %>%
pivot_longer(-var1, names_to = "var2", values_to = "r") %>%
ggplot(aes(var1, var2, fill = r)) +
geom_tile(color = "white") +
scale_fill_gradient2(low = "#ef4444", mid = "#f3f4f6", high = "#22c55e", midpoint = 0, limits = c(-1,1)) +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
labs(title = "Correlation Heatmap", x = NULL, y = NULL, fill = "r")Note: Alcohol and sulphates correlate positively with quality; volatile acidity correlates negatively.
I hold out 20% for evaluation to avoid optimistic bias.
split <- initial_split(wine, prop = 0.8, strata = quality_label)
train <- training(split)
test <- testing(split)
nrow(train); nrow(test)## [1] 1279
## [1] 320
I created two models to better understand the data: a linear regression model and a logistic regression model. The linear regression helped me predict the wine quality score directly, while the logistic regression allowed me to classify wines into low and high quality categories. By comparing the results, I was able to highlight which features had the strongest effects on quality. Alcohol and volatile acidity emerged as key predictors.
lm_fit <- lm(quality ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density, data = train)
broom::tidy(lm_fit)lm_preds <- tibble(
truth = test$quality,
.pred = predict(lm_fit, newdata = test)
)
metrics(lm_preds, truth = truth, estimate = .pred)Interpretation: The estimate value of 0.3646 means that, holding other variables constant, a one-unit increase in the predictor (for example, alcohol content) is associated with an average increase of about 0.36 points in the wine quality score. Since the quality score is on a numeric scale, this tells me that the predictor has a modest but meaningful positive impact on wine quality.
glm_fit <- glm(quality_label ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density,
data = train, family = binomial())
broom::tidy(glm_fit, exponentiate = TRUE, conf.int = TRUE) %>%
arrange(desc(estimate))prob <- predict(glm_fit, newdata = test, type = "response")
cls <- if_else(prob >= 0.5, "Good", "Poor") %>% factor(levels = levels(test$quality_label))
results <- tibble(truth = test$quality_label, .pred_prob = prob, .pred_class = cls)
# Confusion Matrix & Accuracy
conf_mat(results, truth, .pred_class)## Truth
## Prediction Poor Good
## Poor 112 46
## Good 37 125
Interpretation: The coefficient of 0.8173 represents the log-odds effect of the predictor on the probability of a wine being classified as high quality. Converting this into odds, exp(0.8173) ≈ 2.26. This means that for each one-unit increase in the predictor, the odds of a wine being high quality are about 2.26 times higher. This effect is much stronger compared to the linear model’s incremental increase.
# Standardize numeric features for comparability
scale_cols <- c("alcohol", "sulphates", "volatile_acidity", "total_sulfur_dioxide", "citric_acid", "density")
train_s <- train %>% mutate(across(all_of(scale_cols), scale))
glm_std <- glm(quality_label ~ alcohol + sulphates + volatile_acidity + total_sulfur_dioxide + citric_acid + density,
data = train_s, family = binomial())
imp <- broom::tidy(glm_std) %>% filter(term != "(Intercept)") %>% mutate(abs_stat = abs(statistic))
imp %>%
ggplot(aes(reorder(term, abs_stat), abs_stat)) +
geom_col() +
coord_flip() +
labs(title = "Standardized Importance (|z|)", x = NULL, y = "|z-statistic|")Interpretation: When I standardized the predictors and compared their relative importance, alcohol emerged as the strongest driver of wine quality, confirming that higher alcohol content is consistently associated with better-rated wines. The next most influential factor was volatile acidity, but in this case the relationship is negative, meaning that as volatile acidity increases, wine quality tends to decrease. Sulphates also showed a positive role, acting as a stabilizer that improves the perception of wine quality.
After that, total sulphur dioxide and citric acid had smaller but still noticeable impacts, suggesting that their presence matters but not nearly as much as alcohol or acidity levels. Finally, density ranked lowest in importance, implying that it contributes very little to predicting wine quality once other variables are accounted for.
The analysis clearly shows that alcohol content is the most powerful driver of wine quality, meaning that wines with higher alcohol levels are consistently rated better by consumers. The second most important factor is volatile acidity, but here the relationship is negative—wines with high volatile acidity are perceived as lower quality, highlighting the need to control acidity during production. Sulphates also play a strong positive role, enhancing preservation and sensory appeal.
Beyond these top three factors, total sulphur dioxide and citric acid influence quality but to a lesser extent, suggesting they can fine-tune the product rather than define it. Density, however, shows minimal importance, indicating it should not be a priority for quality optimization.
For producers, this means focusing investment and quality control on balancing alcohol levels, managing volatile acidity, and optimizing sulphates, as these have the greatest impact on consumer perception and ultimately, market success.
Limitations: The dataset covers a specific region/production process; quality scores are subjective; non-linear interactions are not fully captured by GLM/LM.