Instructions
Exploratory analysis and essay
Assignment
Choose a dataset
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.
To complete this task:
Describe the problem you are trying to solve.
Describe your dataset and what you did to prepare the data for analysis.
Describe the methodologies you used for analyzing the data.
Explain the purpose of the analysis performed.
Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.
Introduction
We load a dataset of red wine quality to predict whether a wine is “High” or “Low” quality based on chemical properties. Predicting wine quality is valuable for producers to optimize blends, manage production processes, and guide marketing decisions. We will compare classification models using all original features versus a reduced set of features derived via Principal Component Analysis (PCA). PCA can reduce correlated predictors and simplify the feature space, potentially improving predictive performance.
From a business perspective, a reliable quality predictor has three practical uses. First, it can flag batches that need further inspection before bottling, lowering the risk of costly recalls. Second, it can guide recipe adjustments to increase the share of high-quality bottles. Third, it provides marketing and pricing signals by identifying which chemical factors are most associated with premium ratings.
Technically, the analysis compares two approaches. One approach trains classifiers directly on the original features. The other reduces dimensionality with principal component analysis and then trains classifiers on the principal components. We implement a Random Forest and a Support Vector Machine for each approach so we can compare model performance, robustness to correlated inputs, and interpretability.
The dataset used for this analysis is the Red Wine Quality dataset, publicly available on Kaggle (https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009). This dataset was selected because it contains detailed chemical measurements for 1,599 red wine samples.
This dashboard will walk through the data, show exploratory analysis and preprocessing decisions, present model training and evaluation, and close with recommendations tied to business outcomes.
# Load required libraries
library(tidyverse)    # dplyr, ggplot2, tidyr, tibble
library(janitor)      # clean_names()
library(knitr)        # kable()
library(psych)        # describe()
library(DataExplorer) # introduce()
library(caret)        # train(), varImp(), confusionMatrix(), nearZeroVar()
library(e1071)        # svm(), tune(), tune.control()
library(ggcorrplot)   # correlation heatmap
library(ggbiplot)     # PCA biplot
# Set up plotting theme and palette
theme_set(theme_classic())
pal <- RColorBrewer::brewer.pal(n = 8, name = "Dark2")
# Load data
wine_qual_df <- read.csv("winequality.csv")
# Clean column names to snake_case
wine_qual_df <- janitor::clean_names(wine_qual_df)
# Rename p_h to ph for readability
wine_qual_df <- wine_qual_df |>
  rename(ph = p_h) |>
  # Reorder columns for readability
  relocate(quality, .before = fixed_acidity) |>
  relocate(ph, .before = sulphates)
Below are the first 10 observations in the dataset; for the sake of readability, we only display the first 10 columns.
| quality | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 |
| 5 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 |
| 6 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 |
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 |
| 5 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 |
| 7 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9 | 18 | 0.9968 | 3.36 |
| 5 | 7.5 | 0.50 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 |
The first column, quality, is the response variable we aim to predict. Although wine quality is scored from 0 to 10, this dataset only includes ratings between 3 and 8. To support a supervised classification task, we convert the original numeric score into a binary factor. Wines rated below 6 are labeled “Low,” and wines rated 6 or higher are labeled “High.” This threshold is commonly used in wine-quality studies and helps simplify the prediction problem while preserving meaningful differences in quality.
In addition to the response variable, there are 11 numeric predictor variables, summarized below.
wine_qual_df <- wine_qual_df |>
  mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
                          levels = c("Low", "High")))
classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "numeric")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| numeric | 11 | alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity |
Exploratory Data Analysis
Before building any models, it’s important to understand the structure of the red wine dataset and review the basic properties of the predictors and the target variable. This step helps confirm data quality, check class balance, and identify patterns that may influence model performance. Looking at distributions, correlations, and summary statistics gives a clearer sense of which chemical features differ between high- and low-quality wines and highlights where PCA may help by reducing overlap and multicollinearity among predictors.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| Metric | Value |
|---|---|
| rows | 1599 |
| columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.
qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs = nrow(wine_qual_df)
p1 <- wine_qual_df |>
ggplot(aes(x = quality)) +
geom_bar(aes(color = quality, fill = quality)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Low- & High- Quality Wines",
x = "QUALITY",
y = "COUNT") +
theme(legend.position = "none")
p1
The proportions of low-quality and high-quality wines are similar, so there is no class imbalance issue to address here. We summarize the distributions of the predictors below.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
| | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| fixed_acidity | 8.3196373 | 1.7410963 | 7.90000 | 4.60000 | 15.90000 | 11.30000 |
| volatile_acidity | 0.5278205 | 0.1790597 | 0.52000 | 0.12000 | 1.58000 | 1.46000 |
| citric_acid | 0.2709756 | 0.1948011 | 0.26000 | 0.00000 | 1.00000 | 1.00000 |
| residual_sugar | 2.5388055 | 1.4099281 | 2.20000 | 0.90000 | 15.50000 | 14.60000 |
| chlorides | 0.0874665 | 0.0470653 | 0.07900 | 0.01200 | 0.61100 | 0.59900 |
| free_sulfur_dioxide | 15.8749218 | 10.4601570 | 14.00000 | 1.00000 | 72.00000 | 71.00000 |
| total_sulfur_dioxide | 46.4677924 | 32.8953245 | 38.00000 | 6.00000 | 289.00000 | 283.00000 |
| density | 0.9967467 | 0.0018873 | 0.99675 | 0.99007 | 1.00369 | 0.01362 |
| ph | 3.3111132 | 0.1543865 | 3.31000 | 2.74000 | 4.01000 | 1.27000 |
| sulphates | 0.6581488 | 0.1695070 | 0.62000 | 0.33000 | 2.00000 | 1.67000 |
| alcohol | 10.4229831 | 1.0656676 | 10.20000 | 8.40000 | 14.90000 | 6.50000 |
None of the distributions appears degenerate, but we nonetheless confirm that there are no near-zero variance predictors.
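The code for this check is not shown in the text; a minimal sketch using caret's nearZeroVar(), which returns TRUE when no predictor is flagged, would be (we assume this is equivalent to how the result below was produced):
# TRUE when no predictor has near-zero variance
length(caret::nearZeroVar(wine_qual_df |> select(-quality))) == 0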
[1] TRUE
There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.
skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- wine_qual_piv |>
ggplot(aes(x = VALUE, color = quality, fill = quality)) +
geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
alpha = 0.5) +
geom_histogram(data = subset(wine_qual_piv, quality == "High"),
alpha = 0.5) +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
labs(title = "Distribution of Predictors by Wine Quality",
y = "COUNT") +
theme(legend.position = "top")
p2
We can see that the means for alcohol and volatile_acidity differ between low-quality and high-quality wines. We can also see some skewed distributions, but we won't apply any transformations, since the focus of our analysis is the impact of reducing the feature space on model performance.
We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. To avoid clutter, only correlations greater than 0.1 in absolute value are displayed.
plot_corr_range <- function(df, mn = 0.1, mx = 1.0, excl = c(NA), subtitle = NULL, caption = NULL) {
palette <- RColorBrewer::brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
# Main title
main_title <- sprintf("Correlation Heatmap (|r| Between %.1f and %.1f)", mn, mx)
# Correlation matrix
r <- model.matrix(~0 + ., data = df) |>
cor() |>
round(2)
is.na(r) <- abs(r) > mx
is.na(r) <- abs(r) < mn
if (!all(is.na(excl))) {
r <- as.data.frame(r) |>
select(-all_of(excl)) |>
filter(!rownames(r) %in% excl)
}
p <- r |>
ggcorrplot::ggcorrplot(show.diag = FALSE,
type = "lower",
lab = TRUE,
lab_size = 3,
tl.cex = 10,
tl.srt = 90,
colors = palette,
outline.color = "white") +
labs(
title = main_title,
subtitle = subtitle,
caption = caption
) +
theme(
plot.title.position = "plot",
plot.subtitle = ggplot2::element_text(size = 10, margin = ggplot2::margin(b = 5)),
plot.caption = ggplot2::element_text(size = 9, color = "gray40")
)
p
}
# Generate the correlation heatmap for the wine data
p3 <- plot_corr_range(
df = wine_qual_df,
excl = c("qualityLow"),
subtitle = "Only showing correlations with absolute value greater than 0.1",
caption = "Correlations based on numeric predictors in the red wine dataset"
)
p3
Alcohol has the strongest positive correlation with high-quality wines, and volatile acidity has the strongest negative correlation. Smaller positive correlations appear for fixed acidity, citric acid, and sulphates, while chlorides, total sulfur dioxide, and density show negative correlations.
There are also notable predictor–predictor correlations, such as:
fixed acidity with citric acid, density, and pH
total sulfur dioxide with free sulfur dioxide
We keep all predictors in the dataset because the main goal is to compare models trained on the original features with models trained on features derived via PCA.
Data Preparation
In this section, we prepare the dataset for modeling. We randomly split the data into a training set (70%) and a test set (30%) using a fixed seed to ensure reproducibility. After splitting, we check that the class distributions of high- and low-quality wines are similar across the original, training, and test sets. This ensures that the models trained on the training set are representative and can be reliably evaluated on the test set.
set.seed(1006)
train_df <- wine_qual_df |> slice_sample(prop = 0.7)
# Note: dplyr::setdiff() returns distinct rows only, so duplicate observations are dropped from the test set
test_df <- setdiff(wine_qual_df, train_df)
We confirm the class distributions are similar in the original, train, and test sets.
class_dist <- tibble(
Quality = c("Low", "High"),
Original = round(prop.table(table(wine_qual_df$quality)), 2),
Train = round(prop.table(table(train_df$quality)), 2),
Test = round(prop.table(table(test_df$quality)), 2)
)
kable(class_dist, format = "simple")
| Quality | Original | Train | Test |
|---|---|---|---|
| Low | 0.47 | 0.47 | 0.46 |
| High | 0.53 | 0.53 | 0.54 |
The class distributions of low- and high-quality wines are consistent across the original dataset, the training set, and the test set. This indicates that the random split maintained the balance between the two classes, ensuring that the models trained on the training set will be representative and that performance evaluation on the test set will be reliable. There is no need for additional balancing techniques before modeling.
Model Building
We create predictive models to classify wines as high or low quality based on their chemical properties. We start with two widely used approaches: a Random Forest (RF) model and a Support Vector Machine (SVM) model.
The Random Forest model is an ensemble method that combines multiple decision trees to improve predictive accuracy and reduce overfitting.
The SVM model, on the other hand, identifies the optimal hyperplane that separates the two classes in the feature space.
Both models are built using all original features, which allows us to establish a baseline for comparison before exploring feature reduction techniques like Principal Component Analysis (PCA).
Random Forest Model (Original Features)
We trained a Random Forest model using all the original features to classify wine quality. The model provides a measure of each predictor's contribution to the classification. As expected from the earlier correlation analysis, alcohol is the most important feature for predicting wine quality. Interestingly, residual_sugar has a scaled importance of 0, meaning it contributed the least of any predictor (varImp rescales importances to a 0-100 range, so the weakest predictor is pinned at 0). This aligns with the idea that some predictors may have little influence on the target even if they are part of the dataset.
fn <- "rf_orig.rds"
if (!file.exists(fn)){
rf_orig <- train(
quality ~ .,
data = train_df,
metric = "Accuracy",
method = "rf",
trControl = trainControl(method = "none"),
# trainControl(method = "none") requires a single-row tuning grid; use the default mtry
tuneGrid = expand.grid(.mtry = floor(sqrt(ncol(train_df) - 1)))
)
saveRDS(rf_orig, fn)
} else {
rf_orig <- readRDS(fn)
}
# Feature importance
rf_orig_imp <- varImp(rf_orig, scale = TRUE)$importance |>
rownames_to_column("Predictor") |>
arrange(desc(Overall))
rf_orig_imp$Overall <- round(rf_orig_imp$Overall, 3)
kable(rf_orig_imp, format = "simple")
| Predictor | Overall |
|---|---|
| alcohol | 100.000 |
| sulphates | 55.745 |
| volatile_acidity | 49.291 |
| total_sulfur_dioxide | 39.393 |
| density | 28.842 |
| fixed_acidity | 19.203 |
| chlorides | 17.145 |
| ph | 12.120 |
| citric_acid | 7.938 |
| free_sulfur_dioxide | 2.028 |
| residual_sugar | 0.000 |
The Random Forest model confirms our earlier observations from the correlation analysis: alcohol is the strongest predictor of wine quality, while residual sugar has little to no impact. This highlights that certain chemical properties carry more predictive weight than others, and the model relies primarily on these key features to distinguish between low- and high-quality wines.
Support Vector Machine Model (Original Features)
Next, we build a Support Vector Machine (SVM) model using a radial basis function (RBF) kernel. SVMs are effective for classification tasks, especially when the data is not linearly separable, because the RBF kernel can capture complex relationships between predictors and the response. We use all the original features in this model and tune hyperparameters to achieve the best classification performance. The summary below shows the performance of the tuned SVM model.
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
svm_tune_orig <- tune(svm, quality ~ .,
data = train_df, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_orig <- svm_tune_orig$best.model
saveRDS(svm_orig, fn)
}else{
svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
col1 <- c("call", "cost", "gamma", "num_classes", "classes",
"support_vectors_total", "support_vectors_split")
subset <- c("call", "cost", "gamma", "nclasses", "levels",
"tot.nSV", "nSV")
col2 <- svm_model[subset]
copy <- col2
for (i in 1:length(copy)){
if (is.vector(copy[[i]])){
col2[[i]] <- paste(col2[[i]], collapse = ", ")
}
}
summ <- as.data.frame(cbind(col1, col2))
rownames(summ) <- NULL
colnames(summ) <- c("Parameter", "Value")
summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”) |
| cost | 1 |
| gamma | 0.5 |
| num_classes | 2 |
| classes | Low, High |
| support_vectors_total | 877 |
| support_vectors_split | 442, 435 |
The SVM model performed best with a cost of 1 and gamma of 0.5, capturing nonlinear relationships while using all original features.
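For transparency, the full grid of cross-validated results can also be inspected. This is a minimal sketch assuming the tuning object svm_tune_orig is still in memory (i.e., the model was fit in this session rather than loaded from the cached .rds file):
# Cross-validated error for every cost/gamma combination tried during tuning
svm_tune_orig$performances |>
  arrange(error) |>
  head()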
Principal Component Analysis (PCA)
We use PCA to reduce feature dimensionality while scaling all predictors to avoid dominance by high-variance variables. The first few components capture moderate variance, so we retain up to eight for modeling. Biplots show some class separation, and we create new train and test sets with PCA features to build Random Forest and SVM models.
mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
| Predictor | Mean | Variance |
|---|---|---|
| fixed_acidity | 8.32 | 3.03 |
| volatile_acidity | 0.53 | 0.03 |
| citric_acid | 0.27 | 0.04 |
| residual_sugar | 2.54 | 1.99 |
| chlorides | 0.09 | 0.00 |
| free_sulfur_dioxide | 15.87 | 109.41 |
| total_sulfur_dioxide | 46.47 | 1082.10 |
| density | 1.00 | 0.00 |
| ph | 3.31 | 0.02 |
| sulphates | 0.66 | 0.03 |
| alcohol | 10.42 | 1.14 |
Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variance than the other variables, so we center and scale all predictors to prevent these two from dominating the principal components. Below is the rotation, i.e., the matrix of variable loadings.
pca.out <- prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
| | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 0.49 | -0.11 | 0.12 | -0.23 | 0.08 | -0.10 | 0.35 | -0.18 | 0.19 | 0.25 | 0.64 |
| volatile_acidity | -0.24 | 0.27 | 0.45 | 0.08 | -0.22 | -0.41 | 0.53 | -0.08 | -0.13 | -0.37 | 0.00 |
| citric_acid | 0.46 | -0.15 | -0.24 | -0.08 | 0.06 | -0.07 | -0.11 | -0.38 | -0.38 | -0.62 | -0.07 |
| residual_sugar | 0.15 | 0.27 | -0.10 | -0.37 | -0.73 | -0.05 | -0.29 | 0.30 | 0.01 | -0.09 | 0.18 |
| chlorides | 0.21 | 0.15 | 0.09 | 0.67 | -0.25 | -0.30 | -0.37 | -0.36 | 0.11 | 0.22 | 0.05 |
| free_sulfur_dioxide | -0.04 | 0.51 | -0.43 | -0.04 | 0.16 | 0.01 | 0.12 | -0.20 | 0.64 | -0.25 | -0.05 |
| total_sulfur_dioxide | 0.02 | 0.57 | -0.32 | -0.03 | 0.22 | -0.14 | 0.09 | 0.02 | -0.59 | 0.37 | 0.07 |
| density | 0.40 | 0.23 | 0.34 | -0.17 | -0.16 | 0.39 | 0.17 | -0.24 | 0.02 | 0.24 | -0.57 |
| ph | -0.44 | 0.01 | -0.06 | 0.00 | -0.27 | 0.52 | 0.03 | -0.56 | -0.17 | 0.01 | 0.34 |
| sulphates | 0.24 | -0.04 | -0.28 | 0.55 | -0.23 | 0.38 | 0.45 | 0.37 | -0.06 | -0.11 | 0.07 |
| alcohol | -0.11 | -0.39 | -0.47 | -0.12 | -0.35 | -0.36 | 0.33 | -0.22 | 0.04 | 0.30 | -0.31 |
We generate a scree plot so that we can see the variance explained by each principal component.
scree_plot_df = as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) = c("var_explained")
scree_plot_df <- scree_plot_df |>
rowid_to_column(var = "pc")
p4 <- scree_plot_df |>
ggplot(aes(x = pc, y = var_explained)) +
geom_line(color = "steelblue", linewidth = 1.5) +
geom_point(size = 4, color = "darkred") +
scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1), labels = scales::percent) +
labs(
title = "Variance Explained by Principal Components",
subtitle = "Scree plot showing the proportion of variance captured by each component",
x = "Principal Component",
y = "Variance Explained",
caption = "Data: Red Wine Quality Dataset"
) +
theme_minimal(base_size = 15) +
theme(
plot.title = element_text(face = "bold", size = 14),
plot.subtitle = element_text(size = 11, color = "gray30"),
plot.caption = element_text(size = 12, color = "gray50"),
axis.text = element_text(size = 12),
axis.title = element_text(size = 12)
)
p4
The first principal component explains a little less than 30 percent of the variance in our data, and the second explains a little less than 20 percent. Since those values are relatively low, we don't expect models built using only the first two principal components to be sufficient, so we will go up to eight principal components.
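As a quick supplementary check (not part of the original output), the cumulative proportion of variance explained can be computed directly from the PCA object to see how much of the total variance the first eight components retain:
# Cumulative proportion of variance explained by the principal components
cum_var <- cumsum(pca.out$sdev^2) / sum(pca.out$sdev^2)
round(cum_var, 3)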
For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.
p5 <- pca.out |>
ggbiplot(obs.scale = 1, var.scale = 1,
groups = wine_qual_df$quality,
ellipse = TRUE,
ellipse.alpha = 0.0) +
theme(legend.direction = "horizontal",
legend.position = "top") +
labs(fill = "Quality",
color = "Quality",
title = "Biplot of First Two Principal Components",
subtitle = "Shows separation of low- and high-quality wines",
caption = "PC1 and PC2 explain ~50% of the total variance"
) +
scale_color_discrete(type = qual_cols)
p5
There is a lot of overlap between low-quality and high-quality wines when plotted on just the first two principal components, but we do see some separation between the classes, which is promising.
We create new versions of the train and test sets using the features derived via PCA instead of the original features.
train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
bind_cols(train_df |> select(quality)) |>
relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
bind_cols(test_df |> select(quality)) |>
relocate(quality, .before = "PC1")
Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, four, six, or eight principal components.
Random Forest Model (Principal Components)
Next, we build a Random Forest model using the features derived from Principal Component Analysis. By replacing the original correlated predictors with principal components, we aim to simplify the feature space while retaining most of the variance in the data. This model will help us evaluate whether using a reduced set of components can achieve similar classification performance compared to using all original features.
fn <- "rf_2pc.rds"
if (!file.exists(fn)){
rf_2pc <- train(quality ~ PC1 + PC2,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 1))
saveRDS(rf_2pc, fn)
}else{
rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_4pc, fn)
}else{
rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_6pc, fn)
}else{
rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_8pc, fn)
}else{
rf_8pc <- readRDS(fn)
}
A summary of the relative importance of the principal components in each of these Random Forest models is below.
# Helper: rank the principal components by their Random Forest importance
rank_rf_importance <- function(model, label) {
  varImp(model, scale = TRUE)$importance |>
    rownames_to_column("Predictor") |>
    rename(Importance = Overall) |>
    arrange(desc(Importance)) |>
    rowid_to_column(var = "Rank") |>
    mutate(Model = label) |>
    select(-Importance)
}
rf_pc_imp_df <- bind_rows(
  rank_rf_importance(rf_2pc, "Random Forest (2 Principal Components)"),
  rank_rf_importance(rf_4pc, "Random Forest (4 Principal Components)"),
  rank_rf_importance(rf_6pc, "Random Forest (6 Principal Components)"),
  rank_rf_importance(rf_8pc, "Random Forest (8 Principal Components)")
) |>
  pivot_wider(names_from = Rank, values_from = Predictor,
              names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
| Model | Rank_1 | Rank_2 | Rank_3 | Rank_4 | Rank_5 | Rank_6 | Rank_7 | Rank_8 |
|---|---|---|---|---|---|---|---|---|
| Random Forest (2 Principal Components) | PC2 | PC1 | NA | NA | NA | NA | NA | NA |
| Random Forest (4 Principal Components) | PC2 | PC3 | PC1 | PC4 | NA | NA | NA | NA |
| Random Forest (6 Principal Components) | PC2 | PC3 | PC1 | PC5 | PC4 | PC6 | NA | NA |
| Random Forest (8 Principal Components) | PC2 | PC3 | PC5 | PC1 | PC7 | PC4 | PC8 | PC6 |
All of these Random Forest models rank the second principal component as their most important feature, and most rank the third principal component second. Only the model built with just the first two principal components places the first principal component in its top two features.
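One way to connect this back to the original predictors (a supplementary sketch, not part of the original analysis) is to sort the loadings of the second principal component from the rotation matrix shown earlier; per that table, the sulfur dioxide measures and alcohol carry the largest absolute weights on PC2:
# Original variables ordered by the magnitude of their loading on PC2
sort(abs(pca.out$rotation[, "PC2"]), decreasing = TRUE)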
Support Vector Machine Model (Principal Components)
Next, we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best radial-kernel Support Vector Machine models found during tuning are below:
fn <- "svm_2pc.rds"
if (!file.exists(fn)){
svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_2pc <- svm_tune_2pc$best.model
saveRDS(svm_2pc, fn)
}else{
svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_4pc <- svm_tune_4pc$best.model
saveRDS(svm_4pc, fn)
}else{
svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_6pc <- svm_tune_6pc$best.model
saveRDS(svm_6pc, fn)
}else{
svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
svm_tune_8pc <- tune(svm,
quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_8pc <- svm_tune_8pc$best.model
saveRDS(svm_8pc, fn)
}else{
svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (2 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (4 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (6 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (8 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
| Model | cost | gamma | num_classes | classes | support_vectors_total | support_vectors_split |
|---|---|---|---|---|---|---|
| Support Vector Machine (2 Principal Components) | 10 | 2 | 2 | Low, High | 810 | 407, 403 |
| Support Vector Machine (4 Principal Components) | 0.1 | 0.5 | 2 | Low, High | 824 | 413, 411 |
| Support Vector Machine (6 Principal Components) | 1 | 3 | 2 | Low, High | 1017 | 527, 490 |
| Support Vector Machine (8 Principal Components) | 100 | 1 | 2 | Low, High | 846 | 445, 401 |
The tuned models use a range of costs (from 0.1 to 100) and gamma values (from 0.5 to 3); unlike the original-feature model, they do not converge on a single set of hyperparameters.
Model Evaluation
In this section, we evaluate the performance of all models on the test dataset. First, we generate confusion matrices for each model, which show how many wines were correctly or incorrectly classified as high- or low-quality. Each matrix includes counts of true positives, true negatives, false positives, and false negatives.
Next, we calculate key performance metrics—accuracy, precision, recall, and F1 score—for each model. These metrics provide a more quantitative comparison, helping us assess whether models built with principal components perform similarly to those using all original features. The results will indicate which modeling approach best balances correct classification and error rates.
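For reference, the four metrics below are simple functions of the confusion matrix cells, with "High" treated as the positive class. The helper below is a hypothetical illustration of those definitions; in the actual workflow, caret's confusionMatrix() computes them for us:
# Metric definitions from true/false positives and negatives
classification_metrics <- function(tp, fp, tn, fn) {
  accuracy  <- (tp + tn) / (tp + fp + tn + fn)
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)  # also called sensitivity
  f1        <- 2 * precision * recall / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall, F1 = f1)
}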
#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (8C)")
#Support Vector Machine Orig: predictions/confusion matrix
pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 2) +
scale_fill_gradient(low = "white", high = pal[1]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
labs(title = "Confusion Matrices for All Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.placement = "outside")
p6
Then we calculate the performance metrics for each of the models.
metrics <- as.data.frame(cbind(
  rbind(rf_orig_cm_complete$byClass, rf_2pc_cm_complete$byClass,
        rf_4pc_cm_complete$byClass, rf_6pc_cm_complete$byClass,
        rf_8pc_cm_complete$byClass, svm_orig_cm_complete$byClass,
        svm_2pc_cm_complete$byClass, svm_4pc_cm_complete$byClass,
        svm_6pc_cm_complete$byClass, svm_8pc_cm_complete$byClass),
  rbind(rf_orig_cm_complete$overall, rf_2pc_cm_complete$overall,
        rf_4pc_cm_complete$overall, rf_6pc_cm_complete$overall,
        rf_8pc_cm_complete$overall, svm_orig_cm_complete$overall,
        svm_2pc_cm_complete$overall, svm_4pc_cm_complete$overall,
        svm_6pc_cm_complete$overall, svm_8pc_cm_complete$overall)))
rownames(metrics) <- c("Random Forest (Original Features)","Random Forest (2 Principal Components)",
"Random Forest (4 Principal Components)","Random Forest (6 Principal Components)",
"Random Forest (8 Principal Components)","Support Vector Machine (Original Features)",
"Support Vector Machine (2 Principal Components)","Support Vector Machine (4 Principal Components)",
"Support Vector Machine (6 Principal Components)","Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
select(all_of(keep)) |>
round(3)
kable(metrics, format = "simple")
| | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest (Original Features) | 0.749 | 0.752 | 0.800 | 0.776 |
| Random Forest (2 Principal Components) | 0.638 | 0.652 | 0.711 | 0.680 |
| Random Forest (4 Principal Components) | 0.735 | 0.746 | 0.774 | 0.760 |
| Random Forest (6 Principal Components) | 0.749 | 0.743 | 0.821 | 0.780 |
| Random Forest (8 Principal Components) | 0.755 | 0.765 | 0.789 | 0.777 |
| Support Vector Machine (Original Features) | 0.744 | 0.745 | 0.800 | 0.772 |
| Support Vector Machine (2 Principal Components) | 0.638 | 0.674 | 0.642 | 0.658 |
| Support Vector Machine (4 Principal Components) | 0.718 | 0.731 | 0.758 | 0.744 |
| Support Vector Machine (6 Principal Components) | 0.709 | 0.683 | 0.863 | 0.763 |
| Support Vector Machine (8 Principal Components) | 0.707 | 0.697 | 0.811 | 0.749 |
Conclusion
The Support Vector Machine performs best with the original features: none of its principal-component variants matches the original-feature model's accuracy or F1 score, although the six-component version comes closest on F1. The Random Forest behaves differently: with six principal components it matches the original-feature model's accuracy and slightly exceeds its F1 score, and with eight components it posts the highest test-set accuracy of any model. In other words, a moderate reduction of the feature space costs the Random Forest essentially nothing, while the SVM gives up a few points of accuracy regardless of how many components it is given.
Since we only had eleven numeric predictors to begin with, reducing the feature space wasn't strictly necessary, but we now have a framework for applying PCA to datasets with many more features, where dimensionality reduction matters far more.
There are often substantial benefits to reducing the feature space, especially when businesses track many attributes for the products they sell. Jeans in particular come to mind as a product with many features (e.g., color, size, fit, style, stretch, whether it has a button fly or a zipper, whether it comes distressed). If a product line like that can be represented with four or fewer principal components instead, you eliminate a lot of noise, your models train faster, and your predictions can be as good as or better than those from models that use every feature. It is also remarkable to see high-dimensional data reduced to two dimensions, even in a classification problem like ours where we knew perfect separation of the wine-quality classes wouldn't be visible in just two dimensions.