DATA 622 : Final Project - Wine Quality Classification: Evaluating Random Forest and SVM with PCA and Original Features

Author: Rupendra Shrestha | Dec 07, 2025

Instructions

Exploratory analysis and essay

Assignment

Choose a dataset

You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, the Census Bureau, USGS, or other open data portals.

Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15 to apply in the new dataset selected.

To complete this task:

  • Describe the problem you are trying to solve.

  • Describe your dataset and what you did to prepare the data for analysis.

  • Methodologies you used for analyzing the data

  • What’s the purpose of the analysis performed

  • Make your conclusions from your analysis. Please be sure to address the business impact (it could be of any domain) of your solution.

Introduction

We load a dataset of red wine quality to predict whether a wine is “High” or “Low” quality based on chemical properties. Predicting wine quality is valuable for producers to optimize blends, manage production processes, and guide marketing decisions. We will compare classification models using all original features versus a reduced set of features derived via Principal Component Analysis (PCA). PCA can reduce correlated predictors and simplify the feature space, potentially improving predictive performance.

From a business perspective, a reliable quality predictor has three practical uses. First, it can flag batches that need further inspection before bottling, lowering the risk of costly recalls. Second, it can guide recipe adjustments to increase the share of high-quality bottles. Third, it provides marketing and pricing signals by identifying which chemical factors are most associated with premium ratings.

Technically, the analysis compares two approaches. One approach trains classifiers directly on the original features. The other reduces dimensionality with principal component analysis and then trains classifiers on the principal components. We implement a Random Forest and a Support Vector Machine for each approach so we can compare model performance, robustness to correlated inputs, and interpretability.

The dataset used for this analysis is the Red Wine Quality dataset, publicly available on Kaggle (https://www.kaggle.com/datasets/uciml/red-wine-quality-cortez-et-al-2009). It was selected because it contains detailed chemical measurements for 1,599 red wine samples.

This dashboard will walk through the data, show exploratory analysis and preprocessing decisions, present model training and evaluation, and close with recommendations tied to business outcomes.
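
The chunks that follow assume a handful of packages are attached in a setup chunk that is not shown here. Below is a minimal sketch of that setup, reconstructed from the functions used later in the analysis; the original setup chunk may differ.

# Packages assumed by the chunks below (a sketch of the setup chunk)
library(tidyverse)     # dplyr, tidyr, tibble, ggplot2
library(janitor)       # clean_names()
library(knitr)         # kable()
library(DataExplorer)  # introduce()
library(psych)         # describe()
library(caret)         # nearZeroVar(), train(), varImp(), confusionMatrix()
library(e1071)         # svm(), tune(), tune.control()
library(randomForest)  # backend for caret's method = "rf"
library(ggcorrplot)    # correlation heatmap
library(ggbiplot)      # PCA biplot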

# Set up plotting theme and palette
theme_set(theme_classic())
pal <- RColorBrewer::brewer.pal(n = 8, name = "Dark2")

# Load data
wine_qual_df <- read.csv("winequality.csv")

# Clean column names to snake_case
wine_qual_df <- janitor::clean_names(wine_qual_df)

# Rename p_h to ph for readability
wine_qual_df <- wine_qual_df |>
  rename(ph = p_h) |>
  # Reorder columns for readability
  relocate(quality, .before = fixed_acidity) |>
  relocate(ph, .before = sulphates)

Below are the first 10 observations in the dataset, and for the sake of readability, we only display the first 10 columns.

kable(wine_qual_df[1:10, 1:10], format = "simple")
quality fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density ph
5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51
5 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20
5 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26
6 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16
5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51
5 7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51
5 7.9 0.60 0.06 1.6 0.069 15 59 0.9964 3.30
7 7.3 0.65 0.00 1.2 0.065 15 21 0.9946 3.39
7 7.8 0.58 0.02 2.0 0.073 9 18 0.9968 3.36
5 7.5 0.50 0.36 6.1 0.071 17 102 0.9978 3.35

The first column, quality, is the response variable we aim to predict. Although wine quality is scored from 0 to 10, this dataset only includes ratings between 3 and 8. To support a supervised classification task, we convert the original numeric score into a binary factor. Wines rated below 6 are labeled “Low,” and wines rated 6 or higher are labeled “High.” This threshold is commonly used in wine-quality studies and helps simplify the prediction problem while preserving meaningful differences in quality.

In addition to the response variable, there are 11 numeric predictor variables:
wine_qual_df <- wine_qual_df |>
    mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
                            levels = c("Low", "High")))

classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", ")) |>
    filter(Class == "numeric")
kable(classes_summary, format = "simple")
Class Count Variables
numeric 11 alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity

Exploratory Data Analysis

Before building any models, it’s important to understand the structure of the red wine dataset and review the basic properties of the predictors and the target variable. This step helps confirm data quality, check class balance, and identify patterns that may influence model performance. Looking at distributions, correlations, and summary statistics gives a clearer sense of which chemical features differ between high- and low-quality wines and highlights where PCA may help by reducing overlap and multicollinearity among predictors.

rem <- c("discrete_columns", "continuous_columns",
         "total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
    select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
rows 1599
columns 12
all_missing_columns 0
total_missing_values 0
complete_rows 1599

There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.

qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs = nrow(wine_qual_df)
p1 <- wine_qual_df |>
    ggplot(aes(x = quality)) +
    geom_histogram(aes(color = quality, fill = quality), stat = "count") +
    geom_text(stat = "count", aes(label = paste0(round(
        after_stat(count) / obs * 100, 1), "%")),
              size = 5, color = "white", vjust = 2, fontface = "bold") + 
    scale_color_manual(values = qual_cols) +
    scale_fill_manual(values = qual_cols) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Distribution of Low- & High- Quality Wines",
         x = "QUALITY",
         y = "COUNT") +
    theme(legend.position = "none")
p1

The numbers of low-quality and high-quality wines are roughly balanced, so there is no class imbalance issue to address. We summarize the distributions of the predictors below.

rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
    select(-all_of(rem))
describe <- describe |>
    filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
mean sd median min max range
fixed_acidity 8.3196373 1.7410963 7.90000 4.60000 15.90000 11.30000
volatile_acidity 0.5278205 0.1790597 0.52000 0.12000 1.58000 1.46000
citric_acid 0.2709756 0.1948011 0.26000 0.00000 1.00000 1.00000
residual_sugar 2.5388055 1.4099281 2.20000 0.90000 15.50000 14.60000
chlorides 0.0874665 0.0470653 0.07900 0.01200 0.61100 0.59900
free_sulfur_dioxide 15.8749218 10.4601570 14.00000 1.00000 72.00000 71.00000
total_sulfur_dioxide 46.4677924 32.8953245 38.00000 6.00000 289.00000 283.00000
density 0.9967467 0.0018873 0.99675 0.99007 1.00369 0.01362
ph 3.3111132 0.1543865 3.31000 2.74000 4.01000 1.27000
sulphates 0.6581488 0.1695070 0.62000 0.33000 2.00000 1.67000
alcohol 10.4229831 1.0656676 10.20000 8.40000 14.90000 6.50000

None of the distributions appear degenerate, but we confirm there are no near-zero variance predictors nonetheless.

nzv <- nearZeroVar(wine_qual_df, names = TRUE, saveMetrics = FALSE)
length(nzv) == 0
[1] TRUE

There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.

skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
    pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
                 values_to = "VALUE")
p2 <- wine_qual_piv |>
    ggplot(aes(x = VALUE, color = quality, fill = quality)) +
    geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
                   alpha = 0.5) +
    geom_histogram(data = subset(wine_qual_piv, quality == "High"),
                   alpha = 0.5) +
    scale_color_manual(values = qual_cols) +
    scale_fill_manual(values = qual_cols) +
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
    labs(title = "Distribution of Predictors by Wine Quality",
         y = "COUNT") +
    theme(legend.position = "top")
p2

We can see that the means for alcohol and volatile_acidity differ between low-quality and high-quality wines. Some distributions are also skewed, but we do not apply any transformations, since the focus of this analysis is the impact of reducing the feature space on model performance.

We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. In the interest of ignoring clutter, only correlations greater than 0.1 (in absolute value) are displayed.

plot_corr_range <- function(df, mn = 0.1, mx = 1.0, excl = c(NA),
                            subtitle = NULL, caption = NULL) {
    palette <- RColorBrewer::brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
    
    # Main title
    main_title <- sprintf("Correlation Heatmap (|r| Between %.1f and %.1f)", mn, mx)
    
    # Correlation matrix
    r <- model.matrix(~0 + ., data = df) |>
        cor() |>
        round(2)

    is.na(r) <- abs(r) > mx
    is.na(r) <- abs(r) < mn

    if (!all(is.na(excl))) {
        r <- as.data.frame(r) |>
            select(-all_of(excl)) |>
            filter(!rownames(r) %in% excl)
    }

    p <- r |>
        ggcorrplot::ggcorrplot(show.diag = FALSE,
                               type = "lower",
                               lab = TRUE,
                               lab_size = 3,
                               tl.cex = 10,
                               tl.srt = 90,
                               colors = palette,
                               outline.color = "white") +
        labs(
            title = main_title,
            subtitle = subtitle,
            caption = caption
        ) +
        theme(
            plot.title.position = "plot",
            plot.subtitle = ggplot2::element_text(size = 10, margin = ggplot2::margin(b = 5)),
            plot.caption = ggplot2::element_text(size = 9, color = "gray40")
        )

    p
}

# Example call
p3 <- plot_corr_range(
    df = wine_qual_df,
    excl = c("qualityLow"),
    subtitle = "Only showing correlations with absolute value greater than 0.1",
    caption = "Correlations based on numeric predictors in the red wine dataset"
)
p3

Alcohol shows the strongest positive correlation with high-quality wines, and volatile acidity shows the strongest negative correlation. Smaller positive correlations appear for fixed acidity, citric acid, and sulphates, while chlorides, total sulfur dioxide, and density show negative correlations.

There are also notable predictor–predictor correlations, such as:

  • fixed acidity with citric acid, density, and pH

  • total sulfur dioxide with free sulfur dioxide

We keep all predictors in the dataset because the main goal is to compare models trained on the original features against models trained on features derived via PCA, which itself addresses correlated predictors.
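
For reference, a common way to flag highly correlated predictor pairs is caret's findCorrelation(). The sketch below is purely illustrative and is not used for the models that follow; the 0.75 cutoff is an arbitrary choice for the example.

# Sketch only: list predictors that could be dropped due to high pairwise
# correlation (|r| above an illustrative cutoff of 0.75); we keep all predictors.
cor_mat <- cor(wine_qual_df |> select(-quality))
caret::findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)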

Data Preparation

In this section, we prepare the dataset for modeling. We randomly split the data into a training set (70%) and a test set (30%) using a fixed seed to ensure reproducibility. After splitting, we check that the class distributions of high and low-quality wines are similar across the original, training, and test sets. This ensures that the models trained on the training set are representative and can be reliably evaluated on the test set.

set.seed(1006)
train_df <- wine_qual_df |> slice_sample(prop = 0.7)
test_df <- setdiff(wine_qual_df, train_df)
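
A hypothetical alternative is an index-based stratified split with caret's createDataPartition(), which preserves the class ratio by construction. We show it only as a sketch; the results in this report use the slice_sample()/setdiff() split above.

# Hypothetical alternative: stratified split by row index (not used below)
set.seed(1006)
train_idx <- caret::createDataPartition(wine_qual_df$quality, p = 0.7, list = FALSE)
train_alt <- wine_qual_df[train_idx, ]
test_alt  <- wine_qual_df[-train_idx, ]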

We confirm the class distributions are similar in the original, train, and test sets.

class_dist <- tibble(
  Quality = c("Low", "High"),
  Original = round(prop.table(table(wine_qual_df$quality)), 2),
  Train = round(prop.table(table(train_df$quality)), 2),
  Test = round(prop.table(table(test_df$quality)), 2)
)
kable(class_dist, format = "simple")
Quality Original Train Test
Low 0.47 0.47 0.46
High 0.53 0.53 0.54

The class distributions of low- and high-quality wines are consistent across the original dataset, the training set, and the test set. This indicates that the random split maintained the balance between the two classes, ensuring that the models trained on the training set will be representative and that performance evaluation on the test set will be reliable. There is no need for additional balancing techniques before modeling.

Model Building

In this section, we create predictive models to classify wines as high or low quality based on their chemical properties. We start with two widely used approaches: a Random Forest (RF) model and a Support Vector Machine (SVM) model.

  • The Random Forest model is an ensemble method that combines multiple decision trees to improve predictive accuracy and reduce overfitting.

  • The SVM model, on the other hand, identifies the optimal hyperplane that separates the two classes in the feature space.

Both models are built using all original features, which allows us to establish a baseline for comparison before exploring feature reduction techniques like Principal Component Analysis (PCA).

Random Forest Model (Original Features)

We trained a Random Forest model using all the original features to classify wine quality. The model provides a measure of each predictor’s contribution to the classification. As expected from the earlier correlation analysis, alcohol is the most important feature for predicting wine quality. Interestingly, residual_sugar has a scaled importance of 0. Because varImp rescales importances so the least important predictor maps to 0 and the most important to 100, this means residual_sugar contributed the least to the forest, not necessarily that it was never used in a split. This aligns with the idea that some predictors may have little influence on the target even if they are part of the dataset.

fn <- "rf_orig.rds"
if (!file.exists(fn)){
    rf_orig <- train(
        quality ~ ., 
        data = train_df, 
        metric = "Accuracy",
        method = "rf", 
        trControl = trainControl(method = "none")
    )
    saveRDS(rf_orig, fn)
} else {
    rf_orig <- readRDS(fn)
}

# Feature importance
rf_orig_imp <- varImp(rf_orig, scale = TRUE)$importance |>
    rownames_to_column("Predictor") |>
    arrange(desc(Overall))
rf_orig_imp$Overall <- round(rf_orig_imp$Overall, 3)
kable(rf_orig_imp, format = "simple")
Predictor Overall
alcohol 100.000
sulphates 55.745
volatile_acidity 49.291
total_sulfur_dioxide 39.393
density 28.842
fixed_acidity 19.203
chlorides 17.145
ph 12.120
citric_acid 7.938
free_sulfur_dioxide 2.028
residual_sugar 0.000

The Random Forest model confirms our earlier observations from the correlation analysis: alcohol is the strongest predictor of wine quality, while residual sugar contributes the least. This highlights that certain chemical properties carry more predictive weight than others, and the model relies primarily on these key features to distinguish between low- and high-quality wines.

Support Vector Machine Model (Original Features)

Next, we build a Support Vector Machine (SVM) model using a radial basis function (RBF) kernel. SVMs are effective for classification tasks, especially when the data is not linearly separable, because the RBF kernel can capture complex relationships between predictors and the response. We use all the original features in this model and tune hyperparameters to achieve the best classification performance. The summary below shows the performance of the tuned SVM model.
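
For context, the RBF kernel measures the similarity of two observations as K(x, x') = exp(-gamma * ||x - x'||^2). A larger gamma makes the decision boundary more local and flexible, while the cost parameter controls how heavily misclassified training points are penalized; these are the two hyperparameters tuned below.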

ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
                  gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
    svm_tune_orig <- tune(svm, quality ~ .,
                          data = train_df, kernel = "radial",
                          ranges = tune_grid, tunecontrol = ctrl)
    svm_orig <- svm_tune_orig$best.model
    saveRDS(svm_orig, fn)
}else{
    svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
    col1 <- c("call", "cost", "gamma", "num_classes", "classes",
              "support_vectors_total", "support_vectors_split")
    subset <- c("call", "cost", "gamma", "nclasses", "levels",
              "tot.nSV", "nSV")
    col2 <- svm_model[subset]
    copy <- col2
    for (i in 1:length(copy)){
        if (is.vector(copy[[i]])){
            col2[[i]] <- paste(col2[[i]], collapse = ", ")
        }
    }
    summ <- as.data.frame(cbind(col1, col2))
    rownames(summ) <- NULL
    colnames(summ) <- c("Parameter", "Value")
    summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
Parameter Value
call best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”)
cost 1
gamma 0.5
num_classes 2
classes Low, High
support_vectors_total 877
support_vectors_split 442, 435

The SVM model performed best with a cost of 1 and gamma of 0.5, capturing nonlinear relationships while using all original features.

Principal Component Analysis (PCA)

We use PCA to reduce feature dimensionality while scaling all predictors to avoid dominance by high-variance variables. The first few components capture moderate variance, so we retain up to eight for modeling. Biplots show some class separation, and we create new train and test sets with PCA features to build Random Forest and SVM models.

mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
    rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
    rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
    left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
Predictor Mean Variance
fixed_acidity 8.32 3.03
volatile_acidity 0.53 0.03
citric_acid 0.27 0.04
residual_sugar 2.54 1.99
chlorides 0.09 0.00
free_sulfur_dioxide 15.87 109.41
total_sulfur_dioxide 46.47 1082.10
density 1.00 0.00
ph 3.31 0.02
sulphates 0.66 0.03
alcohol 10.42 1.14

Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variances than the other variables, so we center and scale all predictors to prevent these two from dominating the principal components. Below is the rotation matrix, i.e. the matrix of variable loadings.

pca.out = prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
fixed_acidity 0.49 -0.11 0.12 -0.23 0.08 -0.10 0.35 -0.18 0.19 0.25 0.64
volatile_acidity -0.24 0.27 0.45 0.08 -0.22 -0.41 0.53 -0.08 -0.13 -0.37 0.00
citric_acid 0.46 -0.15 -0.24 -0.08 0.06 -0.07 -0.11 -0.38 -0.38 -0.62 -0.07
residual_sugar 0.15 0.27 -0.10 -0.37 -0.73 -0.05 -0.29 0.30 0.01 -0.09 0.18
chlorides 0.21 0.15 0.09 0.67 -0.25 -0.30 -0.37 -0.36 0.11 0.22 0.05
free_sulfur_dioxide -0.04 0.51 -0.43 -0.04 0.16 0.01 0.12 -0.20 0.64 -0.25 -0.05
total_sulfur_dioxide 0.02 0.57 -0.32 -0.03 0.22 -0.14 0.09 0.02 -0.59 0.37 0.07
density 0.40 0.23 0.34 -0.17 -0.16 0.39 0.17 -0.24 0.02 0.24 -0.57
ph -0.44 0.01 -0.06 0.00 -0.27 0.52 0.03 -0.56 -0.17 0.01 0.34
sulphates 0.24 -0.04 -0.28 0.55 -0.23 0.38 0.45 0.37 -0.06 -0.11 0.07
alcohol -0.11 -0.39 -0.47 -0.12 -0.35 -0.36 0.33 -0.22 0.04 0.30 -0.31

We generate a scree plot so that we can see the variance explained by each principal component.

scree_plot_df = as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) = c("var_explained")
scree_plot_df <- scree_plot_df |>
    rowid_to_column(var = "pc")

p4 <- scree_plot_df |>
    ggplot(aes(x = pc, y = var_explained)) +
    geom_line(color = "steelblue", size = 1.5) + 
    geom_point(size = 4, color = "darkred") +
    scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
    scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1), labels = scales::percent) +
    labs(
        title = "Variance Explained by Principal Components",
        subtitle = "Scree plot showing the proportion of variance captured by each component",
        x = "Principal Component",
        y = "Variance Explained",
        caption = "Data: Red Wine Quality Dataset"
    ) +
    theme_minimal(base_size = 15) +
    theme(
        plot.title = element_text(face = "bold", size = 14),
        plot.subtitle = element_text(size = 11, color = "gray30"),
        plot.caption = element_text(size = 12, color = "gray50"),
        axis.text = element_text(size = 12),
        axis.title = element_text(size = 12)
    )

p4

The first principal component explains a little less than 30 percent of the variance in our data, and the second principal component explains a little less than 20 percent of the variance in our data. Since those values are relatively low, we don’t expect building models using only the first two principal components will be sufficient, so we will go up to eight principal components.
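
As a quick check on that choice, the cumulative proportion of variance explained can be computed directly from the component standard deviations. The snippet below is a sketch of that check; its output is not reproduced here.

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca.out$sdev^2) / sum(pca.out$sdev^2)
round(cum_var, 3)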

For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.

p5 <- pca.out |>
    ggbiplot(obs.scale = 1, var.scale = 1,
             groups = wine_qual_df$quality,
             ellipse = TRUE,
             ellipse.alpha = 0.0) +
    theme(legend.direction = "horizontal",
          legend.position = "top") +
    labs(fill = "Quality",
         color = "Quality",
         title = "Biplot of First Two Principal Components",
         subtitle = "Shows separation of low- and high-quality wines",
        caption = "PC1 and PC2 explain ~50% of the total variance"
         ) +
    scale_color_discrete(type = qual_cols)
p5

There is a lot of overlap in low-quality and high-quality wines plotted by just the first two principal components, but we do see some separation between the classes, which is promising.

We create new versions of the train and test sets using the features derived via PCA instead of the original features.

train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
    bind_cols(train_df |> select(quality)) |>
    relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
    bind_cols(test_df |> select(quality)) |>
    relocate(quality, .before = "PC1")

Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, first four, first six, or first eight principal components.

Random Forest Model (Principal Components)

Next, we build a Random Forest model using the features derived from Principal Component Analysis. By replacing the original correlated predictors with principal components, we aim to simplify the feature space while retaining most of the variance in the data. This model will help us evaluate whether using a reduced set of components can achieve similar classification performance compared to using all original features.

fn <- "rf_2pc.rds"
if (!file.exists(fn)){
    rf_2pc <- train(quality ~ PC1 + PC2,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 1))
    saveRDS(rf_2pc, fn)
}else{
    rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
    rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_4pc, fn)
}else{
    rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
    rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_6pc, fn)
}else{
    rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
    rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_8pc, fn)
}else{
    rf_8pc <- readRDS(fn)
}

A summary of the relative importance of the principal components to each of these Random Forest models we have trained is below.

# Helper: rank predictors by scaled importance for a fitted model
rank_importance <- function(model, label) {
    varImp(model, scale = TRUE)$importance |>
        rownames_to_column("Predictor") |>
        rename(Importance = Overall) |>
        arrange(desc(Importance)) |>
        rowid_to_column(var = "Rank") |>
        mutate(Model = label) |>
        select(-Importance)
}

rf_pc_imp_df <- bind_rows(
    rank_importance(rf_2pc, "Random Forest (2 Principal Components)"),
    rank_importance(rf_4pc, "Random Forest (4 Principal Components)"),
    rank_importance(rf_6pc, "Random Forest (6 Principal Components)"),
    rank_importance(rf_8pc, "Random Forest (8 Principal Components)")
) |>
    pivot_wider(names_from = Rank, values_from = Predictor,
                names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
Model Rank_1 Rank_2 Rank_3 Rank_4 Rank_5 Rank_6 Rank_7 Rank_8
Random Forest (2 Principal Components) PC2 PC1 NA NA NA NA NA NA
Random Forest (4 Principal Components) PC2 PC3 PC1 PC4 NA NA NA NA
Random Forest (6 Principal Components) PC2 PC3 PC1 PC5 PC4 PC6 NA NA
Random Forest (8 Principal Components) PC2 PC3 PC5 PC1 PC7 PC4 PC8 PC6

All of these Random Forest models rank the second principal component as their most important feature, and most of them rank the third principal component as their second-most important feature. Only the Random Forest model built using only the first two principal components includes the first principal component in its top two features.

Support Vector Machine Model (Principal Components)

Next we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best Support Vector Machine models using radial basis kernels that we arrived at during tuning are below:

fn <- "svm_2pc.rds"
if (!file.exists(fn)){
    svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_2pc <- svm_tune_2pc$best.model
    saveRDS(svm_2pc, fn)
}else{
    svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
    svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_4pc <- svm_tune_4pc$best.model
    saveRDS(svm_4pc, fn)
}else{
    svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
    svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_6pc <- svm_tune_6pc$best.model
    saveRDS(svm_6pc, fn)
}else{
    svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
    svm_tune_8pc <- tune(svm,
                         quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_8pc <- svm_tune_8pc$best.model
    saveRDS(svm_8pc, fn)
}else{
    svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (2 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (4 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (6 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (8 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
    bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
Model cost gamma num_classes classes support_vectors_total support_vectors_split
Support Vector Machine (2 Principal Components) 10 2 2 Low, High 810 407, 403
Support Vector Machine (4 Principal Components) 0.1 0.5 2 Low, High 824 413, 411
Support Vector Machine (6 Principal Components) 1 3 2 Low, High 1017 527, 490
Support Vector Machine (8 Principal Components) 100 1 2 Low, High 846 445, 401

The tuned models use costs ranging from 0.1 to 100 and gamma values ranging from 0.5 to 3, so the best hyperparameters vary with the number of principal components included.

Model Evaluation

In this section, we evaluate the performance of all models on the test dataset. First, we generate confusion matrices for each model, which show how many wines were correctly or incorrectly classified as high- or low-quality. Each matrix includes counts of true positives, true negatives, false positives, and false negatives.

Next, we calculate key performance metrics—accuracy, precision, recall, and F1 score—for each model. These metrics provide a more quantitative comparison, helping us assess whether models built with principal components perform similarly to those using all original features. The results will indicate which modeling approach best balances correct classification and error rates.
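
For reference, the four reported metrics follow directly from the confusion-matrix cells, with "High" treated as the positive class. The small helper below is only a sketch of those definitions; caret's confusionMatrix() computes the same quantities for us.

# Sketch: how Accuracy, Precision, Recall, and F1 relate to TP/FP/FN/TN
metrics_from_counts <- function(tp, fp, fn, tn) {
    precision <- tp / (tp + fp)
    recall    <- tp / (tp + fn)
    c(Accuracy  = (tp + tn) / (tp + fp + fn + tn),
      Precision = precision,
      Recall    = recall,
      F1        = 2 * precision * recall / (precision + recall))
}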

#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
                                       positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
                               levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
                                      positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
                               levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
                                      positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
                               levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
                                      positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
                               levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
                                      positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
                               levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (8C)")

#Support Vector Machine Orig: predictions/confusion matrix

pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
                                       positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
                               levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
                                       positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
                               levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
                                       positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
                               levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
                                       positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
                               levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
                                       positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
                               levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
                svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
    ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile(col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label), vjust = 2) + 
    scale_fill_gradient(low = "white", high = pal[1]) +
    scale_x_discrete(position = "top") +
    facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
    labs(title = "Confusion Matrices for All Models") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank(),
          axis.text.y = element_text(angle = 90, hjust = 0.5),
          axis.ticks = element_blank(),
          legend.position = "bottom",
          strip.placement = "outside")
p6

Then we calculate the performance metrics for each of the models.

metrics <- as.data.frame(cbind(
    rbind(rf_orig_cm_complete$byClass, rf_2pc_cm_complete$byClass,
          rf_4pc_cm_complete$byClass, rf_6pc_cm_complete$byClass,
          rf_8pc_cm_complete$byClass, svm_orig_cm_complete$byClass,
          svm_2pc_cm_complete$byClass, svm_4pc_cm_complete$byClass,
          svm_6pc_cm_complete$byClass, svm_8pc_cm_complete$byClass),
    rbind(rf_orig_cm_complete$overall, rf_2pc_cm_complete$overall,
          rf_4pc_cm_complete$overall, rf_6pc_cm_complete$overall,
          rf_8pc_cm_complete$overall, svm_orig_cm_complete$overall,
          svm_2pc_cm_complete$overall, svm_4pc_cm_complete$overall,
          svm_6pc_cm_complete$overall, svm_8pc_cm_complete$overall)
))
rownames(metrics) <- c("Random Forest (Original Features)","Random Forest (2 Principal Components)",
                       "Random Forest (4 Principal Components)","Random Forest (6 Principal Components)",
                       "Random Forest (8 Principal Components)","Support Vector Machine (Original Features)",
                       "Support Vector Machine (2 Principal Components)","Support Vector Machine (4 Principal Components)",
                       "Support Vector Machine (6 Principal Components)","Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
    select(all_of(keep)) |>
    round(3)
kable(metrics, format = "simple")
Accuracy Precision Recall F1
Random Forest (Original Features) 0.749 0.752 0.800 0.776
Random Forest (2 Principal Components) 0.638 0.652 0.711 0.680
Random Forest (4 Principal Components) 0.735 0.746 0.774 0.760
Random Forest (6 Principal Components) 0.749 0.743 0.821 0.780
Random Forest (8 Principal Components) 0.755 0.765 0.789 0.777
Support Vector Machine (Original Features) 0.744 0.745 0.800 0.772
Support Vector Machine (2 Principal Components) 0.638 0.674 0.642 0.658
Support Vector Machine (4 Principal Components) 0.718 0.731 0.758 0.744
Support Vector Machine (6 Principal Components) 0.709 0.683 0.863 0.763
Support Vector Machine (8 Principal Components) 0.707 0.697 0.811 0.749

Conclusion

For the Support Vector Machine, the original features give the best results: none of the principal-component SVMs reach its accuracy (0.744) or F1 score (0.772), and even the closest ones (four components on accuracy, six on F1) fall short. The Random Forest is less clear-cut: with six principal components it matches the original-feature model’s accuracy (0.749) and slightly exceeds its F1 score (0.780 vs. 0.776), and with eight components it posts the highest accuracy in the table (0.755). For both model families, two components are clearly insufficient (roughly 64 percent accuracy), while four components already recover most of the performance of the full feature set.

Since the dataset has only eleven numeric predictors to begin with, reducing the feature space was not strictly necessary, but we now have a framework for applying PCA to datasets with many more features, where reducing the feature space matters far more.

There are often substantial benefits to reducing the feature space, especially when businesses track many attributes for the products they sell. Jeans in particular come to mind as a product with many features (color, size, fit, style, stretch, button-fly or zipper, distressed or not, and so on). If a product line like that can be represented with four or fewer principal components instead, a great deal of noise is eliminated, models run faster, and predictions can be as good as or better than those from models that use every feature. It is also remarkable to see high-dimensional data reduced to two dimensions, even in a classification problem like ours, where we knew perfect separation of the wine quality classes would not be visible in just two dimensions.