DATA 622 : Final Project

Author: Rupendra Shrestha | Nov 30, 2025

Instructions

Exploratory analysis and essay

Assignment

Choose a dataset

You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job, or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.Gov, the Census Bureau, USGS, or other open data portals.

Select one of the methodologies studied in weeks 1-10 and another methodology from weeks 11-15 to apply to the newly selected dataset.

To complete this task:

  • Describe the problem you are trying to solve.

  • Describe your dataset and what you did to prepare the data for analysis.

  • Describe the methodologies you used for analyzing the data.

  • Explain the purpose of the analysis performed.

  • Draw conclusions from your analysis. Please be sure to address the business impact (it could be in any domain) of your solution.

Introduction

We load a dataset of red wine quality and use it to compare the performance of two classification models, Random Forest and Support Vector Machine, first using all the original features, then using a reduced set of features derived via Principal Component Analysis (PCA). This will give us insight into how much the feature space can be reduced without losing predictive power, and we may even see predictive gains.

library(tidyverse)      # dplyr, tidyr, tibble, ggplot2
library(RColorBrewer)   # brewer.pal
library(snakecase)      # to_snake_case
library(knitr)          # kable
library(caret)          # train, nearZeroVar, confusionMatrix
library(e1071)          # svm, tune, tune.control
library(psych)          # describe
library(DataExplorer)   # introduce
library(ggcorrplot)     # ggcorrplot
library(ggbiplot)       # ggbiplot
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 8, name = "Dark2")
my_url1 <- "winequality.csv"
wine_qual_df <- read.csv(my_url1)
cols <- to_snake_case(colnames(wine_qual_df))
colnames(wine_qual_df) <- cols
wine_qual_df <- wine_qual_df |>
    mutate(ph = p_h) |>
    select(-p_h) |>
    relocate(quality, .before = fixed_acidity) |>
    relocate(ph, .before = sulphates)

Below are the first 10 observations in the dataset, and for the sake of readability, we only display the first 10 columns.

kable(wine_qual_df[1:10, 1:10], format = "simple")
quality fixed_acidity volatile_acidity citric_acid residual_sugar chlorides free_sulfur_dioxide total_sulfur_dioxide density ph
5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51
5 7.8 0.88 0.00 2.6 0.098 25 67 0.9968 3.20
5 7.8 0.76 0.04 2.3 0.092 15 54 0.9970 3.26
6 11.2 0.28 0.56 1.9 0.075 17 60 0.9980 3.16
5 7.4 0.70 0.00 1.9 0.076 11 34 0.9978 3.51
5 7.4 0.66 0.00 1.8 0.075 13 40 0.9978 3.51
5 7.9 0.60 0.06 1.6 0.069 15 59 0.9964 3.30
7 7.3 0.65 0.00 1.2 0.065 15 21 0.9946 3.39
7 7.8 0.58 0.02 2.0 0.073 9 18 0.9968 3.36
5 7.5 0.50 0.36 6.1 0.071 17 102 0.9978 3.35

The first column is the response variable that we will attempt to predict: quality. Wines were rated on a scale from 0 to 10, but the dataset doesn’t actually include any observations with ratings less than three or higher than eight. We recode quality as a binary factor and replace values less than six with “Low” and values greater than or equal to six with “High.”

In addition to the response variable, there are 11 numeric predictor variables:
wine_qual_df <- wine_qual_df |>
    mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
                            levels = c("Low", "High")))

classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
    rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
    group_by(Class) |>
    summarize(Count = n(),
              Variables = paste(sort(unique(Variable)),collapse=", ")) |>
    filter(Class == "numeric")
kable(classes_summary, format = "simple")
Class Count Variables
numeric 11 alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity

Exploratory Data Analysis

We check for any missing values within the dataset.

rem <- c("discrete_columns", "continuous_columns",
         "total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
    select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
rows 1599
columns 12
all_missing_columns 0
total_missing_values 0
complete_rows 1599

There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.

qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs <- nrow(wine_qual_df)
p1 <- wine_qual_df |>
    ggplot(aes(x = quality)) +
    geom_bar(aes(color = quality, fill = quality)) +
    geom_text(stat = "count", aes(label = paste0(round(
        after_stat(count) / obs * 100, 1), "%")),
              size = 5, color = "white", vjust = 2, fontface = "bold") +
    scale_color_manual(values = qual_cols) +
    scale_fill_manual(values = qual_cols) +
    scale_y_continuous(labels = scales::comma) +
    labs(title = "Distribution of Low- & High-Quality Wines",
         x = "QUALITY",
         y = "COUNT") +
    theme(legend.position = "none")
p1

The numbers of low-quality and high-quality wines are roughly balanced, so there is no class imbalance issue to address here. We summarize the distributions of the predictors below.

rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
    select(-all_of(rem))
describe <- describe |>
    filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
mean sd median min max range
fixed_acidity 8.3196373 1.7410963 7.90000 4.60000 15.90000 11.30000
volatile_acidity 0.5278205 0.1790597 0.52000 0.12000 1.58000 1.46000
citric_acid 0.2709756 0.1948011 0.26000 0.00000 1.00000 1.00000
residual_sugar 2.5388055 1.4099281 2.20000 0.90000 15.50000 14.60000
chlorides 0.0874665 0.0470653 0.07900 0.01200 0.61100 0.59900
free_sulfur_dioxide 15.8749218 10.4601570 14.00000 1.00000 72.00000 71.00000
total_sulfur_dioxide 46.4677924 32.8953245 38.00000 6.00000 289.00000 283.00000
density 0.9967467 0.0018873 0.99675 0.99007 1.00369 0.01362
ph 3.3111132 0.1543865 3.31000 2.74000 4.01000 1.27000
sulphates 0.6581488 0.1695070 0.62000 0.33000 2.00000 1.67000
alcohol 10.4229831 1.0656676 10.20000 8.40000 14.90000 6.50000

None of the distributions appear degenerate, but we confirm there are no near-zero variance predictors nonetheless.

nzv <- nearZeroVar(wine_qual_df, names = TRUE, saveMetrics = FALSE)
length(nzv) == 0
[1] TRUE

There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.

skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
    pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
                 values_to = "VALUE")
p2 <- wine_qual_piv |>
    ggplot(aes(x = VALUE, color = quality, fill = quality)) +
    geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
                   alpha = 0.5) +
    geom_histogram(data = subset(wine_qual_piv, quality == "High"),
                   alpha = 0.5) +
    scale_color_manual(values = qual_cols) +
    scale_fill_manual(values = qual_cols) +
    scale_y_continuous(labels = scales::comma) +
    facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
    labs(title = "Distribution of Predictors",
         y = "COUNT") +
    theme(legend.position = "top")
p2

We can see that the means of alcohol and volatile_acidity differ between low-quality and high-quality wines. We can also see some skewed distributions, but we won't be making any transformations, since the focus of our analysis is the impact of reducing the feature space on model performance.

We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.

plot_corr_range <- function(df, mn = 0.1, mx = 1.0, excl = c(NA)){
    palette <- brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
    tit <- sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
    r <- model.matrix(~0+., data = df) |>
        cor() |>
        round(digits = 2)
    # Blank out correlations outside the [mn, mx] band (absolute value)
    is.na(r) <- abs(r) > mx
    is.na(r) <- abs(r) < mn
    if (!all(is.na(excl))){   # all() keeps this valid when excl has length > 1
        r <- as.data.frame(r) |>
            select(-all_of(excl)) |>
            filter(!rownames(r) %in% excl)
    }
    p <- r |>
        ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
                   lab_size = 3, tl.cex = 10, tl.srt = 90,
                   colors = palette, outline.color = "white") +
        labs(title = tit) +
        theme(plot.title.position = "plot")
    p
}
excl <- c("qualityLow")
p3 <- plot_corr_range(df = wine_qual_df, excl = excl)
p3
excl <- c("qualityLow")
p3 <- plot_corr_range(df = wine_qual_df, excl = excl)
p3

We see that alcohol has the largest positive correlation with high-quality wines, and volatile_acidity has the largest negative correlation with high-quality wines. To a lesser degree, fixed_acidity, citric_acid, and sulphates are also positively correlated with high-quality wines, and chlorides, total_sulfur_dioxide, and density are also negatively correlated with high-quality wines.

We also see high correlations between some of our predictors:

  • fixed_acidity and each of citric_acid, density, and ph

  • total_sulfur_dioxide and free_sulfur_dioxide

Note that we won't be eliminating any predictors from consideration, for the same reason we aren't making any transformations: our primary interest is the effect of reducing the feature space from the original features to principal components.

Data Preparation

We split the data into train and test sets.

set.seed(1006)
sample <- sample(nrow(wine_qual_df),
                 round(nrow(wine_qual_df) * 0.7),
                 replace = FALSE)
train_df <- wine_qual_df[sample, ]
test_df <- wine_qual_df[-sample, ]
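As an aside, a stratified split would guarantee matching class proportions in both partitions by construction. Below is a minimal sketch using caret's createDataPartition (caret is already loaded for train); the names strat_train and strat_test are illustrative, and we keep our simple random split for the rest of the analysis.

set.seed(1006)
# createDataPartition samples within each level of quality,
# preserving the Low/High proportions in both partitions
idx <- createDataPartition(wine_qual_df$quality, p = 0.7, list = FALSE)
strat_train <- wine_qual_df[idx, ]
strat_test <- wine_qual_df[-idx, ]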

We confirm the class distributions are similar in the original, train, and test sets.

dist1 <- as.data.frame(round(prop.table(table(select(wine_qual_df, quality))), 2))
colnames(dist1) <- c("Quality", "Original Freq")
dist2 <- as.data.frame(round(prop.table(table(select(train_df, quality))), 2))
colnames(dist2) <- c("Quality", "Train Freq")
dist3 <- as.data.frame(round(prop.table(table(select(test_df, quality))), 2))
colnames(dist3) <- c("Quality", "Test Freq")
class_dist <- dist1 |>
    left_join(dist2, by = join_by(Quality)) |>
    left_join(dist3, by = join_by(Quality))
kable(class_dist, format = "simple")
Quality Original Freq Train Freq Test Freq
Low 0.47 0.47 0.45
High 0.53 0.53 0.55

The class distributions are indeed quite similar across the three sets.

Model Building

We build our first set of models: a Random Forest model and a Support Vector Machine model using all the original features.

Random Forest Model (Original Features)

A summary of the relative importance of the original features to the Random Forest model we have trained is below.

fn <- "rf_orig.rds"
if (!file.exists(fn)){
    rf_orig <- train(quality ~ ., data = train_df, metric = "Accuracy",
                     method = "rf", trControl = trainControl(method = "none"),
                     tuneGrid = expand.grid(.mtry = 3))
    saveRDS(rf_orig, fn)
}else{
    rf_orig <- readRDS(fn)
}
rf_orig_imp <- varImp(rf_orig, scale = TRUE)
rf_orig_imp <- rf_orig_imp$importance |>
    rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_orig_imp) <- cols
rf_orig_imp <- rf_orig_imp |>
    arrange(desc(Importance))
kable(rf_orig_imp, format = "simple")
Predictor Importance
alcohol 100.000000
sulphates 55.744989
volatile_acidity 49.290579
total_sulfur_dioxide 39.393053
density 28.842431
fixed_acidity 19.203067
chlorides 17.144795
ph 12.119840
citric_acid 7.937798
free_sulfur_dioxide 2.027924
residual_sugar 0.000000

Unsurprisingly, the Random Forest model estimates alcohol to be the most important feature for predicting the response variable, consistent with what we saw in the correlation plot. The relative importance estimate for residual_sugar is 0, indicating it was not used in any of the trees.

Support Vector Machine Model (Original Features)

A summary of the best Support Vector Machine model using a radial basis kernel and the original features that we arrived at during tuning is below:

ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
                  gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
    svm_tune_orig <- tune(svm, quality ~ .,
                          data = train_df, kernel = "radial",
                          ranges = tune_grid, tunecontrol = ctrl)
    svm_orig <- svm_tune_orig$best.model
    saveRDS(svm_orig, fn)
}else{
    svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
    col1 <- c("call", "cost", "gamma", "num_classes", "classes",
              "support_vectors_total", "support_vectors_split")
    subset <- c("call", "cost", "gamma", "nclasses", "levels",
              "tot.nSV", "nSV")
    col2 <- svm_model[subset]
    copy <- col2
    for (i in 1:length(copy)){
        if (is.vector(copy[[i]])){
            col2[[i]] <- paste(col2[[i]], collapse = ", ")
        }
    }
    summ <- as.data.frame(cbind(col1, col2))
    rownames(summ) <- NULL
    colnames(summ) <- c("Parameter", "Value")
    summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
Parameter Value
call best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”)
cost 1
gamma 0.5
num_classes 2
classes Low, High
support_vectors_total 877
support_vectors_split 442, 435

It uses a cost of 1 and a gamma of 0.5.
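For context on these two hyperparameters: with a radial basis kernel, the SVM measures similarity between observations x and x′ as K(x, x′) = exp(−γ‖x − x′‖²), so a larger gamma makes the kernel more local and the decision boundary more flexible, while the cost parameter sets the penalty for margin violations, with larger costs tolerating fewer misclassified training points.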

Principal Component Analysis (PCA)

We now perform PCA. First, we check the mean and variance of all predictors to confirm we need to center and scale them.

mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
    rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
    rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
    left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
Predictor Mean Variance
fixed_acidity 8.32 3.03
volatile_acidity 0.53 0.03
citric_acid 0.27 0.04
residual_sugar 2.54 1.99
chlorides 0.09 0.00
free_sulfur_dioxide 15.87 109.41
total_sulfur_dioxide 46.47 1082.10
density 1.00 0.00
ph 3.31 0.02
sulphates 0.66 0.03
alcohol 10.42 1.14

Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variances than the other variables, so we center and scale all predictors to prevent these two from dominating the principal components. Below is the rotation, i.e., the matrix of variable loadings.

pca.out <- prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
PC1 PC2 PC3 PC4 PC5 PC6 PC7 PC8 PC9 PC10 PC11
fixed_acidity 0.49 -0.11 0.12 -0.23 0.08 -0.10 0.35 -0.18 0.19 0.25 0.64
volatile_acidity -0.24 0.27 0.45 0.08 -0.22 -0.41 0.53 -0.08 -0.13 -0.37 0.00
citric_acid 0.46 -0.15 -0.24 -0.08 0.06 -0.07 -0.11 -0.38 -0.38 -0.62 -0.07
residual_sugar 0.15 0.27 -0.10 -0.37 -0.73 -0.05 -0.29 0.30 0.01 -0.09 0.18
chlorides 0.21 0.15 0.09 0.67 -0.25 -0.30 -0.37 -0.36 0.11 0.22 0.05
free_sulfur_dioxide -0.04 0.51 -0.43 -0.04 0.16 0.01 0.12 -0.20 0.64 -0.25 -0.05
total_sulfur_dioxide 0.02 0.57 -0.32 -0.03 0.22 -0.14 0.09 0.02 -0.59 0.37 0.07
density 0.40 0.23 0.34 -0.17 -0.16 0.39 0.17 -0.24 0.02 0.24 -0.57
ph -0.44 0.01 -0.06 0.00 -0.27 0.52 0.03 -0.56 -0.17 0.01 0.34
sulphates 0.24 -0.04 -0.28 0.55 -0.23 0.38 0.45 0.37 -0.06 -0.11 0.07
alcohol -0.11 -0.39 -0.47 -0.12 -0.35 -0.36 0.33 -0.22 0.04 0.30 -0.31

We generate a scree plot so that we can see the variance explained by each principal component.

scree_plot_df <- as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) <- c("var_explained")
scree_plot_df <- scree_plot_df |>
    rowid_to_column(var = "pc")
p4 <- scree_plot_df |>
    ggplot(aes(x = pc, y = var_explained)) +
    geom_line() + 
    geom_point(size = 2) +
    scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
    scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
    labs(x = "Principal Component",
         y = "Variance Explained",
         title = "Scree Plot")
p4

The first principal component explains a little less than 30 percent of the variance in our data, and the second principal component explains a little less than 20 percent of the variance in our data. Since those values are relatively low, we don’t expect building models using only the first two principal components will be sufficient, so we will go up to eight principal components.
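To make that choice concrete, the cumulative proportion of variance explained can be read directly off the PCA object; a quick sketch (output omitted):

# Cumulative proportion of variance explained by the first k components
cum_var <- cumsum(pca.out$sdev^2) / sum(pca.out$sdev^2)
round(cum_var, 3)
# summary(pca.out) reports the same proportions component by component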

For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.

p5 <- pca.out |>
    ggbiplot(obs.scale = 1, var.scale = 1,
             groups = wine_qual_df$quality,
             ellipse = TRUE,
             ellipse.alpha = 0.0) +
    theme(legend.direction = "horizontal",
          legend.position = "top") +
    labs(fill = "Quality",
         color = "Quality",
         title = "Biplot of First Two Principal Components") +
    scale_color_discrete(type = qual_cols)
p5

Indeed, there is a lot of overlap in low-quality and high-quality wines plotted by just the first two principal components, but we do see some separation between the classes, which is promising.

We create new versions of the train and test sets using the features derived via PCA instead of the original features.

train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
    bind_cols(train_df |> select(quality)) |>
    relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
    bind_cols(test_df |> select(quality)) |>
    relocate(quality, .before = "PC1")
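One caveat: pca.out was fit on the full dataset before the split, so the test observations influence the rotation applied to the training set. A leakage-free alternative (a sketch, assuming the same train_df and test_df; pca_train and the *_pca2 names are illustrative) fits PCA on the training predictors only and projects both sets with that rotation:

# Fit PCA on the training predictors only...
pca_train <- prcomp(train_df |> select(-quality), scale = TRUE)
# ...then apply the training-set rotation to both partitions
train_df_pca2 <- as.data.frame(predict(pca_train, train_df)) |>
    bind_cols(train_df |> select(quality))
test_df_pca2 <- as.data.frame(predict(pca_train, test_df)) |>
    bind_cols(test_df |> select(quality))

We proceed with the full-data rotation so our models stay consistent with the scree plot and biplot above.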

Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, first four, first six, or first eight principal components.

Random Forest Model (Principal Components)

First, we train four Random Forest models using increasing numbers of principal components.

fn <- "rf_2pc.rds"
if (!file.exists(fn)){
    rf_2pc <- train(quality ~ PC1 + PC2,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 1))
    saveRDS(rf_2pc, fn)
}else{
    rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
    rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_4pc, fn)
}else{
    rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
    rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_6pc, fn)
}else{
    rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
    rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                    data = train_df_pca, metric = "Accuracy",
                    method = "rf", trControl = trainControl(method = "none"),
                    tuneGrid = expand.grid(.mtry = 2))
    saveRDS(rf_8pc, fn)
}else{
    rf_8pc <- readRDS(fn)
}
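The four chunks above differ only in the number of components and the mtry value; as a more compact sketch (assuming the same train_df_pca and caret setup; rf_pc_models is an illustrative name), they could be trained in a single loop:

# Train one Random Forest per PC count, mirroring the chunks above
pc_counts <- c(2, 4, 6, 8)
rf_pc_models <- lapply(pc_counts, function(k) {
    form <- reformulate(paste0("PC", 1:k), response = "quality")
    train(form, data = train_df_pca, metric = "Accuracy",
          method = "rf", trControl = trainControl(method = "none"),
          # floor(sqrt(k)) reproduces the mtry values used above: 1, 2, 2, 2
          tuneGrid = expand.grid(.mtry = floor(sqrt(k))))
})
names(rf_pc_models) <- paste0("rf_", pc_counts, "pc")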

A summary of the relative importance of the principal components to each of these Random Forest models we have trained is below.

rf_2pc_imp <- varImp(rf_2pc, scale = TRUE)
rf_2pc_imp <- rf_2pc_imp$importance |>
    rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_2pc_imp) <- cols
rf_2pc_imp <- rf_2pc_imp |>
    arrange(desc(Importance)) |>
    rowid_to_column(var = "Rank") |>
    mutate(Model = "Random Forest (2 Principal Components)") |>
    select(-Importance)
rf_4pc_imp <- varImp(rf_4pc, scale = TRUE)
rf_4pc_imp <- rf_4pc_imp$importance |>
    rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_4pc_imp) <- cols
rf_4pc_imp <- rf_4pc_imp |>
    arrange(desc(Importance)) |>
    rowid_to_column(var = "Rank") |>
    mutate(Model = "Random Forest (4 Principal Components)") |>
    select(-Importance)
rf_6pc_imp <- varImp(rf_6pc, scale = TRUE)
rf_6pc_imp <- rf_6pc_imp$importance |>
    rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_6pc_imp) <- cols
rf_6pc_imp <- rf_6pc_imp |>
    arrange(desc(Importance)) |>
    rowid_to_column(var = "Rank") |>
    mutate(Model = "Random Forest (6 Principal Components)") |>
    select(-Importance)
rf_8pc_imp <- varImp(rf_8pc, scale = TRUE)
rf_8pc_imp <- rf_8pc_imp$importance |>
    rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_8pc_imp) <- cols
rf_8pc_imp <- rf_8pc_imp |>
    arrange(desc(Importance)) |>
    rowid_to_column(var = "Rank") |>
    mutate(Model = "Random Forest (8 Principal Components)") |>
    select(-Importance)
rf_pc_imp_df <- rf_2pc_imp |>
    bind_rows(rf_4pc_imp, rf_6pc_imp, rf_8pc_imp) |>
    pivot_wider(names_from = Rank, values_from = Predictor,
                names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
Model Rank_1 Rank_2 Rank_3 Rank_4 Rank_5 Rank_6 Rank_7 Rank_8
Random Forest (2 Principal Components) PC2 PC1 NA NA NA NA NA NA
Random Forest (4 Principal Components) PC2 PC3 PC1 PC4 NA NA NA NA
Random Forest (6 Principal Components) PC2 PC3 PC1 PC5 PC4 PC6 NA NA
Random Forest (8 Principal Components) PC2 PC3 PC5 PC1 PC7 PC4 PC8 PC6

All of these Random Forest models rank the second principal component as their most important feature, and most of them rank the third principal component as their second-most important feature. Only the Random Forest model built using only the first two principal components includes the first principal component in its top two features.

Support Vector Machine Model (Principal Components)

Next we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best Support Vector Machine models using radial basis kernels that we arrived at during tuning are below:

fn <- "svm_2pc.rds"
if (!file.exists(fn)){
    svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_2pc <- svm_tune_2pc$best.model
    saveRDS(svm_2pc, fn)
}else{
    svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
    svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_4pc <- svm_tune_4pc$best.model
    saveRDS(svm_4pc, fn)
}else{
    svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
    svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_6pc <- svm_tune_6pc$best.model
    saveRDS(svm_6pc, fn)
}else{
    svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
    svm_tune_8pc <- tune(svm,
                         quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                         data = train_df_pca, kernel = "radial",
                         ranges = tune_grid, tunecontrol = ctrl)
    svm_8pc <- svm_tune_8pc$best.model
    saveRDS(svm_8pc, fn)
}else{
    svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (2 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (4 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (6 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
    filter(Parameter != "call") |>
    mutate(Model = "Support Vector Machine (8 Principal Components)") |>
    pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
    bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
Model cost gamma num_classes classes support_vectors_total support_vectors_split
Support Vector Machine (2 Principal Components) 10 2 2 Low, High 810 407, 403
Support Vector Machine (4 Principal Components) 0.1 0.5 2 Low, High 824 413, 411
Support Vector Machine (6 Principal Components) 1 3 2 Low, High 1017 527, 490
Support Vector Machine (8 Principal Components) 100 1 2 Low, High 846 445, 401

They use various costs between 0.1 and 100, and gammas ranging from 0.5 to 3.
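As with the Random Forest chunks, these four tuning runs follow one pattern and could be collapsed into a loop (a sketch, assuming tune_grid and ctrl from above; svm_pc_models is an illustrative name):

# Tune one radial SVM per PC count and keep each best model
svm_pc_models <- lapply(c(2, 4, 6, 8), function(k) {
    form <- reformulate(paste0("PC", 1:k), response = "quality")
    tune(svm, form, data = train_df_pca, kernel = "radial",
         ranges = tune_grid, tunecontrol = ctrl)$best.model
})
names(svm_pc_models) <- paste0("svm_", c(2, 4, 6, 8), "pc")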

Model Evaluation

We make predictions on the test data using all models, and we construct confusion matrices and calculate a variety of performance metrics for them.

First, we look at the confusion matrices for each of the models.

#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
                                       positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
                               levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
                                      positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
                               levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
                                      positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
                               levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
                                      positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
                               levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
                                      positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
                               levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "RF (8C)")

#Support Vector Machine Orig: predictions/confusion matrix

pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
                                       positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
                               levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
                                       positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
                               levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
                                       positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
                               levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
                                       positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
                               levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
                                       positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
                               levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
    mutate(
        Label = case_when(
            Prediction == "Low" & Reference == "Low" ~ "TN",
            Prediction == "High" & Reference == "High" ~ "TP",
            Prediction == "Low" & Reference == "High" ~ "FN",
            Prediction == "High" & Reference == "Low" ~ "FP"),
        Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
                svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
    ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
    geom_tile(col = "black") +
    geom_text(aes(label = Freq)) +
    geom_text(aes(label = Label), vjust = 2) + 
    scale_fill_gradient(low = "white", high = pal[1]) +
    scale_x_discrete(position = "top") +
    facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
    labs(title = "Confusion Matrices for All Models") +
    theme(axis.line.x = element_blank(),
          axis.line.y = element_blank(),
          axis.text.y = element_text(angle = 90, hjust = 0.5),
          axis.ticks = element_blank(),
          legend.position = "bottom",
          strip.placement = "outside")
p6
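The ten prediction/confusion-matrix blocks above all follow the same pattern; a single helper (a sketch, assuming the fitted models and test sets from above; tidy_cm is an illustrative name) could generate each labeled frame:

# Build the labeled confusion-matrix frame used in the plot above
tidy_cm <- function(model, newdata, model_label) {
    cm <- confusionMatrix(predict(model, newdata), newdata$quality,
                          positive = "High")
    df <- as.data.frame(cm$table)
    df$Reference <- factor(df$Reference, levels = rev(levels(df$Reference)))
    df |>
        mutate(Label = case_when(
                   Prediction == "Low" & Reference == "Low" ~ "TN",
                   Prediction == "High" & Reference == "High" ~ "TP",
                   Prediction == "Low" & Reference == "High" ~ "FN",
                   Prediction == "High" & Reference == "Low" ~ "FP"),
               Model = model_label)
}
# e.g. rf_orig_cm <- tidy_cm(rf_orig, test_df, "RF (Orig. Feat.)")

Note that the complete confusionMatrix objects are still needed for the byClass and overall metrics calculated next.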

Then we calculate the performance metrics for each of the models.

metrics <- as.data.frame(cbind(rbind(rf_orig_cm_complete$byClass,
                                     rf_2pc_cm_complete$byClass,
                                     rf_4pc_cm_complete$byClass,
                                     rf_6pc_cm_complete$byClass,
                                     rf_8pc_cm_complete$byClass,
                                     svm_orig_cm_complete$byClass,
                                     svm_2pc_cm_complete$byClass,
                                     svm_4pc_cm_complete$byClass,
                                     svm_6pc_cm_complete$byClass,
                                     svm_8pc_cm_complete$byClass),
                               rbind(rf_orig_cm_complete$overall,
                                     rf_2pc_cm_complete$overall,
                                     rf_4pc_cm_complete$overall,
                                     rf_6pc_cm_complete$overall,
                                     rf_8pc_cm_complete$overall,
                                     svm_orig_cm_complete$overall,
                                     svm_2pc_cm_complete$overall,
                                     svm_4pc_cm_complete$overall,
                                     svm_6pc_cm_complete$overall,
                                     svm_8pc_cm_complete$overall)))
rownames(metrics) <- c("Random Forest (Original Features)",
                       "Random Forest (2 Principal Components)",
                       "Random Forest (4 Principal Components)",
                       "Random Forest (6 Principal Components)",
                       "Random Forest (8 Principal Components)",
                       "Support Vector Machine (Original Features)",
                       "Support Vector Machine (2 Principal Components)",
                       "Support Vector Machine (4 Principal Components)",
                       "Support Vector Machine (6 Principal Components)",
                       "Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
    select(all_of(keep)) |>
    round(3)
kable(metrics, format = "simple")
Accuracy Precision Recall F1
Random Forest (Original Features) 0.810 0.814 0.848 0.831
Random Forest (2 Principal Components) 0.725 0.736 0.776 0.756
Random Forest (4 Principal Components) 0.794 0.808 0.817 0.813
Random Forest (6 Principal Components) 0.810 0.805 0.863 0.833
Random Forest (8 Principal Components) 0.812 0.824 0.837 0.830
Support Vector Machine (Original Features) 0.783 0.787 0.829 0.807
Support Vector Machine (2 Principal Components) 0.646 0.688 0.646 0.667
Support Vector Machine (4 Principal Components) 0.715 0.744 0.730 0.737
Support Vector Machine (6 Principal Components) 0.773 0.747 0.886 0.810
Support Vector Machine (8 Principal Components) 0.775 0.765 0.852 0.806

Conclusion

The Support Vector Machine performs best on this dataset with the original features, and the original-feature Random Forest is essentially tied with its principal-component counterparts: with six principal components it matches the original model's Accuracy and F1 Score, and with eight it even edges ahead on Accuracy (0.812 vs. 0.810). Using only four principal components already gets the Random Forest close to the original-feature model, whereas the Support Vector Machine requires eight principal components to approach the Accuracy and F1 Score of the original-feature version.

Since we only had eleven numeric predictors to begin with, reducing the feature space wasn't imperative here, but we now have a framework for performing PCA on datasets with many more features, where reducing the feature space matters much more.

There are often substantial benefits to reducing the feature space, especially when businesses track many features for the products they sell. Jeans in particular come to mind as a product with many features (e.g., color, size, fit, style, stretch, whether they have a button fly or a zipper, whether they come distressed). If you can represent a product line like that with four or fewer principal components instead, you eliminate a lot of noise, your models run faster, and your predictions can be as good as or better than those from models that use every feature. It is also remarkable to be able to see high-dimensional data reduced to two dimensions, even in a classification problem like ours where we knew perfect wine-quality class separation wouldn't be visible in just two dimensions.