library(caret)
library(DataExplorer)
library(e1071)
library(ggbiplot)
library(ggcorrplot)
library(knitr)
library(psych)
library(randomForest)
library(RColorBrewer)
library(snakecase)
library(tidyverse)
We load a dataset of red wine quality. We will use this dataset to compare the performance of two classification models, first using all the original features, then using a reduced set of features derived via Principal Component Analysis (PCA). This will give us insight into how much the feature space can be reduced without losing predictive power, and we may even see predictive gains.
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 8, name = "Dark2")
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework4/main/winequality-red.csv"
wine_qual_df <- read.csv(my_url1)
cols <- to_snake_case(colnames(wine_qual_df))
colnames(wine_qual_df) <- cols
wine_qual_df <- wine_qual_df |>
mutate(ph = p_h) |>
select(-p_h) |>
relocate(quality, .before = fixed_acidity) |>
relocate(ph, .before = sulphates)
Below are the first 10 observations in the dataset; for the sake of readability, we display only the first 10 columns.
kable(wine_qual_df[1:10, 1:10], format = "simple")
| quality | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 |
| 5 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 |
| 6 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 |
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 |
| 5 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 |
| 7 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9 | 18 | 0.9968 | 3.36 |
| 5 | 7.5 | 0.50 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 |
The first column is the response variable that we will attempt to predict: quality. Wines were rated on a scale from 0 to 10, but the dataset includes no observations rated below three or above eight. We recode quality as a binary factor, replacing values less than six with “Low” and values greater than or equal to six with “High.”
wine_qual_df <- wine_qual_df |>
mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
levels = c("Low", "High")))
In addition to the response variable, there are 11 numeric predictor variables:
classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "numeric")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| numeric | 11 | alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity |
We check for any missing values within the dataset.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| Metric | Value |
|---|---|
| rows | 1599 |
| columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.
qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs <- nrow(wine_qual_df)
p1 <- wine_qual_df |>
ggplot(aes(x = quality)) +
geom_bar(aes(color = quality, fill = quality)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Low- & High- Quality Wines",
x = "QUALITY",
y = "COUNT") +
theme(legend.position = "none")
p1
There are pretty similar numbers of low-quality and high-quality wines in the dataset, so there is no class imbalance issue to address here. We summarize the distributions of the predictors below.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
|  | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| fixed_acidity | 8.3196373 | 1.7410963 | 7.90000 | 4.60000 | 15.90000 | 11.30000 |
| volatile_acidity | 0.5278205 | 0.1790597 | 0.52000 | 0.12000 | 1.58000 | 1.46000 |
| citric_acid | 0.2709756 | 0.1948011 | 0.26000 | 0.00000 | 1.00000 | 1.00000 |
| residual_sugar | 2.5388055 | 1.4099281 | 2.20000 | 0.90000 | 15.50000 | 14.60000 |
| chlorides | 0.0874665 | 0.0470653 | 0.07900 | 0.01200 | 0.61100 | 0.59900 |
| free_sulfur_dioxide | 15.8749218 | 10.4601570 | 14.00000 | 1.00000 | 72.00000 | 71.00000 |
| total_sulfur_dioxide | 46.4677924 | 32.8953245 | 38.00000 | 6.00000 | 289.00000 | 283.00000 |
| density | 0.9967467 | 0.0018873 | 0.99675 | 0.99007 | 1.00369 | 0.01362 |
| ph | 3.3111132 | 0.1543865 | 3.31000 | 2.74000 | 4.01000 | 1.27000 |
| sulphates | 0.6581488 | 0.1695070 | 0.62000 | 0.33000 | 2.00000 | 1.67000 |
| alcohol | 10.4229831 | 1.0656676 | 10.20000 | 8.40000 | 14.90000 | 6.50000 |
None of the distributions appear degenerate, but we confirm there are no near-zero variance predictors nonetheless.
nzv <- nearZeroVar(wine_qual_df, names = TRUE, saveMetrics = FALSE)
length(nzv) == 0
## [1] TRUE
There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.
skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- wine_qual_piv |>
ggplot(aes(x = VALUE, color = quality, fill = quality)) +
geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
alpha = 0.5) +
geom_histogram(data = subset(wine_qual_piv, quality == "High"),
alpha = 0.5) +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
labs(title = "Distribution of Predictors",
y = "COUNT") +
theme(legend.position = "top")
p2
We can see the means for alcohol and volatile_acidity differ between low-quality and high-quality wines. We can also see some skewed distributions, but we won’t be making any transformations, since the focus of our analysis is the impact of reducing the feature space on model performance.
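Had we wanted to address the skew, caret’s preProcess() could estimate a Yeo-Johnson transformation for every numeric predictor. A minimal sketch of the approach we are deliberately skipping (pp_yj and wine_qual_yj are illustrative names):
# Not run: estimate and apply a Yeo-Johnson transformation to each predictor
pp_yj <- preProcess(wine_qual_df |> select(-quality), method = "YeoJohnson")
wine_qual_yj <- predict(pp_yj, wine_qual_df |> select(-quality))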
We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.
plot_corr_range <- function(df, mn=0.1, mx=1.0, excl=c(NA)){
palette <- brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
tit <- sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
r <- model.matrix(~0+., data = df) |>
cor() |>
round(digits=2)
is.na(r) <- abs(r) > mx
is.na(r) <- abs(r) < mn
if (!all(is.na(excl))){ # skip the exclusion step when excl is NA (the default)
r <- as.data.frame(r) |>
select(-all_of(excl)) |>
filter(!rownames(r) %in% excl)
}
p <- r |>
ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
lab_size = 3, tl.cex = 10, tl.srt = 90,
colors = palette, outline.color = "white") +
labs(title = tit) +
theme(plot.title.position = "plot")
p
}
excl <- c("qualityLow")
p3 <- plot_corr_range(df = wine_qual_df, excl = excl)
p3
We see that alcohol has the largest positive correlation
with high-quality wines, and volatile_acidity has the
largest negative correlation with high-quality wines. To a lesser
degree, fixed_acidity, citric_acid, and
sulphates are also positively correlated with high-quality
wines, and chlorides, total_sulfur_dioxide,
and density are also negatively correlated with
high-quality wines.
We also see high correlations between some of our predictors:

- fixed_acidity and citric_acid, density, and ph
- total_sulfur_dioxide and free_sulfur_dioxide
Note that we won’t be eliminating any predictors from consideration, for the same reason we aren’t making any transformations: we are primarily interested in the effect of reducing the feature space from all the original features to just principal components.
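For reference, had we wanted to prune highly correlated predictors, caret’s findCorrelation() suggests candidates to drop above a chosen cutoff (the 0.75 below is illustrative, not tuned):
# Not run: flag predictors caret would suggest dropping at |r| > 0.75
cor_mat <- cor(wine_qual_df |> select(-quality))
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)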
We split the data into train and test sets.
set.seed(1006)
sample <- sample(nrow(wine_qual_df),
round(nrow(wine_qual_df) * 0.7),
replace = FALSE)
train_df <- wine_qual_df[sample, ]
test_df <- wine_qual_df[-sample, ]
We confirm the class distributions are similar in the original, train, and test sets.
dist1 <- as.data.frame(round(prop.table(table(select(wine_qual_df, quality))), 2))
colnames(dist1) <- c("Quality", "Original Freq")
dist2 <- as.data.frame(round(prop.table(table(select(train_df, quality))), 2))
colnames(dist2) <- c("Quality", "Train Freq")
dist3 <- as.data.frame(round(prop.table(table(select(test_df, quality))), 2))
colnames(dist3) <- c("Quality", "Test Freq")
class_dist <- dist1 |>
left_join(dist2, by = join_by(Quality)) |>
left_join(dist3, by = join_by(Quality))
kable(class_dist, format = "simple")
| Quality | Original Freq | Train Freq | Test Freq |
|---|---|---|---|
| Low | 0.47 | 0.47 | 0.45 |
| High | 0.53 | 0.53 | 0.55 |
The class distributions are all pretty similar.
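Had the simple random split produced a noticeable imbalance, caret’s createDataPartition() performs stratified sampling on the response, guaranteeing similar class proportions by design. A minimal sketch (train_strat and test_strat are illustrative names):
# Not run: stratified 70/30 split preserving the quality class proportions
set.seed(1006)
idx <- createDataPartition(wine_qual_df$quality, p = 0.7, list = FALSE)
train_strat <- wine_qual_df[idx, ]
test_strat <- wine_qual_df[-idx, ]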
We build our first set of models: a Random Forest model and a Support Vector Machine model using all the original features.
A summary of the relative importance of the original features to the Random Forest model we have trained is below.
fn <- "rf_orig.rds"
if (!file.exists(fn)){
rf_orig <- train(quality ~ ., data = train_df, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 3))
saveRDS(rf_orig, fn)
}else{
rf_orig <- readRDS(fn)
}
rf_orig_imp <- varImp(rf_orig, scale = TRUE)
rf_orig_imp <- rf_orig_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_orig_imp) <- cols
rf_orig_imp <- rf_orig_imp |>
arrange(desc(Importance))
kable(rf_orig_imp, format = "simple")
| Predictor | Importance |
|---|---|
| alcohol | 100.000000 |
| sulphates | 55.744989 |
| volatile_acidity | 49.290579 |
| total_sulfur_dioxide | 39.393053 |
| density | 28.842431 |
| fixed_acidity | 19.203067 |
| chlorides | 17.144795 |
| ph | 12.119840 |
| citric_acid | 7.937798 |
| free_sulfur_dioxide | 2.027924 |
| residual_sugar | 0.000000 |
Unsurprisingly, the Random Forest model estimates alcohol to be the most important feature for predicting the response variable, as we anticipated from the correlation plot. The relative importance estimate for residual_sugar is 0, indicating it was not used in any of the trees.
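The scaled importances above come from caret’s varImp(); if desired, the unscaled mean decrease in Gini impurity can be read directly off the underlying randomForest fit:
# Raw MeanDecreaseGini values from the randomForest object that caret wraps
imp_raw <- importance(rf_orig$finalModel)
imp_raw[order(imp_raw[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]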
A summary of the best Support Vector Machine model using a radial basis kernel and the original features that we arrived at during tuning is below:
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
svm_tune_orig <- tune(svm, quality ~ .,
data = train_df, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_orig <- svm_tune_orig$best.model
saveRDS(svm_orig, fn)
}else{
svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
col1 <- c("call", "cost", "gamma", "num_classes", "classes",
"support_vectors_total", "support_vectors_split")
subset <- c("call", "cost", "gamma", "nclasses", "levels",
"tot.nSV", "nSV")
col2 <- svm_model[subset]
copy <- col2
for (i in 1:length(copy)){
if (is.vector(copy[[i]])){
col2[[i]] <- paste(col2[[i]], collapse = ", ")
}
}
summ <- as.data.frame(cbind(col1, col2))
rownames(summ) <- NULL
colnames(summ) <- c("Parameter", "Value")
summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = "radial") |
| cost | 1 |
| gamma | 0.5 |
| num_classes | 2 |
| classes | Low, High |
| support_vectors_total | 877 |
| support_vectors_split | 442, 435 |
It uses a cost of 1 and a gamma of 0.5.
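For intuition, the radial basis kernel e1071 uses is K(x, x') = exp(-gamma * ||x - x'||^2), so gamma controls how quickly similarity decays with the distance between two observations. A toy illustration:
# Toy illustration: larger gamma means an observation's influence decays
# faster with distance
rbf_kernel <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))
x1 <- c(0, 0); x2 <- c(1, 1)
rbf_kernel(x1, x2, gamma = 0.5) # ~0.368
rbf_kernel(x1, x2, gamma = 4)   # ~0.0003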
We now perform PCA. First, we check the mean and variance of all predictors to confirm we need to center and scale them.
mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
| Predictor | Mean | Variance |
|---|---|---|
| fixed_acidity | 8.32 | 3.03 |
| volatile_acidity | 0.53 | 0.03 |
| citric_acid | 0.27 | 0.04 |
| residual_sugar | 2.54 | 1.99 |
| chlorides | 0.09 | 0.00 |
| free_sulfur_dioxide | 15.87 | 109.41 |
| total_sulfur_dioxide | 46.47 | 1082.10 |
| density | 1.00 | 0.00 |
| ph | 3.31 | 0.02 |
| sulphates | 0.66 | 0.03 |
| alcohol | 10.42 | 1.14 |
Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variance than the other variables, so we center and scale all of the predictors to prevent those two from dominating the principal components. Below is the rotation, i.e., the matrix of variable loadings.
pca.out <- prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 0.49 | 0.11 | -0.12 | 0.23 | -0.08 | -0.10 | 0.35 | -0.18 | -0.19 | 0.25 | -0.64 |
| volatile_acidity | -0.24 | -0.27 | -0.45 | -0.08 | 0.22 | -0.41 | 0.53 | -0.08 | 0.13 | -0.37 | 0.00 |
| citric_acid | 0.46 | 0.15 | 0.24 | 0.08 | -0.06 | -0.07 | -0.11 | -0.38 | 0.38 | -0.62 | 0.07 |
| residual_sugar | 0.15 | -0.27 | 0.10 | 0.37 | 0.73 | -0.05 | -0.29 | 0.30 | -0.01 | -0.09 | -0.18 |
| chlorides | 0.21 | -0.15 | -0.09 | -0.67 | 0.25 | -0.30 | -0.37 | -0.36 | -0.11 | 0.22 | -0.05 |
| free_sulfur_dioxide | -0.04 | -0.51 | 0.43 | 0.04 | -0.16 | 0.01 | 0.12 | -0.20 | -0.64 | -0.25 | 0.05 |
| total_sulfur_dioxide | 0.02 | -0.57 | 0.32 | 0.03 | -0.22 | -0.14 | 0.09 | 0.02 | 0.59 | 0.37 | -0.07 |
| density | 0.40 | -0.23 | -0.34 | 0.17 | 0.16 | 0.39 | 0.17 | -0.24 | -0.02 | 0.24 | 0.57 |
| ph | -0.44 | -0.01 | 0.06 | 0.00 | 0.27 | 0.52 | 0.03 | -0.56 | 0.17 | 0.01 | -0.34 |
| sulphates | 0.24 | 0.04 | 0.28 | -0.55 | 0.23 | 0.38 | 0.45 | 0.37 | 0.06 | -0.11 | -0.07 |
| alcohol | -0.11 | 0.39 | 0.47 | 0.12 | 0.35 | -0.36 | 0.33 | -0.22 | -0.04 | 0.30 | 0.31 |
We generate a scree plot so that we can see the variance explained by each principal component.
scree_plot_df <- as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) <- c("var_explained")
scree_plot_df <- scree_plot_df |>
rowid_to_column(var = "pc")
p4 <- scree_plot_df |>
ggplot(aes(x = pc, y = var_explained)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
labs(x = "Principal Component",
y = "Variance Explained",
title = "Scree Plot")
p4
The first principal component explains a little less than 30 percent of the variance in our data, and the second explains a little less than 20 percent. Since those values are relatively low, we don’t expect models built on only the first two principal components to be sufficient, so we will go up to eight principal components.
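A quick look at the cumulative proportion of variance explained makes that trade-off concrete:
# Cumulative proportion of variance explained by the first k components
round(cumsum(scree_plot_df$var_explained), 3)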
For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.
p5 <- pca.out |>
ggbiplot(obs.scale = 1, var.scale = 1,
groups = wine_qual_df$quality,
ellipse = TRUE,
ellipse.alpha = 0.0) +
theme(legend.direction = "horizontal",
legend.position = "top") +
labs(fill = "Quality",
color = "Quality",
title = "Biplot of First Two Principal Components") +
scale_color_discrete(type = qual_cols)
p5
Indeed, there is a lot of overlap between low-quality and high-quality wines when plotted on just the first two principal components, but we do see some separation between the classes, which is promising.
We create new versions of the train and test sets using the features derived via PCA instead of the original features.
train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
bind_cols(train_df |> select(quality)) |>
relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
bind_cols(test_df |> select(quality)) |>
relocate(quality, .before = "PC1")
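As a sanity check, predict.prcomp() simply centers and scales the new data with the parameters stored in pca.out and projects it onto the rotation matrix, so the comparison below should return TRUE (manual_scores is an illustrative name):
# Verify the PCA scores: scale with the stored center/scale, then project
manual_scores <- scale(as.matrix(train_df[, rownames(pca.out$rotation)]),
                       center = pca.out$center, scale = pca.out$scale) %*%
    pca.out$rotation
all.equal(manual_scores, as.matrix(train_df_pca[, -1]), check.attributes = FALSE)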
Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, first four, first six, or first eight principal components.
First, we train four Random Forest models using increasing numbers of principal components.
fn <- "rf_2pc.rds"
if (!file.exists(fn)){
rf_2pc <- train(quality ~ PC1 + PC2,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 1))
saveRDS(rf_2pc, fn)
}else{
rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_4pc, fn)
}else{
rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_6pc, fn)
}else{
rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_8pc, fn)
}else{
rf_8pc <- readRDS(fn)
}
A summary of the relative importance of the principal components to each of these Random Forest models we have trained is below.
rf_2pc_imp <- varImp(rf_2pc, scale = TRUE)
rf_2pc_imp <- rf_2pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_2pc_imp) <- cols
rf_2pc_imp <- rf_2pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (2 Principal Components)") |>
select(-Importance)
rf_4pc_imp <- varImp(rf_4pc, scale = TRUE)
rf_4pc_imp <- rf_4pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_4pc_imp) <- cols
rf_4pc_imp <- rf_4pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (4 Principal Components)") |>
select(-Importance)
rf_6pc_imp <- varImp(rf_6pc, scale = TRUE)
rf_6pc_imp <- rf_6pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_6pc_imp) <- cols
rf_6pc_imp <- rf_6pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (6 Principal Components)") |>
select(-Importance)
rf_8pc_imp <- varImp(rf_8pc, scale = TRUE)
rf_8pc_imp <- rf_8pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_8pc_imp) <- cols
rf_8pc_imp <- rf_8pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (8 Principal Components)") |>
select(-Importance)
rf_pc_imp_df <- rf_2pc_imp |>
bind_rows(rf_4pc_imp, rf_6pc_imp, rf_8pc_imp) |>
pivot_wider(names_from = Rank, values_from = Predictor,
names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
| Model | Rank_1 | Rank_2 | Rank_3 | Rank_4 | Rank_5 | Rank_6 | Rank_7 | Rank_8 |
|---|---|---|---|---|---|---|---|---|
| Random Forest (2 Principal Components) | PC2 | PC1 | NA | NA | NA | NA | NA | NA |
| Random Forest (4 Principal Components) | PC2 | PC3 | PC1 | PC4 | NA | NA | NA | NA |
| Random Forest (6 Principal Components) | PC2 | PC3 | PC1 | PC5 | PC4 | PC6 | NA | NA |
| Random Forest (8 Principal Components) | PC2 | PC3 | PC5 | PC7 | PC1 | PC6 | PC8 | PC4 |
All of these Random Forest models rank the second principal component as their most important feature, and most of them rank the third principal component as their second-most important feature. Only the Random Forest model built using only the first two principal components includes the first principal component in its top two features.
Next we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best Support Vector Machine models using radial basis kernels that we arrived at during tuning are below:
fn <- "svm_2pc.rds"
if (!file.exists(fn)){
svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_2pc <- svm_tune_2pc$best.model
saveRDS(svm_2pc, fn)
}else{
svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_4pc <- svm_tune_4pc$best.model
saveRDS(svm_4pc, fn)
}else{
svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_6pc <- svm_tune_6pc$best.model
saveRDS(svm_6pc, fn)
}else{
svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
svm_tune_8pc <- tune(svm,
quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_8pc <- svm_tune_8pc$best.model
saveRDS(svm_8pc, fn)
}else{
svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (2 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (4 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (6 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (8 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
| Model | cost | gamma | num_classes | classes | support_vectors_total | support_vectors_split |
|---|---|---|---|---|---|---|
| Support Vector Machine (2 Principal Components) | 10 | 0.5 | 2 | Low, High | 820 | 410, 410 |
| Support Vector Machine (4 Principal Components) | 0.1 | 0.5 | 2 | Low, High | 824 | 413, 411 |
| Support Vector Machine (6 Principal Components) | 1 | 0.5 | 2 | Low, High | 763 | 387, 376 |
| Support Vector Machine (8 Principal Components) | 10 | 0.5 | 2 | Low, High | 755 | 381, 374 |
They use various costs between 0.1 and 10, but they all use the same gamma: 0.5.
We make predictions on the test data using all models, and we construct confusion matrices and calculate a variety of performance metrics for them.
First, we look at the confusion matrices for each of the models.
#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (8C)")
#Support Vector Machine Orig: predictions/confusion matrix
pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 2) +
scale_fill_gradient(low = "white", high = pal[1]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
labs(title = "Confusion Matrices for All Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.placement = "outside")
p6
Then we calculate the performance metrics for each of the models.
metrics <- as.data.frame(cbind(rbind(rf_orig_cm_complete$byClass,
rf_2pc_cm_complete$byClass,
rf_4pc_cm_complete$byClass,
rf_6pc_cm_complete$byClass,
rf_8pc_cm_complete$byClass,
svm_orig_cm_complete$byClass,
svm_2pc_cm_complete$byClass,
svm_4pc_cm_complete$byClass,
svm_6pc_cm_complete$byClass,
svm_8pc_cm_complete$byClass),
rbind(rf_orig_cm_complete$overall,
rf_2pc_cm_complete$overall,
rf_4pc_cm_complete$overall,
rf_6pc_cm_complete$overall,
rf_8pc_cm_complete$overall,
svm_orig_cm_complete$overall,
svm_2pc_cm_complete$overall,
svm_4pc_cm_complete$overall,
svm_6pc_cm_complete$overall,
svm_8pc_cm_complete$overall)))
rownames(metrics) <- c("Random Forest (Original Features)",
"Random Forest (2 Principal Components)",
"Random Forest (4 Principal Components)",
"Random Forest (6 Principal Components)",
"Random Forest (8 Principal Components)",
"Support Vector Machine (Original Features)",
"Support Vector Machine (2 Principal Components)",
"Support Vector Machine (4 Principal Components)",
"Support Vector Machine (6 Principal Components)",
"Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
select(all_of(keep)) |>
round(3)
kable(metrics, format = "simple")
|  | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest (Original Features) | 0.810 | 0.814 | 0.848 | 0.831 |
| Random Forest (2 Principal Components) | 0.715 | 0.723 | 0.776 | 0.749 |
| Random Forest (4 Principal Components) | 0.802 | 0.809 | 0.837 | 0.822 |
| Random Forest (6 Principal Components) | 0.804 | 0.807 | 0.844 | 0.825 |
| Random Forest (8 Principal Components) | 0.806 | 0.815 | 0.837 | 0.826 |
| Support Vector Machine (Original Features) | 0.783 | 0.787 | 0.829 | 0.807 |
| Support Vector Machine (2 Principal Components) | 0.654 | 0.678 | 0.703 | 0.690 |
| Support Vector Machine (4 Principal Components) | 0.715 | 0.744 | 0.730 | 0.737 |
| Support Vector Machine (6 Principal Components) | 0.748 | 0.773 | 0.764 | 0.769 |
| Support Vector Machine (8 Principal Components) | 0.773 | 0.773 | 0.829 | 0.800 |
Both the Random Forest and Support Vector Machine models perform best on this dataset using the original features rather than any number of principal components. Using only four principal components does, however, get the Random Forest model close to the Accuracy and F1 Score of the Random Forest model trained on the original features. The Support Vector Machine, in contrast, requires twice as many principal components, i.e. eight, to approach the Accuracy and F1 Score of the Support Vector Machine model trained on the original features.
Since we only had eleven numeric predictors to begin with, reducing the feature space wasn’t really imperative, but we now have a framework for performing PCA on datasets with far more features, where reducing the feature space is more necessary.
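For such datasets, caret can bundle the whole recipe: preProcess() will fit centering, scaling, and PCA on the training data alone, retain enough components to hit a variance threshold, and then apply the identical projection to new data. A hedged sketch (the 0.95 threshold and the pp_pca, train_pcs, and test_pcs names are illustrative):
# Not run: fit PCA on the training data only, keeping components that
# together explain 95% of the variance, then project both sets
pp_pca <- preProcess(train_df |> select(-quality),
                     method = c("center", "scale", "pca"), thresh = 0.95)
train_pcs <- predict(pp_pca, train_df |> select(-quality))
test_pcs <- predict(pp_pca, test_df |> select(-quality))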
There are often many benefits to reducing the feature space, especially when businesses track a lot of features for the products they sell. Jeans in particular come to mind as a product with many features (e.g. color, size, fit, style, stretch, whether they have a button fly or a zipper, whether they come distressed, etc.). If you can instead represent a product line like that with four or fewer principal components, you eliminate a lot of noise, your models run faster, and your predictions can be as good as or better than predictions from models that use every feature. It’s also remarkable to see high-dimensional data reduced to two dimensions, even in a classification problem like ours, where we knew perfect separation of the wine quality classes wouldn’t be visible in just two dimensions.
You can view the video presentation here.