library(caret)
library(DataExplorer)
library(e1071)
library(ggbiplot)
library(ggcorrplot)
library(knitr)
library(psych)
library(randomForest)
library(RColorBrewer)
library(snakecase)
library(tidyverse)
We load a dataset of red wine quality. We will use this dataset to compare the performance of two classification models, first using all the original features, then using a reduced set of features derived via Principal Component Analysis (PCA). This will give us insight into how much the feature space can be reduced without losing predictive power, and we may even see predictive gains.
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 8, name = "Dark2")
my_url1 <- "https://raw.githubusercontent.com/geedoubledee/data622_homework4/main/winequality-red.csv"
wine_qual_df <- read.csv(my_url1)
cols <- to_snake_case(colnames(wine_qual_df))
colnames(wine_qual_df) <- cols
wine_qual_df <- wine_qual_df |>
mutate(ph = p_h) |>
select(-p_h) |>
relocate(quality, .before = fixed_acidity) |>
relocate(ph, .before = sulphates)
Below are the first 10 observations in the dataset; for the sake of readability, we display only the first 10 columns.
kable(wine_qual_df[1:10, 1:10], format = "simple")
| quality | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 |
| 5 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 |
| 6 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 |
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 |
| 5 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 |
| 7 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9 | 18 | 0.9968 | 3.36 |
| 5 | 7.5 | 0.50 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 |
The first column is the response variable that we will attempt to predict: quality. Wines were rated on a scale from 0 to 10, but the dataset includes no observations rated below three or above eight. We recode quality as a binary factor, replacing values less than six with “Low” and values greater than or equal to six with “High.”
wine_qual_df <- wine_qual_df |>
mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
levels = c("Low", "High")))
In addition to the response variable, there are 11 numeric predictor variables:
classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "numeric")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| numeric | 11 | alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity |
We check for any missing values within the dataset.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| Metric | Value |
|---|---|
| rows | 1599 |
| columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.
qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs <- nrow(wine_qual_df)
p1 <- wine_qual_df |>
ggplot(aes(x = quality)) +
geom_bar(aes(color = quality, fill = quality)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Low- & High- Quality Wines",
x = "QUALITY",
y = "COUNT") +
theme(legend.position = "none")
p1
There are pretty similar numbers of low-quality and high-quality wines in the dataset, so there is no class imbalance issue to address here. We summarize the distributions of the predictors below.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
|  | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| fixed_acidity | 8.3196373 | 1.7410963 | 7.90000 | 4.60000 | 15.90000 | 11.30000 |
| volatile_acidity | 0.5278205 | 0.1790597 | 0.52000 | 0.12000 | 1.58000 | 1.46000 |
| citric_acid | 0.2709756 | 0.1948011 | 0.26000 | 0.00000 | 1.00000 | 1.00000 |
| residual_sugar | 2.5388055 | 1.4099281 | 2.20000 | 0.90000 | 15.50000 | 14.60000 |
| chlorides | 0.0874665 | 0.0470653 | 0.07900 | 0.01200 | 0.61100 | 0.59900 |
| free_sulfur_dioxide | 15.8749218 | 10.4601570 | 14.00000 | 1.00000 | 72.00000 | 71.00000 |
| total_sulfur_dioxide | 46.4677924 | 32.8953245 | 38.00000 | 6.00000 | 289.00000 | 283.00000 |
| density | 0.9967467 | 0.0018873 | 0.99675 | 0.99007 | 1.00369 | 0.01362 |
| ph | 3.3111132 | 0.1543865 | 3.31000 | 2.74000 | 4.01000 | 1.27000 |
| sulphates | 0.6581488 | 0.1695070 | 0.62000 | 0.33000 | 2.00000 | 1.67000 |
| alcohol | 10.4229831 | 1.0656676 | 10.20000 | 8.40000 | 14.90000 | 6.50000 |
None of the distributions appear degenerate, but we confirm there are no near-zero variance predictors nonetheless.
nzv <- nearZeroVar(wine_qual_df, names = TRUE, saveMetrics = FALSE)
length(nzv) == 0
## [1] TRUE
There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.
skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- wine_qual_piv |>
ggplot(aes(x = VALUE, color = quality, fill = quality)) +
geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
alpha = 0.5) +
geom_histogram(data = subset(wine_qual_piv, quality == "High"),
alpha = 0.5) +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
labs(title = "Distribution of Predictors",
y = "COUNT") +
theme(legend.position = "top")
p2
We can see the means for alcohol and volatile_acidity differ between low-quality and high-quality wines. We can also see some skewed distributions, but we won’t be making any transformations, since the focus of our analysis is the impact of reducing the feature space on model performance.
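Had we wanted to address the skew, caret’s preProcess() could estimate a Yeo-Johnson transformation for every numeric predictor. A minimal sketch of the approach we are deliberately skipping (pp_yj and wine_qual_yj are illustrative names):
# Not run: estimate and apply a Yeo-Johnson transformation to each predictor
pp_yj <- preProcess(wine_qual_df |> select(-quality), method = "YeoJohnson")
wine_qual_yj <- predict(pp_yj, wine_qual_df |> select(-quality))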
We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.
plot_corr_range <- function(df, mn=0.1, mx=1.0, excl=c(NA)){
palette <- brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
tit <- sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
r <- model.matrix(~0+., data = df) |>
cor() |>
round(digits=2)
is.na(r) <- abs(r) > mx
is.na(r) <- abs(r) < mn
if (!all(is.na(excl))){ # skip the exclusion step when excl is NA (the default)
r <- as.data.frame(r) |>
select(-all_of(excl)) |>
filter(!rownames(r) %in% excl)
}
p <- r |>
ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
lab_size = 3, tl.cex = 10, tl.srt = 90,
colors = palette, outline.color = "white") +
labs(title = tit) +
theme(plot.title.position = "plot")
p
}
excl <- c("qualityLow")
p3 <- plot_corr_range(df = wine_qual_df, excl = excl)
p3
We see that alcohol has the largest positive correlation
with high-quality wines, and volatile_acidity has the
largest negative correlation with high-quality wines. To a lesser
degree, fixed_acidity, citric_acid, and
sulphates are also positively correlated with high-quality
wines, and chlorides, total_sulfur_dioxide,
and density are also negatively correlated with
high-quality wines.
We also see high correlations between some of our predictors:

- fixed_acidity and citric_acid, density, and ph
- total_sulfur_dioxide and free_sulfur_dioxide
Note that we won’t be eliminating any predictors from consideration, for the same reason we aren’t making any transformations: we are primarily interested in the effect of reducing the feature space from all the original features to just principal components.
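For reference, had we wanted to prune highly correlated predictors, caret’s findCorrelation() suggests candidates to drop above a chosen cutoff (the 0.75 below is illustrative, not tuned):
# Not run: flag predictors caret would suggest dropping at |r| > 0.75
cor_mat <- cor(wine_qual_df |> select(-quality))
findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)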
We split the data into train and test sets.
set.seed(1006)
sample <- sample(nrow(wine_qual_df),
round(nrow(wine_qual_df) * 0.7),
replace = FALSE)
train_df <- wine_qual_df[sample, ]
test_df <- wine_qual_df[-sample, ]
We confirm the class distributions are similar in the original, train, and test sets.
dist1 <- as.data.frame(round(prop.table(table(select(wine_qual_df, quality))), 2))
colnames(dist1) <- c("Quality", "Original Freq")
dist2 <- as.data.frame(round(prop.table(table(select(train_df, quality))), 2))
colnames(dist2) <- c("Quality", "Train Freq")
dist3 <- as.data.frame(round(prop.table(table(select(test_df, quality))), 2))
colnames(dist3) <- c("Quality", "Test Freq")
class_dist <- dist1 |>
left_join(dist2, by = join_by(Quality)) |>
left_join(dist3, by = join_by(Quality))
kable(class_dist, format = "simple")
| Quality | Original Freq | Train Freq | Test Freq |
|---|---|---|---|
| Low | 0.47 | 0.47 | 0.45 |
| High | 0.53 | 0.53 | 0.55 |
The class distributions are all pretty similar.
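Had the simple random split produced a noticeable imbalance, caret’s createDataPartition() performs stratified sampling on the response, guaranteeing similar class proportions by design. A minimal sketch (train_strat and test_strat are illustrative names):
# Not run: stratified 70/30 split preserving the quality class proportions
set.seed(1006)
idx <- createDataPartition(wine_qual_df$quality, p = 0.7, list = FALSE)
train_strat <- wine_qual_df[idx, ]
test_strat <- wine_qual_df[-idx, ]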
We build our first set of models: a Random Forest model and a Support Vector Machine model using all the original features.
A summary of the relative importance of the original features to the Random Forest model we have trained is below.
fn <- "rf_orig.rds"
if (!file.exists(fn)){
rf_orig <- train(quality ~ ., data = train_df, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 3))
saveRDS(rf_orig, fn)
}else{
rf_orig <- readRDS(fn)
}
rf_orig_imp <- varImp(rf_orig, scale = TRUE)
rf_orig_imp <- rf_orig_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_orig_imp) <- cols
rf_orig_imp <- rf_orig_imp |>
arrange(desc(Importance))
kable(rf_orig_imp, format = "simple")
| Predictor | Importance |
|---|---|
| alcohol | 100.000000 |
| sulphates | 55.744989 |
| volatile_acidity | 49.290579 |
| total_sulfur_dioxide | 39.393053 |
| density | 28.842431 |
| fixed_acidity | 19.203067 |
| chlorides | 17.144795 |
| ph | 12.119840 |
| citric_acid | 7.937798 |
| free_sulfur_dioxide | 2.027924 |
| residual_sugar | 0.000000 |
Unsurprisingly, the Random Forest model estimates alcohol to be the most important feature for predicting the response variable, as we anticipated from the correlation plot. The relative importance estimate for residual_sugar is 0, indicating it was not used in any of the trees.
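The scaled importances above come from caret’s varImp(); if desired, the unscaled mean decrease in Gini impurity can be read directly off the underlying randomForest fit:
# Raw MeanDecreaseGini values from the randomForest object that caret wraps
imp_raw <- importance(rf_orig$finalModel)
imp_raw[order(imp_raw[, "MeanDecreaseGini"], decreasing = TRUE), , drop = FALSE]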
A summary of the best Support Vector Machine model using a radial basis kernel and the original features that we arrived at during tuning is below:
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
svm_tune_orig <- tune(svm, quality ~ .,
data = train_df, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_orig <- svm_tune_orig$best.model
saveRDS(svm_orig, fn)
}else{
svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
col1 <- c("call", "cost", "gamma", "num_classes", "classes",
"support_vectors_total", "support_vectors_split")
subset <- c("call", "cost", "gamma", "nclasses", "levels",
"tot.nSV", "nSV")
col2 <- svm_model[subset]
copy <- col2
for (i in 1:length(copy)){
if (is.vector(copy[[i]])){
col2[[i]] <- paste(col2[[i]], collapse = ", ")
}
}
summ <- as.data.frame(cbind(col1, col2))
rownames(summ) <- NULL
colnames(summ) <- c("Parameter", "Value")
summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = "radial") |
| cost | 1 |
| gamma | 0.5 |
| num_classes | 2 |
| classes | Low, High |
| support_vectors_total | 877 |
| support_vectors_split | 442, 435 |
It uses a cost of 1 and a gamma of 0.5.
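For intuition, the radial basis kernel e1071 uses is K(x, x') = exp(-gamma * ||x - x'||^2), so gamma controls how quickly similarity decays with the distance between two observations. A toy illustration:
# Toy illustration: larger gamma means an observation's influence decays
# faster with distance
rbf_kernel <- function(x1, x2, gamma) exp(-gamma * sum((x1 - x2)^2))
x1 <- c(0, 0); x2 <- c(1, 1)
rbf_kernel(x1, x2, gamma = 0.5) # ~0.368
rbf_kernel(x1, x2, gamma = 4)   # ~0.0003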
We now perform PCA. First, we check the mean and variance of all predictors to confirm we need to center and scale them.
mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
| Predictor | Mean | Variance |
|---|---|---|
| fixed_acidity | 8.32 | 3.03 |
| volatile_acidity | 0.53 | 0.03 |
| citric_acid | 0.27 | 0.04 |
| residual_sugar | 2.54 | 1.99 |
| chlorides | 0.09 | 0.00 |
| free_sulfur_dioxide | 15.87 | 109.41 |
| total_sulfur_dioxide | 46.47 | 1082.10 |
| density | 1.00 | 0.00 |
| ph | 3.31 | 0.02 |
| sulphates | 0.66 | 0.03 |
| alcohol | 10.42 | 1.14 |
Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variance than the other variables, so we center and scale all of the predictors to prevent those two from dominating the principal components. Below is the rotation, i.e., the matrix of variable loadings.
pca.out <- prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 0.49 | 0.11 | -0.12 | 0.23 | -0.08 | -0.10 | 0.35 | -0.18 | -0.19 | 0.25 | -0.64 |
| volatile_acidity | -0.24 | -0.27 | -0.45 | -0.08 | 0.22 | -0.41 | 0.53 | -0.08 | 0.13 | -0.37 | 0.00 |
| citric_acid | 0.46 | 0.15 | 0.24 | 0.08 | -0.06 | -0.07 | -0.11 | -0.38 | 0.38 | -0.62 | 0.07 |
| residual_sugar | 0.15 | -0.27 | 0.10 | 0.37 | 0.73 | -0.05 | -0.29 | 0.30 | -0.01 | -0.09 | -0.18 |
| chlorides | 0.21 | -0.15 | -0.09 | -0.67 | 0.25 | -0.30 | -0.37 | -0.36 | -0.11 | 0.22 | -0.05 |
| free_sulfur_dioxide | -0.04 | -0.51 | 0.43 | 0.04 | -0.16 | 0.01 | 0.12 | -0.20 | -0.64 | -0.25 | 0.05 |
| total_sulfur_dioxide | 0.02 | -0.57 | 0.32 | 0.03 | -0.22 | -0.14 | 0.09 | 0.02 | 0.59 | 0.37 | -0.07 |
| density | 0.40 | -0.23 | -0.34 | 0.17 | 0.16 | 0.39 | 0.17 | -0.24 | -0.02 | 0.24 | 0.57 |
| ph | -0.44 | -0.01 | 0.06 | 0.00 | 0.27 | 0.52 | 0.03 | -0.56 | 0.17 | 0.01 | -0.34 |
| sulphates | 0.24 | 0.04 | 0.28 | -0.55 | 0.23 | 0.38 | 0.45 | 0.37 | 0.06 | -0.11 | -0.07 |
| alcohol | -0.11 | 0.39 | 0.47 | 0.12 | 0.35 | -0.36 | 0.33 | -0.22 | -0.04 | 0.30 | 0.31 |
We generate a scree plot so that we can see the variance explained by each principal component.
scree_plot_df <- as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) <- c("var_explained")
scree_plot_df <- scree_plot_df |>
rowid_to_column(var = "pc")
p4 <- scree_plot_df |>
ggplot(aes(x = pc, y = var_explained)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
labs(x = "Principal Component",
y = "Variance Explained",
title = "Scree Plot")
p4
The first principal component explains a little less than 30 percent of the variance in our data, and the second explains a little less than 20 percent. Since those values are relatively low, we don’t expect models built on only the first two principal components to be sufficient, so we will go up to eight principal components.
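A quick look at the cumulative proportion of variance explained makes that trade-off concrete:
# Cumulative proportion of variance explained by the first k components
round(cumsum(scree_plot_df$var_explained), 3)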
For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.
p5 <- pca.out |>
ggbiplot(obs.scale = 1, var.scale = 1,
groups = wine_qual_df$quality,
ellipse = TRUE,
ellipse.alpha = 0.0) +
theme(legend.direction = "horizontal",
legend.position = "top") +
labs(fill = "Quality",
color = "Quality",
title = "Biplot of First Two Principal Components") +
scale_color_discrete(type = qual_cols)
p5
Indeed, there is a lot of overlap between low-quality and high-quality wines when plotted on just the first two principal components, but we do see some separation between the classes, which is promising.
We create new versions of the train and test sets using the features derived via PCA instead of the original features.
train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
bind_cols(train_df |> select(quality)) |>
relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
bind_cols(test_df |> select(quality)) |>
relocate(quality, .before = "PC1")
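As a sanity check, predict.prcomp() simply centers and scales the new data with the parameters stored in pca.out and projects it onto the rotation matrix, so the comparison below should return TRUE (manual_scores is an illustrative name):
# Verify the PCA scores: scale with the stored center/scale, then project
manual_scores <- scale(as.matrix(train_df[, rownames(pca.out$rotation)]),
                       center = pca.out$center, scale = pca.out$scale) %*%
    pca.out$rotation
all.equal(manual_scores, as.matrix(train_df_pca[, -1]), check.attributes = FALSE)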
Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, first four, first six, or first eight principal components.
First, we train four Random Forest models using increasing numbers of principal components.
fn <- "rf_2pc.rds"
if (!file.exists(fn)){
rf_2pc <- train(quality ~ PC1 + PC2,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 1))
saveRDS(rf_2pc, fn)
}else{
rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_4pc, fn)
}else{
rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_6pc, fn)
}else{
rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_8pc, fn)
}else{
rf_8pc <- readRDS(fn)
}
A summary of the relative importance of the principal components to each of these Random Forest models we have trained is below.
rf_2pc_imp <- varImp(rf_2pc, scale = TRUE)
rf_2pc_imp <- rf_2pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_2pc_imp) <- cols
rf_2pc_imp <- rf_2pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (2 Principal Components)") |>
select(-Importance)
rf_4pc_imp <- varImp(rf_4pc, scale = TRUE)
rf_4pc_imp <- rf_4pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_4pc_imp) <- cols
rf_4pc_imp <- rf_4pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (4 Principal Components)") |>
select(-Importance)
rf_6pc_imp <- varImp(rf_6pc, scale = TRUE)
rf_6pc_imp <- rf_6pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_6pc_imp) <- cols
rf_6pc_imp <- rf_6pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (6 Principal Components)") |>
select(-Importance)
rf_8pc_imp <- varImp(rf_8pc, scale = TRUE)
rf_8pc_imp <- rf_8pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_8pc_imp) <- cols
rf_8pc_imp <- rf_8pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (8 Principal Components)") |>
select(-Importance)
rf_pc_imp_df <- rf_2pc_imp |>
bind_rows(rf_4pc_imp, rf_6pc_imp, rf_8pc_imp) |>
pivot_wider(names_from = Rank, values_from = Predictor,
names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
| Model | Rank_1 | Rank_2 | Rank_3 | Rank_4 | Rank_5 | Rank_6 | Rank_7 | Rank_8 |
|---|---|---|---|---|---|---|---|---|
| Random Forest (2 Principal Components) | PC2 | PC1 | NA | NA | NA | NA | NA | NA |
| Random Forest (4 Principal Components) | PC2 | PC3 | PC1 | PC4 | NA | NA | NA | NA |
| Random Forest (6 Principal Components) | PC2 | PC3 | PC1 | PC5 | PC4 | PC6 | NA | NA |
| Random Forest (8 Principal Components) | PC2 | PC3 | PC5 | PC7 | PC1 | PC6 | PC8 | PC4 |
All of these Random Forest models rank the second principal component as their most important feature, and most of them rank the third principal component as their second-most important feature. Only the Random Forest model built using only the first two principal components includes the first principal component in its top two features.
Next we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best Support Vector Machine models using radial basis kernels that we arrived at during tuning are below:
fn <- "svm_2pc.rds"
if (!file.exists(fn)){
svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_2pc <- svm_tune_2pc$best.model
saveRDS(svm_2pc, fn)
}else{
svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_4pc <- svm_tune_4pc$best.model
saveRDS(svm_4pc, fn)
}else{
svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_6pc <- svm_tune_6pc$best.model
saveRDS(svm_6pc, fn)
}else{
svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
svm_tune_8pc <- tune(svm,
quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_8pc <- svm_tune_8pc$best.model
saveRDS(svm_8pc, fn)
}else{
svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (2 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (4 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (6 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (8 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
| Model | cost | gamma | num_classes | classes | support_vectors_total | support_vectors_split |
|---|---|---|---|---|---|---|
| Support Vector Machine (2 Principal Components) | 10 | 0.5 | 2 | Low, High | 820 | 410, 410 |
| Support Vector Machine (4 Principal Components) | 0.1 | 0.5 | 2 | Low, High | 824 | 413, 411 |
| Support Vector Machine (6 Principal Components) | 1 | 0.5 | 2 | Low, High | 763 | 387, 376 |
| Support Vector Machine (8 Principal Components) | 10 | 0.5 | 2 | Low, High | 755 | 381, 374 |
They use various costs between 0.1 and 10, but they all use the same gamma: 0.5.
We make predictions on the test data using all models, and we construct confusion matrices and calculate a variety of performance metrics for them.
First, we look at the confusion matrices for each of the models.
#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (8C)")
#Support Vector Machine Orig: predictions/confusion matrix
pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 2) +
scale_fill_gradient(low = "white", high = pal[1]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
labs(title = "Confusion Matrices for All Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.placement = "outside")
p6
Then we calculate the performance metrics for each of the models.
metrics <- as.data.frame(cbind(rbind(rf_orig_cm_complete$byClass,
rf_2pc_cm_complete$byClass,
rf_4pc_cm_complete$byClass,
rf_6pc_cm_complete$byClass,
rf_8pc_cm_complete$byClass,
svm_orig_cm_complete$byClass,
svm_2pc_cm_complete$byClass,
svm_4pc_cm_complete$byClass,
svm_6pc_cm_complete$byClass,
svm_8pc_cm_complete$byClass),
rbind(rf_orig_cm_complete$overall,
rf_2pc_cm_complete$overall,
rf_4pc_cm_complete$overall,
rf_6pc_cm_complete$overall,
rf_8pc_cm_complete$overall,
svm_orig_cm_complete$overall,
svm_2pc_cm_complete$overall,
svm_4pc_cm_complete$overall,
svm_6pc_cm_complete$overall,
svm_8pc_cm_complete$overall)))
rownames(metrics) <- c("Random Forest (Original Features)",
"Random Forest (2 Principal Components)",
"Random Forest (4 Principal Components)",
"Random Forest (6 Principal Components)",
"Random Forest (8 Principal Components)",
"Support Vector Machine (Original Features)",
"Support Vector Machine (2 Principal Components)",
"Support Vector Machine (4 Principal Components)",
"Support Vector Machine (6 Principal Components)",
"Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
select(all_of(keep)) |>
round(3)
kable(metrics, format = "simple")
|  | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest (Original Features) | 0.810 | 0.814 | 0.848 | 0.831 |
| Random Forest (2 Principal Components) | 0.715 | 0.723 | 0.776 | 0.749 |
| Random Forest (4 Principal Components) | 0.802 | 0.809 | 0.837 | 0.822 |
| Random Forest (6 Principal Components) | 0.804 | 0.807 | 0.844 | 0.825 |
| Random Forest (8 Principal Components) | 0.806 | 0.815 | 0.837 | 0.826 |
| Support Vector Machine (Original Features) | 0.783 | 0.787 | 0.829 | 0.807 |
| Support Vector Machine (2 Principal Components) | 0.654 | 0.678 | 0.703 | 0.690 |
| Support Vector Machine (4 Principal Components) | 0.715 | 0.744 | 0.730 | 0.737 |
| Support Vector Machine (6 Principal Components) | 0.748 | 0.773 | 0.764 | 0.769 |
| Support Vector Machine (8 Principal Components) | 0.773 | 0.773 | 0.829 | 0.800 |
Both the Random Forest and Support Vector Machine models perform best on this dataset using the original features rather than any number of principal components. Using only four principal components does, however, get the Random Forest model close to the Accuracy and F1 Score of the Random Forest model trained on the original features. The Support Vector Machine, in contrast, requires twice as many principal components, i.e. eight, to approach the Accuracy and F1 Score of the Support Vector Machine model trained on the original features.
Since we only had eleven numeric predictors to begin with, reducing the feature space wasn’t really imperative, but we now have a framework for performing PCA on datasets with far more features, where reducing the feature space is more necessary.
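For such datasets, caret can bundle the whole recipe: preProcess() will fit centering, scaling, and PCA on the training data alone, retain enough components to hit a variance threshold, and then apply the identical projection to new data. A hedged sketch (the 0.95 threshold and the pp_pca, train_pcs, and test_pcs names are illustrative):
# Not run: fit PCA on the training data only, keeping components that
# together explain 95% of the variance, then project both sets
pp_pca <- preProcess(train_df |> select(-quality),
                     method = c("center", "scale", "pca"), thresh = 0.95)
train_pcs <- predict(pp_pca, train_df |> select(-quality))
test_pcs <- predict(pp_pca, test_df |> select(-quality))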
There are often many benefits to reducing the feature space, especially when businesses track a lot of features for the products they sell. Jeans in particular come to mind as a product with many features (e.g. color, size, fit, style, stretch, whether they have a button fly or a zipper, whether they come distressed, etc.). If you can instead represent a product line like that with four or fewer principal components, you eliminate a lot of noise, your models run faster, and your predictions can be as good as or better than predictions from models that use every feature. It’s also remarkable to see high-dimensional data reduced to two dimensions, even in a classification problem like ours, where we knew perfect separation of the wine quality classes wouldn’t be visible in just two dimensions.
You can view the video presentation here.