Instructions
Exploratory analysis and essay
Assignment
Choose a dataset
You get to decide which dataset you want to work on. The dataset must be different from the ones used in previous homeworks. You can work on a problem from your job or something you are interested in. You may also obtain a dataset from sites such as Kaggle, Data.gov, the Census Bureau, USGS, or other open data portals.
Select one of the methodologies studied in weeks 1-10, and another methodology from weeks 11-15, to apply to the newly selected dataset.
To complete this task:
- Describe the problem you are trying to solve.
- Describe your dataset and what you did to prepare the data for analysis.
- Describe the methodologies you used for analyzing the data.
- Explain the purpose of the analysis performed.
- State the conclusions from your analysis, being sure to address the business impact (it could be of any domain) of your solution.
Introduction
We load a dataset of red wine quality and use it to compare the performance of two different classification models, first using all the original features, then using a reduced set of features derived via Principal Component Analysis (PCA). This will give us insight into how much the feature space can be reduced without losing predictive power; we may even see predictive gains.
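The analysis assumes a number of packages are already attached. The setup chunk wasn't included here, but based on the functions called throughout the report, it would look something like this:
library(knitr)         # kable()
library(tibble)        # rownames_to_column(), rowid_to_column()
library(dplyr)         # mutate(), select(), relocate(), joins
library(tidyr)         # pivot_longer(), pivot_wider()
library(ggplot2)       # plots and themes
library(RColorBrewer)  # brewer.pal()
library(snakecase)     # to_snake_case()
library(psych)         # describe()
library(DataExplorer)  # introduce()
library(caret)         # train(), varImp(), confusionMatrix()
library(e1071)         # svm(), tune(), tune.control()
library(ggcorrplot)    # ggcorrplot()
library(ggbiplot)      # ggbiplot()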
cur_theme <- theme_set(theme_classic())
pal <- brewer.pal(n = 8, name = "Dark2")
my_url1 <- "winequality.csv"
wine_qual_df <- read.csv(my_url1)
cols <- to_snake_case(colnames(wine_qual_df))
colnames(wine_qual_df) <- cols
wine_qual_df <- wine_qual_df |>
mutate(ph = p_h) |>
select(-p_h) |>
relocate(quality, .before = fixed_acidity) |>
relocate(ph, .before = sulphates)
Below are the first 10 observations in the dataset; for the sake of readability, we display only the first 10 columns.
| quality | fixed_acidity | volatile_acidity | citric_acid | residual_sugar | chlorides | free_sulfur_dioxide | total_sulfur_dioxide | density | ph |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.8 | 0.88 | 0.00 | 2.6 | 0.098 | 25 | 67 | 0.9968 | 3.20 |
| 5 | 7.8 | 0.76 | 0.04 | 2.3 | 0.092 | 15 | 54 | 0.9970 | 3.26 |
| 6 | 11.2 | 0.28 | 0.56 | 1.9 | 0.075 | 17 | 60 | 0.9980 | 3.16 |
| 5 | 7.4 | 0.70 | 0.00 | 1.9 | 0.076 | 11 | 34 | 0.9978 | 3.51 |
| 5 | 7.4 | 0.66 | 0.00 | 1.8 | 0.075 | 13 | 40 | 0.9978 | 3.51 |
| 5 | 7.9 | 0.60 | 0.06 | 1.6 | 0.069 | 15 | 59 | 0.9964 | 3.30 |
| 7 | 7.3 | 0.65 | 0.00 | 1.2 | 0.065 | 15 | 21 | 0.9946 | 3.39 |
| 7 | 7.8 | 0.58 | 0.02 | 2.0 | 0.073 | 9 | 18 | 0.9968 | 3.36 |
| 5 | 7.5 | 0.50 | 0.36 | 6.1 | 0.071 | 17 | 102 | 0.9978 | 3.35 |
The first column is the response variable that we will attempt to predict: quality. Wines were rated on a scale from 0 to 10, but the dataset doesn’t actually include any observations with ratings less than three or higher than eight. We recode quality as a binary factor and replace values less than six with “Low” and values greater than or equal to six with “High.”
In addition to the response variable, there are 11 numeric predictor variables:
wine_qual_df <- wine_qual_df |>
mutate(quality = factor(ifelse(quality < 6, "Low", "High"),
levels = c("Low", "High")))
classes <- as.data.frame(unlist(lapply(wine_qual_df, class))) |>
rownames_to_column()
cols <- c("Variable", "Class")
colnames(classes) <- cols
classes_summary <- classes |>
group_by(Class) |>
summarize(Count = n(),
Variables = paste(sort(unique(Variable)),collapse=", ")) |>
filter(Class == "numeric")
kable(classes_summary, format = "simple")
| Class | Count | Variables |
|---|---|---|
| numeric | 11 | alcohol, chlorides, citric_acid, density, fixed_acidity, free_sulfur_dioxide, ph, residual_sugar, sulphates, total_sulfur_dioxide, volatile_acidity |
Exploratory Data Analysis
We check for any missing values within the dataset.
rem <- c("discrete_columns", "continuous_columns",
"total_observations", "memory_usage")
completeness <- introduce(wine_qual_df) |>
select(-all_of(rem))
knitr::kable(t(completeness), format = "simple")
| rows | 1599 |
| columns | 12 |
| all_missing_columns | 0 |
| total_missing_values | 0 |
| complete_rows | 1599 |
There are no missing values to address in this dataset. We check the distribution of the response variable to see if there’s a class imbalance between high-quality wines and low-quality wines.
qual_cols <- pal[c(2, 3)]
names(qual_cols) <- c("Low", "High")
obs = nrow(wine_qual_df)
p1 <- wine_qual_df |>
ggplot(aes(x = quality)) +
geom_bar(aes(color = quality, fill = quality)) +
geom_text(stat = "count", aes(label = paste0(round(
after_stat(count) / obs * 100, 1), "%")),
size = 5, color = "white", vjust = 2, fontface = "bold") +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
labs(title = "Distribution of Low- & High- Quality Wines",
x = "QUALITY",
y = "COUNT") +
theme(legend.position = "none")
p1
The numbers of low-quality and high-quality wines in the dataset are similar, so there is no class imbalance issue to address here. We summarize the distributions of the predictors below.
rem <- c("vars", "n", "trimmed", "mad", "skew", "kurtosis", "se")
excl <- c("quality*")
describe <- describe(wine_qual_df) |>
select(-all_of(rem))
describe <- describe |>
filter(!rownames(describe) %in% excl)
knitr::kable(describe, format = "simple")
|  | mean | sd | median | min | max | range |
|---|---|---|---|---|---|---|
| fixed_acidity | 8.3196373 | 1.7410963 | 7.90000 | 4.60000 | 15.90000 | 11.30000 |
| volatile_acidity | 0.5278205 | 0.1790597 | 0.52000 | 0.12000 | 1.58000 | 1.46000 |
| citric_acid | 0.2709756 | 0.1948011 | 0.26000 | 0.00000 | 1.00000 | 1.00000 |
| residual_sugar | 2.5388055 | 1.4099281 | 2.20000 | 0.90000 | 15.50000 | 14.60000 |
| chlorides | 0.0874665 | 0.0470653 | 0.07900 | 0.01200 | 0.61100 | 0.59900 |
| free_sulfur_dioxide | 15.8749218 | 10.4601570 | 14.00000 | 1.00000 | 72.00000 | 71.00000 |
| total_sulfur_dioxide | 46.4677924 | 32.8953245 | 38.00000 | 6.00000 | 289.00000 | 283.00000 |
| density | 0.9967467 | 0.0018873 | 0.99675 | 0.99007 | 1.00369 | 0.01362 |
| ph | 3.3111132 | 0.1543865 | 3.31000 | 2.74000 | 4.01000 | 1.27000 |
| sulphates | 0.6581488 | 0.1695070 | 0.62000 | 0.33000 | 2.00000 | 1.67000 |
| alcohol | 10.4229831 | 1.0656676 | 10.20000 | 8.40000 | 14.90000 | 6.50000 |
None of the distributions appear degenerate, but we confirm there are no near-zero variance predictors nonetheless.
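The chunk that produced the output below isn't shown; a check along these lines, using caret's nearZeroVar(), would yield it:
# TRUE if no predictor is flagged as near-zero variance.
nzv <- nearZeroVar(wine_qual_df |> select(-quality), saveMetrics = TRUE)
all(!nzv$nzv)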
[1] TRUE
There are indeed no near-zero variance predictors that we need to remove, so we visualize the distributions for all predictors below.
skip <- c("quality")
wine_qual_piv <- wine_qual_df |>
pivot_longer(cols = !all_of(skip), names_to = "PREDICTOR",
values_to = "VALUE")
p2 <- wine_qual_piv |>
ggplot(aes(x = VALUE, color = quality, fill = quality)) +
geom_histogram(data = subset(wine_qual_piv, quality == "Low"),
alpha = 0.5) +
geom_histogram(data = subset(wine_qual_piv, quality == "High"),
alpha = 0.5) +
scale_color_manual(values = qual_cols) +
scale_fill_manual(values = qual_cols) +
scale_y_continuous(labels = scales::comma) +
facet_wrap(PREDICTOR ~ ., ncol = 5, scales = "free_x") +
labs(title = "Distribution of Predictors",
y = "COUNT") +
theme(legend.position = "top")
p2
We can see that the means of alcohol and volatile_acidity differ between low-quality and high-quality wines. We can also see some skewed distributions, but we won't apply any transformations, since the focus of our analysis is the impact of reducing the feature space on model performance.
We check for correlations between the predictors and the response variable, as well as any predictor-predictor correlations. To reduce clutter, only correlations greater than 0.1 in absolute value are displayed.
plot_corr_range <- function(df, mn=0.1, mx=1.0, excl=c(NA)){
palette <- brewer.pal(n = 7, name = "BrBG")[c(1, 4, 7)]
tit = sprintf("Correlations Between %s and %s (Absolute Value)", mn, mx)
r <- model.matrix(~0+., data = df) |>
cor() |>
round(digits=2)
is.na(r) <- abs(r) > mx
is.na(r) <- abs(r) < mn
if (!all(is.na(excl))){
r <- as.data.frame(r) |>
select(-all_of(excl)) |>
filter(!rownames(r) %in% excl)
}
p <- r |>
ggcorrplot(show.diag = FALSE, type = "lower", lab = TRUE,
lab_size = 3, tl.cex = 10, tl.srt = 90,
colors = palette, outline.color = "white") +
labs(title = tit) +
theme(plot.title.position = "plot")
p
}
excl <- c("qualityLow")
p3 <- plot_corr_range(df = wine_qual_df, excl = excl)
p3
We see that alcohol has the largest positive correlation with high-quality wines, and volatile_acidity has the largest negative correlation with high-quality wines. To a lesser degree, fixed_acidity, citric_acid, and sulphates are also positively correlated with high-quality wines, and chlorides, total_sulfur_dioxide, and density are also negatively correlated with high-quality wines.
We also see high correlations between some of our predictors:
- fixed_acidity and each of citric_acid, density, and ph
- total_sulfur_dioxide and free_sulfur_dioxide
Note that we won't eliminate any predictors from consideration, for the same reason we aren't applying any transformations: we are primarily interested in the effects of reducing the feature space from all the original features to just principal components.
Data Preparation
We split the data into train and test sets.
set.seed(1006)
sample <- sample(nrow(wine_qual_df),
round(nrow(wine_qual_df) * 0.7),
replace = FALSE)
train_df <- wine_qual_df[sample, ]
test_df <- wine_qual_df[-sample, ]
We confirm the class distributions are similar in the original, train, and test sets.
dist1 <- as.data.frame(round(prop.table(table(select(wine_qual_df, quality))), 2))
colnames(dist1) <- c("Quality", "Original Freq")
dist2 <- as.data.frame(round(prop.table(table(select(train_df, quality))), 2))
colnames(dist2) <- c("Quality", "Train Freq")
dist3 <- as.data.frame(round(prop.table(table(select(test_df, quality))), 2))
colnames(dist3) <- c("Quality", "Test Freq")
class_dist <- dist1 |>
left_join(dist2, by = join_by(Quality)) |>
left_join(dist3, by = join_by(Quality))
kable(class_dist, format = "simple")
| Quality | Original Freq | Train Freq | Test Freq |
|---|---|---|---|
| Low | 0.47 | 0.47 | 0.45 |
| High | 0.53 | 0.53 | 0.55 |
The class distributions are all pretty similar.
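As an aside, the split above is simple random sampling, so the similarity is by chance. If we wanted to guarantee it, caret's createDataPartition() samples within each class. A sketch of that alternative (not the split actually used here):
# Stratified 70/30 split: class proportions are preserved by construction.
set.seed(1006)
idx <- createDataPartition(wine_qual_df$quality, p = 0.7, list = FALSE)
strat_train_df <- wine_qual_df[idx, ]
strat_test_df <- wine_qual_df[-idx, ]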
Model Building
We build our first set of models: a Random Forest model and a Support Vector Machine model using all the original features.
Random Forest Model (Original Features)
A summary of the relative importance of the original features to the Random Forest model we have trained is below.
fn <- "rf_orig.rds"
if (!file.exists(fn)){
rf_orig <- train(quality ~ ., data = train_df, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 3))
saveRDS(rf_orig, fn)
}else{
rf_orig <- readRDS(fn)
}
rf_orig_imp <- varImp(rf_orig, scale = TRUE)
rf_orig_imp <- rf_orig_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_orig_imp) <- cols
rf_orig_imp <- rf_orig_imp |>
arrange(desc(Importance))
kable(rf_orig_imp, format = "simple")
| Predictor | Importance |
|---|---|
| alcohol | 100.000000 |
| sulphates | 55.744989 |
| volatile_acidity | 49.290579 |
| total_sulfur_dioxide | 39.393053 |
| density | 28.842431 |
| fixed_acidity | 19.203067 |
| chlorides | 17.144795 |
| ph | 12.119840 |
| citric_acid | 7.937798 |
| free_sulfur_dioxide | 2.027924 |
| residual_sugar | 0.000000 |
Unsurprisingly, the Random Forest model estimates alcohol to be the most important feature for predicting the response variable, which is consistent with our correlation plot. The relative importance estimate for `residual_sugar` is 0, though note that varImp() with scale = TRUE rescales importances to a 0-100 range, so a value of 0 marks the least important predictor rather than one that was necessarily unused in every tree.
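If we wanted to confirm this, the unscaled importances are available from the same fitted object:
# Raw importances, before caret rescales them to a 0-100 range.
varImp(rf_orig, scale = FALSE)$importance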
Support Vector Machine Model (Original Features)
Below is a summary of the best Support Vector Machine model, using a radial basis kernel and the original features, that we arrived at during tuning:
ctrl <- tune.control(sampling = "cross", cross = 10, nrepeat = 5)
tune_grid <- list(cost = c(0.1, 1, 10, 100, 1000),
gamma = c(0.5, 1, 2, 3, 4))
fn <- "svm_orig.rds"
if (!file.exists(fn)){
svm_tune_orig <- tune(svm, quality ~ .,
data = train_df, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_orig <- svm_tune_orig$best.model
saveRDS(svm_orig, fn)
}else{
svm_orig <- readRDS(fn)
}
summarize_svm <- function(svm_model){
col1 <- c("call", "cost", "gamma", "num_classes", "classes",
"support_vectors_total", "support_vectors_split")
subset <- c("call", "cost", "gamma", "nclasses", "levels",
"tot.nSV", "nSV")
col2 <- svm_model[subset]
copy <- col2
for (i in 1:length(copy)){
if (is.vector(copy[[i]])){
col2[[i]] <- paste(col2[[i]], collapse = ", ")
}
}
summ <- as.data.frame(cbind(col1, col2))
rownames(summ) <- NULL
colnames(summ) <- c("Parameter", "Value")
summ
}
summ <- summarize_svm(svm_orig)
kable(summ, format = "simple")
| Parameter | Value |
|---|---|
| call | best.tune(METHOD = svm, train.x = quality ~ ., data = train_df, , ranges = tune_grid, tunecontrol = ctrl, kernel = “radial”) |
| cost | 1 |
| gamma | 0.5 |
| num_classes | 2 |
| classes | Low, High |
| support_vectors_total | 877 |
| support_vectors_split | 442, 435 |
It uses a cost of 1 and a gamma of 0.5.
Principal Component Analysis (PCA)
We now perform PCA. First, we check the mean and variance of all predictors to confirm we need to center and scale them.
mean_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, mean)), 2)
colnames(mean_df) <- c("Mean")
mean_df <- mean_df |>
rownames_to_column(var = "Predictor")
var_df <- round(as.data.frame(apply(wine_qual_df |> select(-quality), 2, var)), 2)
colnames(var_df) <- c("Variance")
var_df <- var_df |>
rownames_to_column(var = "Predictor")
mean_var_df <- mean_df |>
left_join(var_df, by = join_by(Predictor))
kable(mean_var_df, format = "simple")
| Predictor | Mean | Variance |
|---|---|---|
| fixed_acidity | 8.32 | 3.03 |
| volatile_acidity | 0.53 | 0.03 |
| citric_acid | 0.27 | 0.04 |
| residual_sugar | 2.54 | 1.99 |
| chlorides | 0.09 | 0.00 |
| free_sulfur_dioxide | 15.87 | 109.41 |
| total_sulfur_dioxide | 46.47 | 1082.10 |
| density | 1.00 | 0.00 |
| ph | 3.31 | 0.02 |
| sulphates | 0.66 | 0.03 |
| alcohol | 10.42 | 1.14 |
Both free_sulfur_dioxide and total_sulfur_dioxide exhibit much larger variances than the other variables, so we center and scale all predictors to prevent those two from dominating the principal components (prcomp() centers by default; we request scaling with scale = TRUE). Below is the rotation, i.e., the matrix of variable loadings.
pca.out <- prcomp(wine_qual_df |> select(-quality), scale = TRUE)
kable(round(as.data.frame(pca.out$rotation), 2), format = "simple")
|  | PC1 | PC2 | PC3 | PC4 | PC5 | PC6 | PC7 | PC8 | PC9 | PC10 | PC11 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| fixed_acidity | 0.49 | -0.11 | 0.12 | -0.23 | 0.08 | -0.10 | 0.35 | -0.18 | 0.19 | 0.25 | 0.64 |
| volatile_acidity | -0.24 | 0.27 | 0.45 | 0.08 | -0.22 | -0.41 | 0.53 | -0.08 | -0.13 | -0.37 | 0.00 |
| citric_acid | 0.46 | -0.15 | -0.24 | -0.08 | 0.06 | -0.07 | -0.11 | -0.38 | -0.38 | -0.62 | -0.07 |
| residual_sugar | 0.15 | 0.27 | -0.10 | -0.37 | -0.73 | -0.05 | -0.29 | 0.30 | 0.01 | -0.09 | 0.18 |
| chlorides | 0.21 | 0.15 | 0.09 | 0.67 | -0.25 | -0.30 | -0.37 | -0.36 | 0.11 | 0.22 | 0.05 |
| free_sulfur_dioxide | -0.04 | 0.51 | -0.43 | -0.04 | 0.16 | 0.01 | 0.12 | -0.20 | 0.64 | -0.25 | -0.05 |
| total_sulfur_dioxide | 0.02 | 0.57 | -0.32 | -0.03 | 0.22 | -0.14 | 0.09 | 0.02 | -0.59 | 0.37 | 0.07 |
| density | 0.40 | 0.23 | 0.34 | -0.17 | -0.16 | 0.39 | 0.17 | -0.24 | 0.02 | 0.24 | -0.57 |
| ph | -0.44 | 0.01 | -0.06 | 0.00 | -0.27 | 0.52 | 0.03 | -0.56 | -0.17 | 0.01 | 0.34 |
| sulphates | 0.24 | -0.04 | -0.28 | 0.55 | -0.23 | 0.38 | 0.45 | 0.37 | -0.06 | -0.11 | 0.07 |
| alcohol | -0.11 | -0.39 | -0.47 | -0.12 | -0.35 | -0.36 | 0.33 | -0.22 | 0.04 | 0.30 | -0.31 |
We generate a scree plot so that we can see the variance explained by each principal component.
scree_plot_df <- as.data.frame(pca.out$sdev^2 / sum(pca.out$sdev^2))
colnames(scree_plot_df) <- c("var_explained")
scree_plot_df <- scree_plot_df |>
rowid_to_column(var = "pc")
p4 <- scree_plot_df |>
ggplot(aes(x = pc, y = var_explained)) +
geom_line() +
geom_point(size = 2) +
scale_x_continuous(limits = c(1, 11), breaks = seq(1, 11, 1)) +
scale_y_continuous(limits = c(0, 1), breaks = seq(0, 1, 0.1)) +
labs(x = "Principal Component",
y = "Variance Explained",
title = "Scree Plot")
p4
The first principal component explains a little less than 30 percent of the variance in our data, and the second explains a little less than 20 percent. Since those values are relatively low, we don't expect models built using only the first two principal components to be sufficient, so we will go up to eight principal components.
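To make that cutoff concrete, we can compute the cumulative proportion of variance explained directly from the fitted PCA object:
# Cumulative variance explained by the first k principal components.
round(cumsum(pca.out$sdev^2) / sum(pca.out$sdev^2), 3)
For this dataset, the first two components together explain under half the total variance, while the first eight capture roughly 95 percent, which supports stopping at eight.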
For the same reason, we don’t expect the biplot of the first two principal components we generate below to show great separation between high-quality and low-quality wines.
p5 <- pca.out |>
ggbiplot(obs.scale = 1, var.scale = 1,
groups = wine_qual_df$quality,
ellipse = TRUE,
ellipse.alpha = 0.0) +
theme(legend.direction = "horizontal",
legend.position = "top") +
labs(fill = "Quality",
color = "Quality",
title = "Biplot of First Two Principal Components") +
scale_color_discrete(type = qual_cols)
p5
Indeed, there is a lot of overlap between low-quality and high-quality wines when plotted on just the first two principal components, but we do see some separation between the classes, which is promising.
We create new versions of the train and test sets using the features derived via PCA instead of the original features.
train_df_pca <- as.data.frame(predict(pca.out, train_df)) |>
bind_cols(train_df |> select(quality)) |>
relocate(quality, .before = "PC1")
test_df_pca <- as.data.frame(predict(pca.out, test_df)) |>
bind_cols(test_df |> select(quality)) |>
relocate(quality, .before = "PC1")
Then we build our second set of models: Random Forest and Support Vector Machine models using the first two, first four, first six, or first eight principal components.
Random Forest Model (Principal Components)
First, we train four Random Forest models using increasing numbers of principal components.
fn <- "rf_2pc.rds"
if (!file.exists(fn)){
rf_2pc <- train(quality ~ PC1 + PC2,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 1))
saveRDS(rf_2pc, fn)
}else{
rf_2pc <- readRDS(fn)
}
fn <- "rf_4pc.rds"
if (!file.exists(fn)){
rf_4pc <- train(quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_4pc, fn)
}else{
rf_4pc <- readRDS(fn)
}
fn <- "rf_6pc.rds"
if (!file.exists(fn)){
rf_6pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_6pc, fn)
}else{
rf_6pc <- readRDS(fn)
}
fn <- "rf_8pc.rds"
if (!file.exists(fn)){
rf_8pc <- train(quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, metric = "Accuracy",
method = "rf", trControl = trainControl(method = "none"),
tuneGrid = expand.grid(.mtry = 2))
saveRDS(rf_8pc, fn)
}else{
rf_8pc <- readRDS(fn)
}
A summary of the relative importance of the principal components to each of these trained Random Forest models is below.
rf_2pc_imp <- varImp(rf_2pc, scale = TRUE)
rf_2pc_imp <- rf_2pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_2pc_imp) <- cols
rf_2pc_imp <- rf_2pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (2 Principal Components)") |>
select(-Importance)
rf_4pc_imp <- varImp(rf_4pc, scale = TRUE)
rf_4pc_imp <- rf_4pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_4pc_imp) <- cols
rf_4pc_imp <- rf_4pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (4 Principal Components)") |>
select(-Importance)
rf_6pc_imp <- varImp(rf_6pc, scale = TRUE)
rf_6pc_imp <- rf_6pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_6pc_imp) <- cols
rf_6pc_imp <- rf_6pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (6 Principal Components)") |>
select(-Importance)
rf_8pc_imp <- varImp(rf_8pc, scale = TRUE)
rf_8pc_imp <- rf_8pc_imp$importance |>
rownames_to_column()
cols <- c("Predictor", "Importance")
colnames(rf_8pc_imp) <- cols
rf_8pc_imp <- rf_8pc_imp |>
arrange(desc(Importance)) |>
rowid_to_column(var = "Rank") |>
mutate(Model = "Random Forest (8 Principal Components)") |>
select(-Importance)
rf_pc_imp_df <- rf_2pc_imp |>
bind_rows(rf_4pc_imp, rf_6pc_imp, rf_8pc_imp) |>
pivot_wider(names_from = Rank, values_from = Predictor,
names_prefix = "Rank_")
kable(rf_pc_imp_df, format = "simple")
| Model | Rank_1 | Rank_2 | Rank_3 | Rank_4 | Rank_5 | Rank_6 | Rank_7 | Rank_8 |
|---|---|---|---|---|---|---|---|---|
| Random Forest (2 Principal Components) | PC2 | PC1 | NA | NA | NA | NA | NA | NA |
| Random Forest (4 Principal Components) | PC2 | PC3 | PC1 | PC4 | NA | NA | NA | NA |
| Random Forest (6 Principal Components) | PC2 | PC3 | PC1 | PC5 | PC4 | PC6 | NA | NA |
| Random Forest (8 Principal Components) | PC2 | PC3 | PC5 | PC1 | PC7 | PC4 | PC8 | PC6 |
All of these Random Forest models rank the second principal component as their most important feature, and most of them rank the third principal component as their second-most important feature. Only the Random Forest model built using only the first two principal components includes the first principal component in its top two features.
Support Vector Machine Model (Principal Components)
Next we train four Support Vector Machine models using increasing numbers of principal components. Summaries of the best radial-basis-kernel models found during tuning are below:
fn <- "svm_2pc.rds"
if (!file.exists(fn)){
svm_tune_2pc <- tune(svm, quality ~ PC1 + PC2,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_2pc <- svm_tune_2pc$best.model
saveRDS(svm_2pc, fn)
}else{
svm_2pc <- readRDS(fn)
}
fn <- "svm_4pc.rds"
if (!file.exists(fn)){
svm_tune_4pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_4pc <- svm_tune_4pc$best.model
saveRDS(svm_4pc, fn)
}else{
svm_4pc <- readRDS(fn)
}
fn <- "svm_6pc.rds"
if (!file.exists(fn)){
svm_tune_6pc <- tune(svm, quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_6pc <- svm_tune_6pc$best.model
saveRDS(svm_6pc, fn)
}else{
svm_6pc <- readRDS(fn)
}
fn <- "svm_8pc.rds"
if (!file.exists(fn)){
svm_tune_8pc <- tune(svm,
quality ~ PC1 + PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
data = train_df_pca, kernel = "radial",
ranges = tune_grid, tunecontrol = ctrl)
svm_8pc <- svm_tune_8pc$best.model
saveRDS(svm_8pc, fn)
}else{
svm_8pc <- readRDS(fn)
}
summ_2pc <- summarize_svm(svm_2pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (2 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_4pc <- summarize_svm(svm_4pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (4 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_6pc <- summarize_svm(svm_6pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (6 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ_8pc <- summarize_svm(svm_8pc) |>
filter(Parameter != "call") |>
mutate(Model = "Support Vector Machine (8 Principal Components)") |>
pivot_wider(names_from = Parameter, values_from = Value)
summ <- summ_2pc |>
bind_rows(summ_4pc, summ_6pc, summ_8pc)
kable(summ, format = "simple")
| Model | cost | gamma | num_classes | classes | support_vectors_total | support_vectors_split |
|---|---|---|---|---|---|---|
| Support Vector Machine (2 Principal Components) | 10 | 2 | 2 | Low, High | 810 | 407, 403 |
| Support Vector Machine (4 Principal Components) | 0.1 | 0.5 | 2 | Low, High | 824 | 413, 411 |
| Support Vector Machine (6 Principal Components) | 1 | 3 | 2 | Low, High | 1017 | 527, 490 |
| Support Vector Machine (8 Principal Components) | 100 | 1 | 2 | Low, High | 846 | 445, 401 |
The tuned models use costs ranging from 0.1 to 100 and gamma values ranging from 0.5 to 3.
Model Evaluation
We make predictions on the test data using all models, and we construct confusion matrices and calculate a variety of performance metrics for them.
First, we look at the confusion matrices for each of the models.
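The per-model blocks below are deliberately explicit for readability, but the same logic could be factored into a helper along these lines (a sketch of the pattern, not the code actually run):
# Hypothetical helper: predictions plus a labeled confusion-matrix
# data frame for one model/test-set pair.
tidy_cm <- function(model, newdata, model_label){
  preds <- predict(model, newdata)
  cm <- confusionMatrix(preds, newdata$quality, positive = "High")
  df <- as.data.frame(cm$table)
  df$Reference <- factor(df$Reference, levels = rev(levels(df$Reference)))
  df |>
    mutate(Label = case_when(
      Prediction == "Low" & Reference == "Low" ~ "TN",
      Prediction == "High" & Reference == "High" ~ "TP",
      Prediction == "Low" & Reference == "High" ~ "FN",
      Prediction == "High" & Reference == "Low" ~ "FP"),
      Model = model_label)
}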
#Random Forest Orig: predictions/confusion matrix
pred_rf_orig <- predict(rf_orig, test_df)
rf_orig_cm_complete <- confusionMatrix(pred_rf_orig, test_df$quality,
positive = "High")
rf_orig_cm <- as.data.frame(rf_orig_cm_complete$table)
rf_orig_cm$Reference <- factor(rf_orig_cm$Reference,
levels = rev(levels(rf_orig_cm$Reference)))
rf_orig_cm <- rf_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (Orig. Feat.)")
#Random Forest 2pc: predictions/confusion matrix
pred_rf_2pc <- predict(rf_2pc, test_df_pca)
rf_2pc_cm_complete <- confusionMatrix(pred_rf_2pc, test_df_pca$quality,
positive = "High")
rf_2pc_cm <- as.data.frame(rf_2pc_cm_complete$table)
rf_2pc_cm$Reference <- factor(rf_2pc_cm$Reference,
levels = rev(levels(rf_2pc_cm$Reference)))
rf_2pc_cm <- rf_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (2 PC)")
#Random Forest 4pc: predictions/confusion matrix
pred_rf_4pc <- predict(rf_4pc, test_df_pca)
rf_4pc_cm_complete <- confusionMatrix(pred_rf_4pc, test_df_pca$quality,
positive = "High")
rf_4pc_cm <- as.data.frame(rf_4pc_cm_complete$table)
rf_4pc_cm$Reference <- factor(rf_4pc_cm$Reference,
levels = rev(levels(rf_4pc_cm$Reference)))
rf_4pc_cm <- rf_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (4 PC)")
#Random Forest 6pc: predictions/confusion matrix
pred_rf_6pc <- predict(rf_6pc, test_df_pca)
rf_6pc_cm_complete <- confusionMatrix(pred_rf_6pc, test_df_pca$quality,
positive = "High")
rf_6pc_cm <- as.data.frame(rf_6pc_cm_complete$table)
rf_6pc_cm$Reference <- factor(rf_6pc_cm$Reference,
levels = rev(levels(rf_6pc_cm$Reference)))
rf_6pc_cm <- rf_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (6 PC)")
#Random Forest 8pc: predictions/confusion matrix
pred_rf_8pc <- predict(rf_8pc, test_df_pca)
rf_8pc_cm_complete <- confusionMatrix(pred_rf_8pc, test_df_pca$quality,
positive = "High")
rf_8pc_cm <- as.data.frame(rf_8pc_cm_complete$table)
rf_8pc_cm$Reference <- factor(rf_8pc_cm$Reference,
levels = rev(levels(rf_8pc_cm$Reference)))
rf_8pc_cm <- rf_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "RF (8C)")
#Support Vector Machine Orig: predictions/confusion matrix
pred_svm_orig <- predict(svm_orig, test_df, type = "class")
svm_orig_cm_complete <- confusionMatrix(pred_svm_orig, test_df$quality,
positive = "High")
svm_orig_cm <- as.data.frame(svm_orig_cm_complete$table)
svm_orig_cm$Reference <- factor(svm_orig_cm$Reference,
levels = rev(levels(svm_orig_cm$Reference)))
svm_orig_cm <- svm_orig_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (Orig. Feat.)")
#Support Vector Machine 2pc: predictions/confusion matrix
pred_svm_2pc <- predict(svm_2pc, test_df_pca, type = "class")
svm_2pc_cm_complete <- confusionMatrix(pred_svm_2pc, test_df_pca$quality,
positive = "High")
svm_2pc_cm <- as.data.frame(svm_2pc_cm_complete$table)
svm_2pc_cm$Reference <- factor(svm_2pc_cm$Reference,
levels = rev(levels(svm_2pc_cm$Reference)))
svm_2pc_cm <- svm_2pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (2 PC)")
#Support Vector Machine 4pc: predictions/confusion matrix
pred_svm_4pc <- predict(svm_4pc, test_df_pca, type = "class")
svm_4pc_cm_complete <- confusionMatrix(pred_svm_4pc, test_df_pca$quality,
positive = "High")
svm_4pc_cm <- as.data.frame(svm_4pc_cm_complete$table)
svm_4pc_cm$Reference <- factor(svm_4pc_cm$Reference,
levels = rev(levels(svm_4pc_cm$Reference)))
svm_4pc_cm <- svm_4pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (4 PC)")
#Support Vector Machine 6pc: predictions/confusion matrix
pred_svm_6pc <- predict(svm_6pc, test_df_pca, type = "class")
svm_6pc_cm_complete <- confusionMatrix(pred_svm_6pc, test_df_pca$quality,
positive = "High")
svm_6pc_cm <- as.data.frame(svm_6pc_cm_complete$table)
svm_6pc_cm$Reference <- factor(svm_6pc_cm$Reference,
levels = rev(levels(svm_6pc_cm$Reference)))
svm_6pc_cm <- svm_6pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (6 PC)")
#Support Vector Machine 8pc: predictions/confusion matrix
pred_svm_8pc <- predict(svm_8pc, test_df_pca, type = "class")
svm_8pc_cm_complete <- confusionMatrix(pred_svm_8pc, test_df_pca$quality,
positive = "High")
svm_8pc_cm <- as.data.frame(svm_8pc_cm_complete$table)
svm_8pc_cm$Reference <- factor(svm_8pc_cm$Reference,
levels = rev(levels(svm_8pc_cm$Reference)))
svm_8pc_cm <- svm_8pc_cm |>
mutate(
Label = case_when(
Prediction == "Low" & Reference == "Low" ~ "TN",
Prediction == "High" & Reference == "High" ~ "TP",
Prediction == "Low" & Reference == "High" ~ "FN",
Prediction == "High" & Reference == "Low" ~ "FP"),
Model = "SVM (8 PC)")
cm <- bind_rows(rf_orig_cm, rf_2pc_cm, rf_4pc_cm, rf_6pc_cm, rf_8pc_cm,
svm_orig_cm, svm_2pc_cm, svm_4pc_cm, svm_6pc_cm, svm_8pc_cm)
p6 <- cm |>
ggplot(aes(x = Reference, y = Prediction, fill = Freq)) +
geom_tile(col = "black") +
geom_text(aes(label = Freq)) +
geom_text(aes(label = Label), vjust = 2) +
scale_fill_gradient(low = "white", high = pal[1]) +
scale_x_discrete(position = "top") +
facet_wrap(Model ~ ., ncol = 5, strip.position = "bottom") +
labs(title = "Confusion Matrices for All Models") +
theme(axis.line.x = element_blank(),
axis.line.y = element_blank(),
axis.text.y = element_text(angle = 90, hjust = 0.5),
axis.ticks = element_blank(),
legend.position = "bottom",
strip.placement = "outside")
p6
Then we calculate the performance metrics for each of the models.
metrics <- as.data.frame(cbind(rbind(rf_orig_cm_complete$byClass,
rf_2pc_cm_complete$byClass,
rf_4pc_cm_complete$byClass,
rf_6pc_cm_complete$byClass,
rf_8pc_cm_complete$byClass,
svm_orig_cm_complete$byClass,
svm_2pc_cm_complete$byClass,
svm_4pc_cm_complete$byClass,
svm_6pc_cm_complete$byClass,
svm_8pc_cm_complete$byClass),
rbind(rf_orig_cm_complete$overall,
rf_2pc_cm_complete$overall,
rf_4pc_cm_complete$overall,
rf_6pc_cm_complete$overall,
rf_8pc_cm_complete$overall,
svm_orig_cm_complete$overall,
svm_2pc_cm_complete$overall,
svm_4pc_cm_complete$overall,
svm_6pc_cm_complete$overall,
svm_8pc_cm_complete$overall)))
rownames(metrics) <- c("Random Forest (Original Features)",
"Random Forest (2 Principal Components)",
"Random Forest (4 Principal Components)",
"Random Forest (6 Principal Components)",
"Random Forest (8 Principal Components)",
"Support Vector Machine (Original Features)",
"Support Vector Machine (2 Principal Components)",
"Support Vector Machine (4 Principal Components)",
"Support Vector Machine (6 Principal Components)",
"Support Vector Machine (8 Principal Components)")
keep <- c("Accuracy", "Precision", "Recall", "F1")
metrics <- metrics |>
select(all_of(keep)) |>
round(3)
kable(metrics, format = "simple")
|  | Accuracy | Precision | Recall | F1 |
|---|---|---|---|---|
| Random Forest (Original Features) | 0.810 | 0.814 | 0.848 | 0.831 |
| Random Forest (2 Principal Components) | 0.725 | 0.736 | 0.776 | 0.756 |
| Random Forest (4 Principal Components) | 0.794 | 0.808 | 0.817 | 0.813 |
| Random Forest (6 Principal Components) | 0.810 | 0.805 | 0.863 | 0.833 |
| Random Forest (8 Principal Components) | 0.812 | 0.824 | 0.837 | 0.830 |
| Support Vector Machine (Original Features) | 0.783 | 0.787 | 0.829 | 0.807 |
| Support Vector Machine (2 Principal Components) | 0.646 | 0.688 | 0.646 | 0.667 |
| Support Vector Machine (4 Principal Components) | 0.715 | 0.744 | 0.730 | 0.737 |
| Support Vector Machine (6 Principal Components) | 0.773 | 0.747 | 0.886 | 0.810 |
| Support Vector Machine (8 Principal Components) | 0.775 | 0.765 | 0.852 | 0.806 |
Conclusion
Both the Random Forest and Support Vector Machine models perform best on this dataset using the original features rather than principal components. That said, using only four principal components brings the Random Forest model close to the accuracy and F1 score of its original-features counterpart, and six components match it. The Support Vector Machine model requires more components, roughly six to eight, before it approaches the accuracy and F1 score of the original-features version.
Since we only had eleven numeric predictors to begin with, reducing the feature space wasn't strictly necessary here, but we now have a framework for performing PCA on datasets with far more features, where reducing the feature space matters much more.
Reducing the feature space often carries real benefits, especially when businesses track many attributes for the products they sell. Jeans come to mind as a product with lots of features (e.g., color, size, fit, style, stretch, button-fly versus zipper, whether they come distressed). If a product line like that can be represented with four or fewer principal components instead, you eliminate a lot of noise, your models run faster, and your predictions can be as good as or better than those from models that use every feature. It is also remarkable to see high-dimensional data reduced to two dimensions, even in a classification problem like ours, where we knew perfect separation of the wine quality classes wouldn't be visible in just two dimensions.