3.1 and 3.2 in Kuhn & Johnson - Applied Predictive Modeling
library(caret)
library(corrplot)
library(dplyr)
library(e1071)
library(fable)
library(fpp3)
library(ggplot2)
library(GGally)
library(mlbench)
library(MASS)
library(patchwork)
library(tsibble)
library(tidyr)
The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consists of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
glass_predictors <- Glass[, -10]  # Exclude the Type outcome
# Distribution: histogram of each predictor
par(mfrow = c(2, 5))
for (i in 1:ncol(glass_predictors)) {
  title <- paste(colnames(glass_predictors)[i], "Distribution")
  hist(glass_predictors[, i], xlab = colnames(glass_predictors)[i], main = title)}
# Correlation
glass_predictors %>%
cor() %>%
corrplot::corrplot(method = "number", diag = FALSE, type = 'lower')
Very few of the predictor distributions resemble a normal distribution: most show clear skew, and outliers are visible right away for almost every element. The correlation plot also reveals relationships between predictors, most notably a strong positive correlation between the refractive index (RI) and Ca.
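To make the relationships between predictors explicit, the pairwise correlations can also be ranked; a minimal base-R sketch using the glass_predictors object defined above:
# Rank the pairwise correlations (complements the corrplot above)
cor_mat <- cor(glass_predictors)
cor_mat[upper.tri(cor_mat, diag = TRUE)] <- NA       # keep each pair once
cor_pairs <- subset(as.data.frame(as.table(cor_mat)), !is.na(Freq))
head(cor_pairs[order(-abs(cor_pairs$Freq)), ], 5)    # strongest relationships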
(b) Do there appear to be any outliers in the data? Are any predictors skewed?
par(mfrow = c(2,5))
# Box plot to show outliers
for (i in 1:ncol(glass_predictors)) {
title <- paste(colnames(glass_predictors)[i], "Boxplot", sep = " ")
boxplot(glass_predictors[,i], main = title, ylab = colnames(glass_predictors)[i])}
# Checking skewness
skew_values <- apply(glass_predictors, 2, skewness)
skew_summary <- data.frame(Predictor = colnames(glass_predictors), Skewness = skew_values)
print(skew_summary)
## Predictor Skewness
## RI RI 1.6027151
## Na Na 0.4478343
## Mg Mg -1.1364523
## Al Al 0.8946104
## Si Si -0.7202392
## K K 6.4600889
## Ca Ca 2.0184463
## Ba Ba 3.3686800
## Fe Fe 1.7298107
Boxplots are useful here because they flag as individual dots any point lying more than 1.5 times the interquartile range beyond the first or third quartile. Every predictor except magnesium (Mg) has at least one flagged outlier. The skewness values confirm that most variables are skewed, with potassium (K) by far the most extreme (6.46), while Na, Si, and Al have the smallest absolute skewness and the most nearly normal-looking histograms.
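For reference, here is the rule the boxplots actually apply, worked through for potassium (K) as an example (base R only, using the glass_predictors object above):
# A point is flagged as an outlier if it lies more than 1.5 * IQR beyond Q1 or Q3
q <- quantile(glass_predictors$K, c(0.25, 0.75))
k_iqr <- diff(q)
lower_fence <- q[1] - 1.5 * k_iqr
upper_fence <- q[2] + 1.5 * k_iqr
sum(glass_predictors$K < lower_fence | glass_predictors$K > upper_fence)  # number of flagged points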
(c) Are there any relevant transformations of one or more predictors that might improve the classification model?
# Box-Cox, center, and scale (Ch. 3.2)
glass_normalized <- preProcess(glass_predictors, method = c("BoxCox", "center", "scale"))
glass_normalized_predict <- predict(glass_normalized, glass_predictors)
# Plot
glass_normalized_predict %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  ggtitle("Predictor Distributions - Box-Cox Transformed") +
  facet_wrap(~ key, scales = "free")
# Box-plot after transform
par(mfrow = c(2,5))
for (i in 1:ncol(glass_normalized_predict)) {
title <- paste(colnames(glass_normalized_predict)[i], "Box-Cox Transformed Boxplot", sep = " ")
boxplot(glass_normalized_predict[,i], main = title, ylab = colnames(glass_normalized_predict)[i])}
# PCA
glass_normalized_pca <- preProcess(glass_predictors, method = c("BoxCox","center", "scale", "pca"))
glass_normalized_pca_predict <- predict(glass_normalized_pca, glass_predictors)
# Plot
glass_normalized_pca_predict %>%
  pivot_longer(everything(), names_to = "key", values_to = "value") %>%
  ggplot(aes(value)) +
  geom_histogram() +
  ggtitle("PCA of Box-Cox Transformed Predictors") +
  facet_wrap(~ key, scales = "free")
Section 3.2 of our textbook describes centering and scaling: the average of each predictor is subtracted from its values, giving the predictor a mean of 0, and each value is then divided by the predictor's standard deviation, coercing the values to a common standard deviation of 1. The Box-Cox transformation helps normalize the data by estimating the optimal lambda from a family of power transformations (Pg. 32). After transforming, the distributions of most of the nine predictors are somewhat improved; the exceptions are Ba, Fe, K, and Mg, which contain zero values, so the Box-Cox transformation (defined only for strictly positive data) is not applied to them. Section 3.3 discusses principal component analysis, which seeks linear combinations of the predictors that capture the most variance. Together, these transformations help normalize the predictors and should improve the classification model.
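To check these claims against the fitted objects, here is a brief sketch using the glass_predictors and glass_normalized_predict objects above; caret's BoxCoxTrans() is shown for Ca only because Ca is strictly positive.
# Estimated Box-Cox lambda for one strictly positive predictor (illustrative)
BoxCoxTrans(glass_predictors$Ca)
# After centering and scaling, every column should have mean ~0 and sd ~1
round(colMeans(glass_normalized_predict), 3)
round(apply(glass_normalized_predict, 2, sd), 3)
# Cumulative proportion of variance captured by the principal components
summary(prcomp(glass_normalized_predict))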
The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
data(Soybean)
?Soybean
(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# Frequency table for each predictor variable
#str(Soybean)
# Names of the categorical columns (every column in Soybean is a factor);
# the 'Class' outcome is kept here and filtered out of the table further below
categorical_columns <- names(Soybean)[sapply(Soybean, is.factor)]
# Create empty data frame to store counts
count_table <- data.frame()
# Loop through the categorical columns
for (col in categorical_columns) {
# Get the frequency distribution for the current column
freq <- table(Soybean[[col]])
# Convert to data frame for neat arrangement
freq_df <- as.data.frame(freq)
# Assign column names
colnames(freq_df) <- c("Level", "Count")
# Add the variable name as a new column (row name)
freq_df$Variable <- col
# Bind this table to the main table
count_table <- rbind(count_table, freq_df)
}
# Reshape the data to a wide format
count_table_wide <- count_table %>%
pivot_wider(names_from = Level, values_from = Count, values_fill = list(Count = 0))
# I'll remove 'Class' from the count table, and its associated columns
count_table_wide <- count_table_wide %>%
filter(Variable != "Class")
# Define the allowed levels
allowed_levels <- as.character(0:6)
# Get the names of the columns (excluding the 'Variable' column)
columns_to_keep <- colnames(count_table_wide)
# Keep only columns that are in allowed_levels, and 'Variable'
columns_to_keep <- c("Variable", columns_to_keep[columns_to_keep %in% allowed_levels])
# Select the relevant columns from count_table_wide
count_table_wide <- count_table_wide[, columns_to_keep]
print(count_table_wide)
## # A tibble: 35 × 8
## Variable `0` `1` `2` `3` `4` `5` `6`
## <chr> <int> <int> <int> <int> <int> <int> <int>
## 1 date 26 75 93 118 131 149 90
## 2 plant.stand 354 293 0 0 0 0 0
## 3 precip 74 112 459 0 0 0 0
## 4 temp 80 374 199 0 0 0 0
## 5 hail 435 127 0 0 0 0 0
## 6 crop.hist 65 165 219 218 0 0 0
## 7 area.dam 123 227 145 187 0 0 0
## 8 sever 195 322 45 0 0 0 0
## 9 seed.tmt 305 222 35 0 0 0 0
## 10 germ 165 213 193 0 0 0 0
## # ℹ 25 more rows
# Plot all variables to see degeneracy frequency
plot_list <- list()
for (var in names(Soybean)) {
if (is.factor(Soybean[[var]])) {
p <- ggplot(Soybean, aes(x = .data[[var]])) +
geom_bar() +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(x = var, y = 'Count') +
ggtitle(var)
plot_list[[var]] <- p
}
}
plot_grid <- wrap_plots(plot_list, ncol = 4)
plot_grid
Our textbook mentions that “some models can be crippled by predictors with degenerate distributions. In these cases, there can be a significant improvement in model performance and/or stability without the problematic variables” (Pg. 44). A degenerate predictor is one that provides little meaningful information because it has almost no variation, i.e., it is nearly constant or varies very rarely. From the bar charts, most of the predictors have highly unbalanced distributions, with the bulk of observations in one or two levels, while only a few, such as date, crop.hist, and area.dam, are more evenly spread.
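The visual check can be formalized with caret's nearZeroVar(), which applies the frequency-ratio and percent-unique-values rules from Chapter 3; any predictors flagged by this short sketch would be candidates for removal before modeling.
# Flag near-zero-variance (degenerate) predictors using caret's default thresholds
nzv_metrics <- nearZeroVar(Soybean, saveMetrics = TRUE)
nzv_metrics[nzv_metrics$nzv, ]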
(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# Impute missing values with the class-wise mean of each field (predictors temporarily coded as numeric)
Soybean_imputed <- Soybean %>%
mutate(across(-Class, ~as.numeric(as.character(.)))) %>%
group_by(Class) %>%
mutate(across(everything(), ~if_else(is.na(.), mean(., na.rm = TRUE), .))) %>%
ungroup() %>%
mutate(across(-Class, ~factor(., ordered = TRUE)))
print("Original Soybean Data")
## [1] "Original Soybean Data"
print(sapply(Soybean, function(x) sum(is.na(x))))
## Class date plant.stand precip temp
## 0 1 36 38 30
## hail crop.hist area.dam sever seed.tmt
## 121 16 1 121 121
## germ plant.growth leaves leaf.halo leaf.marg
## 112 16 0 84 84
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 84 100 84 108 16
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 121 38 38 106 38
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 38 38 38 84 106
## seed mold.growth seed.discolor seed.size shriveling
## 92 92 106 92 106
## roots
## 31
print("Imputed Soybean Data ")
## [1] "Imputed Soybean Data "
print(sapply(Soybean_imputed, function(x) sum(is.na(x))))
## Class date plant.stand precip temp
## 0 0 0 0 0
## hail crop.hist area.dam sever seed.tmt
## 0 0 0 0 0
## germ plant.growth leaves leaf.halo leaf.marg
## 0 0 0 0 0
## leaf.size leaf.shread leaf.malf leaf.mild stem
## 0 0 0 0 0
## lodging stem.cankers canker.lesion fruiting.bodies ext.decay
## 0 0 0 0 0
## mycelium int.discolor sclerotia fruit.pods fruit.spots
## 0 0 0 0 0
## seed mold.growth seed.discolor seed.size shriveling
## 0 0 0 0 0
## roots
## 0
The textbook offers insight into missing data, and the pattern here suggests that some values are informatively missing: the missingness is concentrated in a handful of classes, so the fact that a value is missing is itself related to the outcome. Because the data set is fairly small, dropping rows with missing values would discard too much information. Alternatively, we could check whether any predictors are highly collinear or have near-zero variance and remove those predictors entirely. If class-specific imputation did not suffice, k-nearest-neighbor imputation or the MICE (Multiple Imputation by Chained Equations) method could be used instead.
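As an illustration of the k-nearest-neighbor option mentioned above, here is a minimal sketch using caret's knnImpute method, assuming the ordinal levels are coded as numbers (as in the class-mean approach) and an arbitrary k = 5. Note that knnImpute also centers and scales the result, and it would fail for any row whose predictors are all missing, so such rows would need separate handling.
# k-NN imputation sketch (numeric coding of the ordinal levels; k = 5 is arbitrary)
soy_numeric <- Soybean %>%
  mutate(across(-Class, ~ as.numeric(as.character(.))))
soy_pred <- soy_numeric[, names(soy_numeric) != "Class"]
knn_pp <- preProcess(soy_pred, method = "knnImpute", k = 5)
soy_knn_imputed <- predict(knn_pp, soy_pred)
sum(is.na(soy_knn_imputed))  # 0 if no row had every predictor missing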