The predictors with the highest correlation are the Ri-K pair at 0.810 and Si-Ri at -0.542. Overall the other correlations seem to be weaker.
data(Glass)
pair_plot <- ggpairs(Glass[, -10])
pair_plot
It looks like most of the predictors show slight skewness, but some like Mg, K, Ca, Ba, and Fe are more heavily right-skewed.
Glass %>%
select(-Type) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
theme_minimal()
It looks like a lot of the predictors have outliers- patricularly the elements Al, Ba, Ca, K, Na, and Si.
Glass %>%
pivot_longer(cols = -Type, names_to = "variable", values_to = "value") %>%
ggplot(aes(x = variable, y = value)) +
geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Boxplots to Identify Outliers in Glass Dataset")
A quick way to mitigate skewness would be to use log transformation. Log transformations will bring the higher values closer to zero while not affecting the lower values too much.
Glass %>%
mutate(across(c(Na, Mg, K, Ca), log1p)) %>%
select(Na, Mg, K, Ca) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
ggplot(aes(x = value)) +
geom_histogram(bins = 30, fill = "green", alpha = 0.7) +
facet_wrap(~variable, scales = "free") +
theme_minimal()
data(Soybean)
Below are the frequency distributions. (Apologies for the cramped display, the commented out code displays all the plots more clearly but makes my presentation much longer! I also encountered some issues with some of the columns being ordered factors, I had to exclude them from the plot you see below, however I was able to plot them using the commented out loop).
The graphs show that all the variables have more than one unique value, and so we don’t see any zero variance degeneration here.
Soybean_cleaned <- Soybean %>%
mutate(across(everything(), as.factor))
freq_plot <- Soybean_cleaned %>%
select(-leaf.size, -germ, -plant.stand, -temp, -precip) %>%
pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
filter(!is.na(value)) %>% # Remove missing data for this step
group_by(variable, value) %>%
summarise(count = n()) %>%
ggplot(aes(x = value, y = count)) +
geom_bar(stat = "identity") +
facet_wrap(~ variable, scales = "free") +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
labs(title = "Frequency Distributions of Categorical Variables (Excluding Class and Date)",
x = "Value",
y = "Count")
# Display the plot
print(freq_plot)
# Soybean_cleaned <- Soybean %>%
# mutate(across(where(is.ordered), as.factor))
#
# categorical_vars <- names(Soybean_cleaned)
#
# for (var in categorical_vars) {
# if (is.factor(Soybean_cleaned[[var]])) {
# p <- ggplot(Soybean_cleaned, aes_string(x = var)) +
# geom_bar() +
# ggtitle(paste("Frequency of", var)) +
# theme(axis.text.x = element_text(angle = 90, hjust = 1))
# print(p)
# }
# }
Since there are no zero variance predictors, we should instead look at near-zero predictors, ie predictors whose most frequent value makes up more than 90% of the distribution and whose number of unique values is less than 5.
# identify near-zero variance predictors
find_near_zero_variance <- function(data, freq_cutoff = 90, unique_cutoff = 5) {
stats <- data.frame(
variable = names(data),
n_unique = sapply(data, function(x) length(unique(x))), # Number of unique values
max_freq = sapply(data, function(x) max(table(x)) / length(x) * 100) # Max frequency percentage
)
near_zero_variance <- stats %>%
filter(n_unique <= unique_cutoff & max_freq >= freq_cutoff)
return(near_zero_variance)
}
near_zero_variance <- find_near_zero_variance(Soybean)
print(near_zero_variance)
## variable n_unique max_freq
## mycelium mycelium 3 93.55783
## sclerotia sclerotia 3 91.50805
We can see that mycelium and sclerotia are two such predictors, we can probably remove them. We can change the thresholds above to be more or less strict with regards to which predictors we remove.
Ive decided to remove the predictors with more than 80& missing data. I then imputed the missing data with the most frequent value using mode since these are categorical variables.
Soybean_cleaned <- Soybean %>%
select(where(~ mean(is.na(.)) <= 0.20))
impute_mode <- function(x) {
if (is.factor(x)) {
return(factor(replace(x, is.na(x), names(sort(table(x), decreasing = TRUE))[1])))
} else {
return(x)
}
}
Soybean_imputed <- Soybean %>%
mutate(across(everything(), impute_mode))
head(Soybean_imputed, n=10)
## Class date plant.stand precip temp hail crop.hist area.dam
## 1 diaporthe-stem-canker 6 0 2 1 0 1 1
## 2 diaporthe-stem-canker 4 0 2 1 0 2 0
## 3 diaporthe-stem-canker 3 0 2 1 0 1 0
## 4 diaporthe-stem-canker 3 0 2 1 0 1 0
## 5 diaporthe-stem-canker 6 0 2 1 0 2 0
## 6 diaporthe-stem-canker 5 0 2 1 0 3 0
## 7 diaporthe-stem-canker 5 0 2 1 0 2 0
## 8 diaporthe-stem-canker 4 0 2 1 1 1 0
## 9 diaporthe-stem-canker 6 0 2 1 0 3 0
## 10 diaporthe-stem-canker 4 0 2 1 0 2 0
## sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1 1 0 0 1 1 0 2 2
## 2 2 1 1 1 1 0 2 2
## 3 2 1 2 1 1 0 2 2
## 4 2 0 1 1 1 0 2 2
## 5 1 0 2 1 1 0 2 2
## 6 1 0 1 1 1 0 2 2
## 7 1 1 0 1 1 0 2 2
## 8 1 0 2 1 1 0 2 2
## 9 1 1 1 1 1 0 2 2
## 10 2 0 2 1 1 0 2 2
## leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1 0 0 0 1 1 3 1
## 2 0 0 0 1 0 3 1
## 3 0 0 0 1 0 3 0
## 4 0 0 0 1 0 3 0
## 5 0 0 0 1 0 3 1
## 6 0 0 0 1 0 3 0
## 7 0 0 0 1 1 3 1
## 8 0 0 0 1 0 3 1
## 9 0 0 0 1 0 3 1
## 10 0 0 0 1 0 3 1
## fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1 1 1 0 0 0 0
## 2 1 1 0 0 0 0
## 3 1 1 0 0 0 0
## 4 1 1 0 0 0 0
## 5 1 1 0 0 0 0
## 6 1 1 0 0 0 0
## 7 1 1 0 0 0 0
## 8 1 1 0 0 0 0
## 9 1 1 0 0 0 0
## 10 1 1 0 0 0 0
## fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1 4 0 0 0 0 0 0
## 2 4 0 0 0 0 0 0
## 3 4 0 0 0 0 0 0
## 4 4 0 0 0 0 0 0
## 5 4 0 0 0 0 0 0
## 6 4 0 0 0 0 0 0
## 7 4 0 0 0 0 0 0
## 8 4 0 0 0 0 0 0
## 9 4 0 0 0 0 0 0
## 10 4 0 0 0 0 0 0