3.1

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The predictors with the highest correlation are the Ri-K pair at 0.810 and Si-Ri at -0.542. Overall the other correlations seem to be weaker.

data(Glass)
pair_plot <- ggpairs(Glass[, -10]) 
pair_plot

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

It looks like most of the predictors show slight skewness, but some like Mg, K, Ca, Ba, and Fe are more heavily right-skewed.

Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

It looks like a lot of the predictors have outliers- patricularly the elements Al, Ba, Ca, K, Na, and Si.

Glass %>%
  pivot_longer(cols = -Type, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Boxplots to Identify Outliers in Glass Dataset")

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

A quick way to mitigate skewness would be to use log transformation. Log transformations will bring the higher values closer to zero while not affecting the lower values too much.

Glass %>%
  mutate(across(c(Na, Mg, K, Ca), log1p)) %>%
  select(Na, Mg, K, Ca) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "green", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

Below are the frequency distributions. (Apologies for the cramped display, the commented out code displays all the plots more clearly but makes my presentation much longer! I also encountered some issues with some of the columns being ordered factors, I had to exclude them from the plot you see below, however I was able to plot them using the commented out loop).

The graphs show that all the variables have more than one unique value, and so we don’t see any zero variance degeneration here.

Soybean_cleaned <- Soybean %>%
  mutate(across(everything(), as.factor))

freq_plot <- Soybean_cleaned %>%
  select(-leaf.size, -germ, -plant.stand, -temp, -precip) %>%  
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%  # Remove missing data for this step
  group_by(variable, value) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = value, y = count)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Frequency Distributions of Categorical Variables (Excluding Class and Date)", 
       x = "Value", 
       y = "Count")

# Display the plot
print(freq_plot)

# Soybean_cleaned <- Soybean %>%
#   mutate(across(where(is.ordered), as.factor))
# 
# categorical_vars <- names(Soybean_cleaned)
# 
# for (var in categorical_vars) {
#   if (is.factor(Soybean_cleaned[[var]])) {
#     p <- ggplot(Soybean_cleaned, aes_string(x = var)) +
#       geom_bar() +
#       ggtitle(paste("Frequency of", var)) +
#       theme(axis.text.x = element_text(angle = 90, hjust = 1))
#     print(p)
#   }
# }

Since there are no zero variance predictors, we should instead look at near-zero predictors, ie predictors whose most frequent value makes up more than 90% of the distribution and whose number of unique values is less than 5.

# identify near-zero variance predictors
find_near_zero_variance <- function(data, freq_cutoff = 90, unique_cutoff = 5) {
  stats <- data.frame(
    variable = names(data),
    n_unique = sapply(data, function(x) length(unique(x))),  # Number of unique values
    max_freq = sapply(data, function(x) max(table(x)) / length(x) * 100)  # Max frequency percentage
  )
  
  near_zero_variance <- stats %>%
    filter(n_unique <= unique_cutoff & max_freq >= freq_cutoff)
  
  return(near_zero_variance)
}

near_zero_variance <- find_near_zero_variance(Soybean)
print(near_zero_variance)
##            variable n_unique max_freq
## mycelium   mycelium        3 93.55783
## sclerotia sclerotia        3 91.50805

We can see that mycelium and sclerotia are two such predictors, we can probably remove them. We can change the thresholds above to be more or less strict with regards to which predictors we remove.

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Ive decided to remove the predictors with more than 80& missing data. I then imputed the missing data with the most frequent value using mode since these are categorical variables.

Soybean_cleaned <- Soybean %>%
  select(where(~ mean(is.na(.)) <= 0.20))

impute_mode <- function(x) {
  if (is.factor(x)) {
    return(factor(replace(x, is.na(x), names(sort(table(x), decreasing = TRUE))[1])))
  } else {
    return(x)
  }
}

Soybean_imputed <- Soybean %>%
  mutate(across(everything(), impute_mode))

head(Soybean_imputed, n=10)
##                    Class date plant.stand precip temp hail crop.hist area.dam
## 1  diaporthe-stem-canker    6           0      2    1    0         1        1
## 2  diaporthe-stem-canker    4           0      2    1    0         2        0
## 3  diaporthe-stem-canker    3           0      2    1    0         1        0
## 4  diaporthe-stem-canker    3           0      2    1    0         1        0
## 5  diaporthe-stem-canker    6           0      2    1    0         2        0
## 6  diaporthe-stem-canker    5           0      2    1    0         3        0
## 7  diaporthe-stem-canker    5           0      2    1    0         2        0
## 8  diaporthe-stem-canker    4           0      2    1    1         1        0
## 9  diaporthe-stem-canker    6           0      2    1    0         3        0
## 10 diaporthe-stem-canker    4           0      2    1    0         2        0
##    sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1      1        0    0            1      1         0         2         2
## 2      2        1    1            1      1         0         2         2
## 3      2        1    2            1      1         0         2         2
## 4      2        0    1            1      1         0         2         2
## 5      1        0    2            1      1         0         2         2
## 6      1        0    1            1      1         0         2         2
## 7      1        1    0            1      1         0         2         2
## 8      1        0    2            1      1         0         2         2
## 9      1        1    1            1      1         0         2         2
## 10     2        0    2            1      1         0         2         2
##    leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1            0         0         0    1       1            3             1
## 2            0         0         0    1       0            3             1
## 3            0         0         0    1       0            3             0
## 4            0         0         0    1       0            3             0
## 5            0         0         0    1       0            3             1
## 6            0         0         0    1       0            3             0
## 7            0         0         0    1       1            3             1
## 8            0         0         0    1       0            3             1
## 9            0         0         0    1       0            3             1
## 10           0         0         0    1       0            3             1
##    fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1                1         1        0            0         0          0
## 2                1         1        0            0         0          0
## 3                1         1        0            0         0          0
## 4                1         1        0            0         0          0
## 5                1         1        0            0         0          0
## 6                1         1        0            0         0          0
## 7                1         1        0            0         0          0
## 8                1         1        0            0         0          0
## 9                1         1        0            0         0          0
## 10               1         1        0            0         0          0
##    fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1            4    0           0             0         0          0     0
## 2            4    0           0             0         0          0     0
## 3            4    0           0             0         0          0     0
## 4            4    0           0             0         0          0     0
## 5            4    0           0             0         0          0     0
## 6            4    0           0             0         0          0     0
## 7            4    0           0             0         0          0     0
## 8            4    0           0             0         0          0     0
## 9            4    0           0             0         0          0     0
## 10           4    0           0             0         0          0     0