624-HW4

3.1

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

The predictors with the highest correlation are the Ri-K pair at 0.810 and Si-Ri at -0.542. Overall the other correlations seem to be weaker.

data(Glass)

pair_plot <- ggpairs(Glass[, -10]) 
pair_plot

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

It looks like most of the predictors show slight skewness, but some like Mg, K, Ca, Ba, and Fe are more heavily right-skewed.

Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "blue", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

It looks like a lot of the predictors have outliers- patricularly the elements Al, Ba, Ca, K, Na, and Si.

Glass %>%
  pivot_longer(cols = -Type, names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = variable, y = value)) +
  geom_boxplot(outlier.colour = "red", outlier.shape = 16, outlier.size = 2) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Boxplots to Identify Outliers in Glass Dataset")

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

A quick way to mitigate skewness would be to use log transformation. Log transformations will bring the higher values closer to zero while not affecting the lower values too much.

Glass %>%
  mutate(across(c(Na, Mg, K, Ca), log1p)) %>%
  select(Na, Mg, K, Ca) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  ggplot(aes(x = value)) +
  geom_histogram(bins = 30, fill = "green", alpha = 0.7) +
  facet_wrap(~variable, scales = "free") +
  theme_minimal()

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

data(Soybean)

Below are the frequency distributions. (Apologies for the cramped display, the commented out code displays all the plots more clearly but makes my presentation much longer! I also encountered some issues with some of the columns being ordered factors, I had to exclude them from the plot you see below, however I was able to plot them using the commented out loop).

The graphs show that all the variables have more than one unique value, and so we don’t see any zero variance degeneration here.

Soybean_cleaned <- Soybean %>%
  mutate(across(everything(), as.factor))

freq_plot <- Soybean_cleaned %>%
  select(-leaf.size, -germ, -plant.stand, -temp, -precip) %>%  
  pivot_longer(cols = everything(), names_to = "variable", values_to = "value") %>%
  filter(!is.na(value)) %>%  # Remove missing data for this step
  group_by(variable, value) %>%
  summarise(count = n()) %>%
  ggplot(aes(x = value, y = count)) +
  geom_bar(stat = "identity") +
  facet_wrap(~ variable, scales = "free") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
  labs(title = "Frequency Distributions of Categorical Variables (Excluding Class and Date)", 
       x = "Value", 
       y = "Count")

# Display the plot
print(freq_plot)

# Soybean_cleaned <- Soybean %>%
#   mutate(across(where(is.ordered), as.factor))
# 
# categorical_vars <- names(Soybean_cleaned)
# 
# for (var in categorical_vars) {
#   if (is.factor(Soybean_cleaned[[var]])) {
#     p <- ggplot(Soybean_cleaned, aes_string(x = var)) +
#       geom_bar() +
#       ggtitle(paste("Frequency of", var)) +
#       theme(axis.text.x = element_text(angle = 90, hjust = 1))
#     print(p)
#   }
# }

Since there are no zero variance predictors, we should instead look at near-zero predictors, ie predictors whose most frequent value makes up more than 90% of the distribution and whose number of unique values is less than 5.

# identify near-zero variance predictors
find_near_zero_variance <- function(data, freq_cutoff = 90, unique_cutoff = 5) {
  stats <- data.frame(
    variable = names(data),
    n_unique = sapply(data, function(x) length(unique(x))),  # Number of unique values
    max_freq = sapply(data, function(x) max(table(x)) / length(x) * 100)  # Max frequency percentage
  )
  
  near_zero_variance <- stats %>%
    filter(n_unique <= unique_cutoff & max_freq >= freq_cutoff)
  
  return(near_zero_variance)
}

near_zero_variance <- find_near_zero_variance(Soybean)
print(near_zero_variance)

##            variable n_unique max_freq
## mycelium   mycelium        3 93.55783
## sclerotia sclerotia        3 91.50805

We can see that mycelium and sclerotia are two such predictors, we can probably remove them. We can change the thresholds above to be more or less strict with regards to which predictors we remove.

(b) Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Hail, sever, seed.tmt, and lodging are tied for first place- they are the most likely predictors to have missing data.

missing_data <- Soybean %>%
  summarise(across(everything(), ~mean(is.na(.)) * 100)) %>%
  pivot_longer(cols = everything(), names_to = "variable", values_to = "percent_missing")

missing_data_filtered <- missing_data %>%
  filter(percent_missing > 0)

missing_data_ordered <- arrange(missing_data_filtered, desc(percent_missing))
missing_data_ordered

## # A tibble: 34 × 2
##    variable        percent_missing
##    <chr>                     <dbl>
##  1 hail                       17.7
##  2 sever                      17.7
##  3 seed.tmt                   17.7
##  4 lodging                    17.7
##  5 germ                       16.4
##  6 leaf.mild                  15.8
##  7 fruiting.bodies            15.5
##  8 fruit.spots                15.5
##  9 seed.discolor              15.5
## 10 shriveling                 15.5
## # ℹ 24 more rows

When broken down by class we can see that some classes have no data for some predictors. If we look at the 2-4-d-injury class, we see that for plant.stand, precip, and temp, there is 100% missing data, meaning all values for these predictors are missing in the that class. We might want to consider classes with low or no missing data, or otherwise impute the missing data in the other classes where it makes sense. With some knowledge of the classes, it might be the case where some of the predictors are not applicable and therefore meaningless to impute.

missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~mean(is.na(.)) * 100))
missing_by_class

## # A tibble: 19 × 36
##    Class   date plant.stand precip  temp  hail crop.hist area.dam sever seed.tmt
##    <fct>  <dbl>       <dbl>  <dbl> <dbl> <dbl>     <dbl>    <dbl> <dbl>    <dbl>
##  1 2-4-d…  6.25         100    100   100 100         100     6.25 100      100  
##  2 alter…  0              0      0     0   0           0     0      0        0  
##  3 anthr…  0              0      0     0   0           0     0      0        0  
##  4 bacte…  0              0      0     0   0           0     0      0        0  
##  5 bacte…  0              0      0     0   0           0     0      0        0  
##  6 brown…  0              0      0     0   0           0     0      0        0  
##  7 brown…  0              0      0     0   0           0     0      0        0  
##  8 charc…  0              0      0     0   0           0     0      0        0  
##  9 cyst-…  0            100    100   100 100           0     0    100      100  
## 10 diapo…  0             40      0     0 100           0     0    100      100  
## 11 diapo…  0              0      0     0   0           0     0      0        0  
## 12 downy…  0              0      0     0   0           0     0      0        0  
## 13 frog-…  0              0      0     0   0           0     0      0        0  
## 14 herbi…  0              0    100     0 100           0     0    100      100  
## 15 phyll…  0              0      0     0   0           0     0      0        0  
## 16 phyto…  0              0      0     0  77.3         0     0     77.3     77.3
## 17 powde…  0              0      0     0   0           0     0      0        0  
## 18 purpl…  0              0      0     0   0           0     0      0        0  
## 19 rhizo…  0              0      0     0   0           0     0      0        0  
## # ℹ 26 more variables: germ <dbl>, plant.growth <dbl>, leaves <dbl>,
## #   leaf.halo <dbl>, leaf.marg <dbl>, leaf.size <dbl>, leaf.shread <dbl>,
## #   leaf.malf <dbl>, leaf.mild <dbl>, stem <dbl>, lodging <dbl>,
## #   stem.cankers <dbl>, canker.lesion <dbl>, fruiting.bodies <dbl>,
## #   ext.decay <dbl>, mycelium <dbl>, int.discolor <dbl>, sclerotia <dbl>,
## #   fruit.pods <dbl>, fruit.spots <dbl>, seed <dbl>, mold.growth <dbl>,
## #   seed.discolor <dbl>, seed.size <dbl>, shriveling <dbl>, roots <dbl>

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Ive decided to remove the predictors with more than 80& missing data. I then imputed the missing data with the most frequent value using mode since these are categorical variables.

Soybean_cleaned <- Soybean %>%
  select(where(~ mean(is.na(.)) <= 0.20))

impute_mode <- function(x) {
  if (is.factor(x)) {
    return(factor(replace(x, is.na(x), names(sort(table(x), decreasing = TRUE))[1])))
  } else {
    return(x)
  }
}

Soybean_imputed <- Soybean %>%
  mutate(across(everything(), impute_mode))

head(Soybean_imputed, n=10)

##                    Class date plant.stand precip temp hail crop.hist area.dam
## 1  diaporthe-stem-canker    6           0      2    1    0         1        1
## 2  diaporthe-stem-canker    4           0      2    1    0         2        0
## 3  diaporthe-stem-canker    3           0      2    1    0         1        0
## 4  diaporthe-stem-canker    3           0      2    1    0         1        0
## 5  diaporthe-stem-canker    6           0      2    1    0         2        0
## 6  diaporthe-stem-canker    5           0      2    1    0         3        0
## 7  diaporthe-stem-canker    5           0      2    1    0         2        0
## 8  diaporthe-stem-canker    4           0      2    1    1         1        0
## 9  diaporthe-stem-canker    6           0      2    1    0         3        0
## 10 diaporthe-stem-canker    4           0      2    1    0         2        0
##    sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg leaf.size
## 1      1        0    0            1      1         0         2         2
## 2      2        1    1            1      1         0         2         2
## 3      2        1    2            1      1         0         2         2
## 4      2        0    1            1      1         0         2         2
## 5      1        0    2            1      1         0         2         2
## 6      1        0    1            1      1         0         2         2
## 7      1        1    0            1      1         0         2         2
## 8      1        0    2            1      1         0         2         2
## 9      1        1    1            1      1         0         2         2
## 10     2        0    2            1      1         0         2         2
##    leaf.shread leaf.malf leaf.mild stem lodging stem.cankers canker.lesion
## 1            0         0         0    1       1            3             1
## 2            0         0         0    1       0            3             1
## 3            0         0         0    1       0            3             0
## 4            0         0         0    1       0            3             0
## 5            0         0         0    1       0            3             1
## 6            0         0         0    1       0            3             0
## 7            0         0         0    1       1            3             1
## 8            0         0         0    1       0            3             1
## 9            0         0         0    1       0            3             1
## 10           0         0         0    1       0            3             1
##    fruiting.bodies ext.decay mycelium int.discolor sclerotia fruit.pods
## 1                1         1        0            0         0          0
## 2                1         1        0            0         0          0
## 3                1         1        0            0         0          0
## 4                1         1        0            0         0          0
## 5                1         1        0            0         0          0
## 6                1         1        0            0         0          0
## 7                1         1        0            0         0          0
## 8                1         1        0            0         0          0
## 9                1         1        0            0         0          0
## 10               1         1        0            0         0          0
##    fruit.spots seed mold.growth seed.discolor seed.size shriveling roots
## 1            4    0           0             0         0          0     0
## 2            4    0           0             0         0          0     0
## 3            4    0           0             0         0          0     0
## 4            4    0           0             0         0          0     0
## 5            4    0           0             0         0          0     0
## 6            4    0           0             0         0          0     0
## 7            4    0           0             0         0          0     0
## 8            4    0           0             0         0          0     0
## 9            4    0           0             0         0          0     0
## 10           4    0           0             0         0          0     0