3.1

The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.

The data can be accessed via:

library(mlbench) data(Glass) str(Glass)

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   4.0.0     ✔ tibble    3.3.0
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.4     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(GGally)
library(ggcorrplot)
## Warning: package 'ggcorrplot' was built under R version 4.5.2
library(e1071)
## Warning: package 'e1071' was built under R version 4.5.2
## 
## Attaching package: 'e1071'
## 
## The following object is masked from 'package:ggplot2':
## 
##     element
library(forcats)
library(tidymodels)
## Warning: package 'tidymodels' was built under R version 4.5.2
## ── Attaching packages ────────────────────────────────────── tidymodels 1.4.1 ──
## ✔ broom        1.0.10     ✔ rsample      1.3.1 
## ✔ dials        1.4.2      ✔ tailor       0.1.0 
## ✔ infer        1.1.0      ✔ tune         2.0.1 
## ✔ modeldata    1.5.1      ✔ workflows    1.3.0 
## ✔ parsnip      1.4.1      ✔ workflowsets 1.1.1 
## ✔ recipes      1.3.1      ✔ yardstick    1.3.2
## Warning: package 'dials' was built under R version 4.5.2
## Warning: package 'infer' was built under R version 4.5.2
## Warning: package 'modeldata' was built under R version 4.5.2
## Warning: package 'parsnip' was built under R version 4.5.2
## Warning: package 'rsample' was built under R version 4.5.2
## Warning: package 'tailor' was built under R version 4.5.2
## Warning: package 'tune' was built under R version 4.5.2
## Warning: package 'workflows' was built under R version 4.5.2
## Warning: package 'workflowsets' was built under R version 4.5.2
## Warning: package 'yardstick' was built under R version 4.5.2
## ── Conflicts ───────────────────────────────────────── tidymodels_conflicts() ──
## ✖ scales::discard()       masks purrr::discard()
## ✖ e1071::element()        masks ggplot2::element()
## ✖ dplyr::filter()         masks stats::filter()
## ✖ recipes::fixed()        masks stringr::fixed()
## ✖ dplyr::lag()            masks stats::lag()
## ✖ rsample::permutations() masks e1071::permutations()
## ✖ yardstick::spec()       masks readr::spec()
## ✖ recipes::step()         masks stats::step()
## ✖ tune::tune()            masks parsnip::tune(), e1071::tune()
# load and quick structure check w/ data
data(Glass)
str(Glass)
## 'data.frame':    214 obs. of  10 variables:
##  $ RI  : num  1.52 1.52 1.52 1.52 1.52 ...
##  $ Na  : num  13.6 13.9 13.5 13.2 13.3 ...
##  $ Mg  : num  4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
##  $ Al  : num  1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
##  $ Si  : num  71.8 72.7 73 72.6 73.1 ...
##  $ K   : num  0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
##  $ Ca  : num  8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
##  $ Ba  : num  0 0 0 0 0 0 0 0 0 0 ...
##  $ Fe  : num  0 0 0 0 0 0.26 0 0 0 0.11 ...
##  $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
dim(Glass)
## [1] 214  10
summary(Glass)
##        RI              Na              Mg              Al       
##  Min.   :1.511   Min.   :10.73   Min.   :0.000   Min.   :0.290  
##  1st Qu.:1.517   1st Qu.:12.91   1st Qu.:2.115   1st Qu.:1.190  
##  Median :1.518   Median :13.30   Median :3.480   Median :1.360  
##  Mean   :1.518   Mean   :13.41   Mean   :2.685   Mean   :1.445  
##  3rd Qu.:1.519   3rd Qu.:13.82   3rd Qu.:3.600   3rd Qu.:1.630  
##  Max.   :1.534   Max.   :17.38   Max.   :4.490   Max.   :3.500  
##        Si              K                Ca               Ba       
##  Min.   :69.81   Min.   :0.0000   Min.   : 5.430   Min.   :0.000  
##  1st Qu.:72.28   1st Qu.:0.1225   1st Qu.: 8.240   1st Qu.:0.000  
##  Median :72.79   Median :0.5550   Median : 8.600   Median :0.000  
##  Mean   :72.65   Mean   :0.4971   Mean   : 8.957   Mean   :0.175  
##  3rd Qu.:73.09   3rd Qu.:0.6100   3rd Qu.: 9.172   3rd Qu.:0.000  
##  Max.   :75.41   Max.   :6.2100   Max.   :16.190   Max.   :3.150  
##        Fe          Type  
##  Min.   :0.00000   1:70  
##  1st Qu.:0.00000   2:76  
##  Median :0.00000   3:17  
##  Mean   :0.05701   5:13  
##  3rd Qu.:0.10000   6: 9  
##  Max.   :0.51000   7:29
# target distribution
table(Glass$Type)
## 
##  1  2  3  5  6  7 
## 70 76 17 13  9 29
# check missing values
colSums(is.na(Glass))
##   RI   Na   Mg   Al   Si    K   Ca   Ba   Fe Type 
##    0    0    0    0    0    0    0    0    0    0
# put predictors into a long format for easy faceted plots
glass_long <- Glass %>% 
  pivot_longer(cols = -Type, names_to = "predictor", values_to = "value")
  1. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# histograms for each predictor
ggplot(glass_long, aes(x = value)) +
  geom_histogram(bins = 30) +
  facet_wrap(~ predictor, scales = "free") +
  labs(
    title = "Glass predictors: Histograms", 
    x = "Value", 
    y = "COunt")

# density plots 
ggplot(glass_long, aes(x = value)) +
  geom_density() +
  facet_wrap(~ predictor, scales = "free") +
  labs(
    title = "Glass predictors: Density plots", 
    x = "Value", 
    y = "Density")

# boxplots
ggplot(glass_long, aes(x = predictor, y = value)) +
  geom_boxplot() +
  coord_flip() +
  labs(
    title = "Glass predictors: Boxplots (outlier check)", 
    x = "", 
    y = "Value")

# pairwise relationships
GGally::ggpairs(
  Glass,
  columns = 1:9,
  aes(alpha = 0.6)
)

# correlation heatmap
cor_mat <- cor(Glass %>%
                 select(-Type),
               use = "complete.obs")

ggcorrplot(
  cor_mat,
  hc.order = TRUE,
  type = "lower",
  lab = TRUE
) +
  labs(title = "Correlation heatmap (numeric predictors)")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
##   Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

# visualize how distributions differ by class
ggplot(glass_long, aes(x = Type, y = value)) +
  geom_boxplot() +
  facet_wrap(~ predictor, scales = "free") +
  labs(title = "Predictors by Glass Type", x = "Type", y = "Value")

Using histograms and density plots, I saw that RI, Na, Al, and Si each have one main peak, so their values cluster in a single range. Some other predictors are clearly right skewed and have lots of values near zero. Ba and Fe are the clearest examples because most samples are exactly zero and only a few samples are above zero, which creates a long tail. From the boxplots, Ca and K also look like they have a few unusually large values compared to the rest of the data.

To look at how predictors relate to each other, I used a scatterplot matrix and a correlation heatmap. The heatmap shows a strong positive relationship between RI and Ca, around 0.81. It also shows a moderate negative relationship between Si and RI, around negative 0.54. Overall, the predictors are a mix of measurements that are somewhat related to each other, plus a few variables that are very skewed.

  1. Do there appear to be any outliers in the data? Are any predictors skewed?
# function to flag outliers using boxplot rule
is_outlier_iqr <- function(x) {
  q1 <- quantile(x, 0.25, na.rm = TRUE)
  q3 <- quantile(x, 0.75, na.rm = TRUE)
  iqr <- q3 - q1
  lower <- q1 - 1.5 * iqr
  upper <- q3 + 1.5 * iqr
  (x < lower) | (x > upper)
}

# compute skewness and number of outliers and number of zeros
outlier_skew_summary <- Glass %>%
  select(-Type) %>%  # exclude target class
  summarise(across(
    everything(),
    list(
      skewness = ~ e1071::skewness(.x, na.rm = TRUE, type = 2),
      outliers = ~ sum(is_outlier_iqr(.x), na.rm = TRUE),
      zeros    = ~ sum(.x == 0, na.rm = TRUE)
    ),
    .names = "{.col}_{.fn}"
  )) %>%
  pivot_longer(
    cols = everything(),
    names_to = c("predictor", "metric"),
    names_sep = "_",
    values_to = "value"
  ) %>%
  pivot_wider(names_from = metric, values_from = value) %>%
  arrange(desc(abs(skewness)))

outlier_skew_summary
## # A tibble: 9 × 4
##   predictor skewness outliers zeros
##   <chr>        <dbl>    <dbl> <dbl>
## 1 K            6.55         7    30
## 2 Ba           3.42        38   176
## 3 Ca           2.05        26     0
## 4 Fe           1.75        12   144
## 5 RI           1.63        17     0
## 6 Mg          -1.15         0    42
## 7 Al           0.907       18     0
## 8 Si          -0.730       12     0
## 9 Na           0.454        7     0
# show the top 3 most skewed predictors
top_skew <- outlier_skew_summary %>%
  slice_head(n = 3)
top_skew
## # A tibble: 3 × 4
##   predictor skewness outliers zeros
##   <chr>        <dbl>    <dbl> <dbl>
## 1 K             6.55        7    30
## 2 Ba            3.42       38   176
## 3 Ca            2.05       26     0

Yes, I do see outliers in the Glass predictors using the 1.5 times IQR rule. A few variables have quite a lot of values flagged as outliers, especially Ba with 38 outliers, Ca with 26 outliers, and Fe with 12 outliers. This matches what I saw in the boxplots, where most values are in a tight range but a small number of points sit much higher than the rest.

I also see clear skewness in several predictors. K is the most skewed, with skewness 6.55, so it has a strong right tail. Ba is also very right skewed with skewness 3.42, and it has 176 zeros, which means most samples have Ba equal to zero and only a few have larger values. Ca is right skewed too with skewness 2.05. On the other hand, Mg and Si are left skewed, with skewness values of negative 1.15 and negative 0.73. Overall, the heavy skew and the large number of zeros in some variables, especially Ba and Fe, help explain why the IQR rule flags outliers so often.

  1. are there any relevant transformations of one or more predictors that might improve the classification model?
# build a preprocessing recipe 
rec <- recipe(Type ~ ., data = Glass) %>%
  step_YeoJohnson(all_numeric_predictors()) %>%
  step_normalize(all_numeric_predictors())

rec
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:   1
## predictor: 9
## 
## ── Operations
## • Yeo-Johnson transformation on: all_numeric_predictors()
## • Centering and scaling for: all_numeric_predictors()
# apply the preprocessing to see the transformed data 
prep_rec <- prep(rec)
Glass_processed <- bake(prep_rec, new_data = NULL)
summary(Glass_processed)
##        RI                Na                 Mg                Al          
##  Min.   :-2.3759   Min.   :-3.68296   Min.   :-1.7352   Min.   :-3.09581  
##  1st Qu.:-0.6068   1st Qu.:-0.59533   1st Qu.:-0.7782   1st Qu.:-0.45123  
##  Median :-0.2257   Median :-0.09993   Median : 0.5240   Median :-0.07732  
##  Mean   : 0.0000   Mean   : 0.00000   Mean   : 0.0000   Mean   : 0.00000  
##  3rd Qu.: 0.2608   3rd Qu.: 0.53801   3rd Qu.: 0.6666   3rd Qu.: 0.46462  
##  Max.   : 5.1252   Max.   : 4.25332   Max.   : 1.8719   Max.   : 3.15444  
##        Si                K                 Ca                Ba         
##  Min.   :-3.6679   Min.   :-1.6158   Min.   :-4.6093   Min.   :-0.3521  
##  1st Qu.:-0.4789   1st Qu.:-0.9812   1st Qu.:-0.4686   1st Qu.:-0.3521  
##  Median : 0.1795   Median : 0.4648   Median :-0.1394   Median :-0.3521  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5636   3rd Qu.: 0.5934   3rd Qu.: 0.3281   3rd Qu.:-0.3521  
##  Max.   : 3.5622   Max.   : 3.4444   Max.   : 3.2396   Max.   : 5.9832  
##        Fe          Type  
##  Min.   :-0.5851   1:70  
##  1st Qu.:-0.5851   2:76  
##  Median :-0.5851   3:17  
##  Mean   : 0.0000   5:13  
##  3rd Qu.: 0.4412   6: 9  
##  Max.   : 4.6490   7:29

From the skewness and outlier results in part b, a few predictors are very skewed, especially K with skewness 6.55 and Ba with skewness 3.42. Ba also has a lot of zeros, with 176 samples at zero, so most values pile up at the bottom and only a few are much larger. That creates a strong right tail and makes the IQR rule flag a lot of points as outliers. To make the data easier for a classifier to learn from, it helps to transform the most skewed variables. Using log1p on K, Ba, and Fe is a simple option because it works with zeros and it shrinks the extreme high values.

For a more general approach, I would apply a Yeo Johnson transformation to all numeric predictors since it reduces skewness and can handle zeros. After that, I would normalize the predictors so they are on the same scale, which is important for models like KNN and SVM. Since some predictors are correlated, like RI and Ca, PCA could also be used to reduce overlap between variables and make the model more stable.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via: library(mlbench) data(Soybean)

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
dim(Soybean)
## [1] 683  36
table(Soybean$Class)
## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20
# confirm missing values are NA
soybean_clean <- Soybean %>%
  mutate(across(
    where(is.factor),
    ~ {
      x <- as.character(.x)
      x[x == "?"] <- NA
      factor(x)
    }
  ))

# define predictor names
predictors <- setdiff(names(soybean_clean), "Class")
  1. Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?
# build a frequency table for every predictor
freq_df <- map_dfr(predictors, function(v) {
  tab <- table(soybean_clean[[v]], useNA = "ifany")
  tibble(
    predictor = v,
    level = names(tab),
    n = as.integer(tab),
    pct = as.integer(tab) / sum(tab)
  )
})

# show one predictor's frequency distribution
freq_df %>% 
  filter(predictor == predictors[1]) %>% 
  arrange(desc(n))
## # A tibble: 8 × 4
##   predictor level     n     pct
##   <chr>     <chr> <int>   <dbl>
## 1 date      5       149 0.218  
## 2 date      4       131 0.192  
## 3 date      3       118 0.173  
## 4 date      2        93 0.136  
## 5 date      6        90 0.132  
## 6 date      1        75 0.110  
## 7 date      0        26 0.0381 
## 8 date      <NA>      1 0.00146
deg_summary <- map_dfr(predictors, function(v) {
  x <- soybean_clean[[v]]
  x_nm <- x[!is.na(x)]
  n_nm <- length(x_nm)

  tab <- sort(table(x_nm), decreasing = TRUE)
  n_levels <- length(tab)

  top_n   <- if (n_levels >= 1) as.integer(tab[1]) else 0
  top_pct <- if (n_nm > 0) top_n / n_nm else NA_real_

  second_n <- if (n_levels >= 2) as.integer(tab[2]) else 0
  freq_ratio <- if (second_n == 0) Inf else top_n / second_n

  pct_unique <- if (n_nm > 0) (n_levels / n_nm) * 100 else NA_real_

  zero_var <- (n_levels <= 1)

  near_zero <- (!zero_var) && (freq_ratio > 19) && (pct_unique < 10)

  tibble(
    predictor    = v,
    missing_pct  = mean(is.na(x)),
    n_nonmissing = n_nm,
    n_levels     = n_levels,
    top_level    = if (n_levels >= 1) names(tab)[1] else NA_character_,
    top_pct      = top_pct,
    freq_ratio   = freq_ratio,
    pct_unique   = pct_unique,
    zero_var     = zero_var,
    near_zero    = near_zero
  )
}) %>%
  arrange(desc(zero_var), desc(near_zero), desc(missing_pct), desc(top_pct))

deg_summary
## # A tibble: 35 × 10
##    predictor     missing_pct n_nonmissing n_levels top_level top_pct freq_ratio
##    <chr>               <dbl>        <int>    <int> <chr>       <dbl>      <dbl>
##  1 leaf.mild          0.158           575        3 0           0.930      26.8 
##  2 mycelium           0.0556          645        2 0           0.991     106.  
##  3 sclerotia          0.0556          645        2 0           0.969      31.2 
##  4 lodging            0.177           562        2 0           0.925      12.4 
##  5 hail               0.177           562        2 0           0.774       3.43
##  6 sever              0.177           562        3 1           0.573       1.65
##  7 seed.tmt           0.177           562        3 0           0.543       1.37
##  8 germ               0.164           571        3 1           0.373       1.10
##  9 shriveling         0.155           577        2 0           0.934      14.2 
## 10 seed.discolor      0.155           577        2 0           0.889       8.02
## # ℹ 25 more rows
## # ℹ 3 more variables: pct_unique <dbl>, zero_var <lgl>, near_zero <lgl>
degenerate_vars <- deg_summary %>%
  filter(zero_var | near_zero) %>%
  pull(predictor)

degenerate_vars
## [1] "leaf.mild" "mycelium"  "sclerotia"

I looked at the frequency distributions for all categorical predictors by making frequency tables and including the NA counts. Most variables only have a few levels, usually around two to four, which makes sense since they describe symptoms and environment conditions. Using a near zero variance style check, I found that leaf.mild, mycelium, and sclerotia look degenerate because one category shows up in almost all the rows and there is very little variation. Since these variables do not change much across the dataset, they probably will not add much useful information to a classification model, so they are good candidates to remove or to combine levels during preprocessing.

  1. Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
missing_by_var <- tibble(
  predictor = predictors,
  missing_pct = map_dbl(predictors, ~ mean(is.na(soybean_clean[[.x]])))
) %>%
  arrange(desc(missing_pct))

missing_by_var
## # A tibble: 35 × 2
##    predictor       missing_pct
##    <chr>                 <dbl>
##  1 hail                  0.177
##  2 sever                 0.177
##  3 seed.tmt              0.177
##  4 lodging               0.177
##  5 germ                  0.164
##  6 leaf.mild             0.158
##  7 fruiting.bodies       0.155
##  8 fruit.spots           0.155
##  9 seed.discolor         0.155
## 10 shriveling            0.155
## # ℹ 25 more rows
# plot missing % by predictor
ggplot(missing_by_var, aes(x = reorder(predictor, missing_pct), y = missing_pct)) +
  geom_col() +
  coord_flip() +
  labs(title = "Percent missing by predictor", x = NULL, y = "Missing proportion")

# heatmap for each predictor and missing % within each class
missing_by_class <- map_dfr(predictors, function(v) {
  soybean_clean %>%
    group_by(Class) %>%
    summarise(missing_pct = mean(is.na(.data[[v]])), .groups = "drop") %>%
    mutate(predictor = v)
})

ggplot(missing_by_class, aes(x = predictor, y = Class, fill = missing_pct)) +
  geom_tile() +
  labs(title = "Missingness pattern by Class (heatmap)", x = "Predictor", y = "Class") +
  theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1))

chi_results <- map_dfr(predictors, function(v) {
  miss <- is.na(soybean_clean[[v]])
  tbl <- table(miss, soybean_clean$Class)

  test <- suppressWarnings(chisq.test(tbl))

  tibble(
    predictor = v,
    p_value = test$p.value
  )
}) %>%
  mutate(p_adj = p.adjust(p_value, method = "BH")) %>%
  arrange(p_adj)

chi_results
## # A tibble: 35 × 3
##    predictor       p_value     p_adj
##    <chr>             <dbl>     <dbl>
##  1 precip        2.29e-133 6.69e-133
##  2 temp          2.29e-133 6.69e-133
##  3 crop.hist     2.29e-133 6.69e-133
##  4 plant.growth  2.29e-133 6.69e-133
##  5 stem          2.29e-133 6.69e-133
##  6 stem.cankers  2.29e-133 6.69e-133
##  7 canker.lesion 2.29e-133 6.69e-133
##  8 ext.decay     2.29e-133 6.69e-133
##  9 mycelium      2.29e-133 6.69e-133
## 10 int.discolor  2.29e-133 6.69e-133
## # ℹ 25 more rows

Some predictors have more missing values than others. The most missing ones are around 17.7 percent, including hail, sever, seed.tmt, and lodging. Germ is also fairly high at about 16.4 percent, and leaf.mild is about 15.8 percent. A lot of the other variables fall in the middle range with around 10 to 15 percent missing, and a few have very little missing data.

The missing pattern also seems connected to the outcome classes. In the heatmap, some classes show clearly higher missing rates for certain predictors. The chi square tests support this too, since many predictors have extremely small adjusted p values, around 10 to the minus 133 level. This suggests the missingness is not random across classes, so missing values may carry information and should be handled carefully when preprocessing the data.

  1. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
# identify predictors with very high missingness
high_missing_vars <- missing_by_var %>%
  filter(missing_pct > 0.40) %>%  
  pull(predictor)

# combine degenerate predictors 
to_drop <- unique(c(degenerate_vars, high_missing_vars))

to_drop
## [1] "leaf.mild" "mycelium"  "sclerotia"
# recipe strategy for categorical missing data
soy_rec <- recipe(Class ~ ., data = soybean_clean) %>%
  step_nzv(all_predictors()) %>%      
  step_other(all_nominal_predictors(), threshold = 0.01) %>% 
  step_unknown(all_nominal_predictors()) %>%             
  step_impute_mode(all_nominal_predictors())       

soy_rec
## 
## ── Recipe ──────────────────────────────────────────────────────────────────────
## 
## ── Inputs
## Number of variables by role
## outcome:    1
## predictor: 35
## 
## ── Operations
## • Sparse, unbalanced variable filter on: all_predictors()
## • Collapsing factor levels for: all_nominal_predictors()
## • Unknown factor level assignment for: all_nominal_predictors()
## • Mode imputation for: all_nominal_predictors()
prep_rec <- prep(soy_rec)
Soybean_processed <- bake(prep_rec, new_data = NULL)
glimpse(Soybean_processed)
## Rows: 683
## Columns: 33
## $ date            <fct> 6, 4, 3, 3, 6, 5, 5, 4, 6, 4, 6, 4, 3, 6, 6, 5, 6, 4, …
## $ plant.stand     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ precip          <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ temp            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 1, 1, 1, 2, 2, 2, 1, …
## $ hail            <fct> 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, …
## $ crop.hist       <fct> 1, 2, 1, 1, 2, 3, 2, 1, 3, 2, 1, 1, 1, 3, 1, 3, 0, 2, …
## $ area.dam        <fct> 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 3, 2, 3, 3, 3, 2, 2, …
## $ sever           <fct> 1, 2, 2, 2, 1, 1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ seed.tmt        <fct> 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 1, 0, 0, …
## $ germ            <fct> 0, 1, 2, 1, 2, 1, 0, 2, 1, 2, 0, 1, 0, 0, 1, 2, 0, 1, …
## $ plant.growth    <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaves          <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ leaf.halo       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.marg       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.size       <fct> 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ leaf.shread     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ leaf.malf       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stem            <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ lodging         <fct> 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, …
## $ stem.cankers    <fct> 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ canker.lesion   <fct> 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 3, 3, 3, 3, 3, 3, 3, 3, …
## $ fruiting.bodies <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ ext.decay       <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ int.discolor    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ fruit.pods      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ fruit.spots     <fct> 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, 4, …
## $ seed            <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ mold.growth     <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.discolor   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ seed.size       <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ shriveling      <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ roots           <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ Class           <fct> diaporthe-stem-canker, diaporthe-stem-canker, diaporth…

To handle missing data in the Soybean dataset, I used a simple preprocessing plan that fits categorical variables. First, I removed predictors that had almost no variation because they do not add much information for classification. From the frequency check, leaf.mild, mycelium, and sclerotia were flagged as near zero variance, so I removed them.

Next, since missingness looked related to the class labels in part b, I treated missing values as meaningful by turning NA into an explicit unknown category. This keeps the missingness pattern instead of replacing missing values with the most common level. I also grouped very rare levels into an other category to reduce sparsity and make the model more stable. As a backup, I used mode imputation in case any missing values still remain after adding the unknown level. If the model needs numeric inputs, I would also use dummy encoding to convert the categorical predictors into binary columns.