3.2

The UC Irvine Machine Learning Repository6 contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:

library(mlbench)
## Warning: package 'mlbench' was built under R version 4.5.2
data(Glass)
(Glass)

a & b: Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors. Do there appear to be any outliers in the data? Are any predictors skewed?

cor_matrix <- cor(Glass[, 1:9])

ggcorrplot(cor_matrix, 
           method = "square",
           type = "lower",
           lab = TRUE,
           lab_size = 3,
           colors = c("blue", "white", "red"),
           title = "Predictors Correlation Matrix",
           legend.title = "Correlation")
## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggcorrplot package.
##   Please report the issue at <https://github.com/kassambara/ggcorrplot/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

Glass %>%
  select(-Type) %>%
  pivot_longer(cols = everything(), names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value, y = Variable)) +
  geom_boxplot(fill = "pink", outlier.color = "purple", outlier.shape = 16) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(title = "Boxplots of Predictors",
       x = "Value",
       y = NULL) +
  theme_minimal() +
  theme(strip.text = element_text(face = "bold"),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank())

Glass %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(aes(y = after_stat(density)), 
                 bins = 30, fill = "purple", color = "black", alpha = 0.7) +
  facet_wrap(~Variable, scales = "free", ncol = 3) +
  labs(title = "Histograms of Predictors",
       x = "Value", 
       y = "Density") +
  theme_minimal()

From the histogram, we can see that there is some highly skewed predictors to the right such as K, Ba and Fe. While Mg is left skewed. The other predictors: Al, Ca, Na, RI, and Si have relatively normal symmetry. In our box plot, all predictors outside of Mg have a noticable amount of outliers. In our Correlation Matrix, one noticeably high correlation is with Ca and RI.

c: Are there any relevant transformations of one or more predictors that might improve the classification model?

There are many outliers that are right skewed and would need to be transformed. However, we must also take into account the amount of zeros. Because of this, I chose to apply a cube root transformation since it accounts for the zeros.

glass_cbrt <- Glass %>%
  mutate(
    K  = K^(1/3),
    Ba = Ba^(1/3),
    Fe = Fe^(1/3)
  ) %>%
  mutate(across(-Type, scale))

glass_cbrt %>%
  pivot_longer(cols = -Type, names_to = "Variable", values_to = "Value") %>%
  ggplot(aes(x = Value)) +
  geom_histogram(fill = "coral", alpha = 0.6) +
  facet_wrap(~ Variable, scales = "free", ncol = 3) +
  labs(title = "Cube Root Transformed Predictors Histogram",
       x = "Transformed Value", 
       y = "Density") +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value `binwidth`.

3.2

The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmen- tal conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.

The data can be loaded via:

library(mlbench)
data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

a: Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

#get predictos
predictor_cols <- names(Soybean)[-1]

#func examining frequency distributions
freq_dist_exam <- function(data, predictor) {
  freq_table <- table(data[[predictor]], useNA = "ifany")
  prop_table <- prop.table(freq_table) * 100
  
  data.frame(
    Predictor = predictor,
    Level = names(freq_table),
    Count = as.vector(freq_table),
    Percentage = round(as.vector(prop_table), 2)
  )
}

#apply to predictors
all_frequencies <- do.call(rbind, 
  lapply(predictor_cols, function(p) freq_dist_exam(Soybean, p)))

#identify degenerate frequenct
degenerate_checker <- all_frequencies %>%
  group_by(Predictor) %>%
  summarise(
    n_levels = n(),
    max_pct = max(Percentage),
    min_pct = min(Percentage),
    single_dominant = max_pct > 95
  ) %>%
  arrange(desc(max_pct))

print(degenerate_checker)
## # A tibble: 35 × 5
##    Predictor    n_levels max_pct min_pct single_dominant
##    <chr>           <int>   <dbl>   <dbl> <lgl>          
##  1 mycelium            3    93.6    0.88 FALSE          
##  2 sclerotia           3    91.5    2.93 FALSE          
##  3 leaves              2    88.7   11.3  FALSE          
##  4 int.discolor        4    85.1    2.93 FALSE          
##  5 leaf.malf           3    81.1    6.59 FALSE          
##  6 roots               4    80.7    2.2  FALSE          
##  7 shriveling          3    78.9    5.56 FALSE          
##  8 leaf.mild           4    78.3    2.93 FALSE          
##  9 seed.size           3    77.9    8.64 FALSE          
## 10 mold.growth         3    76.7    9.81 FALSE          
## # ℹ 25 more rows

While all of the predictors aren’t single dominant (having a max percentage greater than 95% for an occurrence), I would say that mycellium and sclerotia could be considered degenerate distributed as they have a very high percentage of 0s. However, many other predictors do also fall into having a highly skewed imbalanced distribution towards zero.

  1. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

From the table, we can see that predictors that are more likely to be missing are hail, sever, seed.tmt and lodging. But many other predictors have similar missing data percentages.

data_missing_pred <- Soybean %>%
  summarise(across(everything(), ~sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Predictor", values_to = "Missing_Count") %>%
  mutate(Missing_Percent = round(Missing_Count / nrow(Soybean) * 100, 2)) %>%
  arrange(desc(Missing_Percent))

print(data_missing_pred)
## # A tibble: 36 × 3
##    Predictor       Missing_Count Missing_Percent
##    <chr>                   <int>           <dbl>
##  1 hail                      121            17.7
##  2 sever                     121            17.7
##  3 seed.tmt                  121            17.7
##  4 lodging                   121            17.7
##  5 germ                      112            16.4
##  6 leaf.mild                 108            15.8
##  7 fruiting.bodies           106            15.5
##  8 fruit.spots               106            15.5
##  9 seed.discolor             106            15.5
## 10 shriveling                106            15.5
## # ℹ 26 more rows

From the below visualization, we can see that the pattern of missing data is related to the class. In particular, there are five classes that account for much of the missing data: 2-4-d-injury, charcoal-rot, diaporthe-pod-&-stem-blight, herbicide-injury and phytophthora-rot.

missing_by_class <- Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~mean(is.na(.)) * 100)) %>%
  pivot_longer(-Class, names_to = "Predictor", values_to = "Missing_Pct")

#I used the top predictors with most missing percentage
missing_by_class %>%
  filter(Predictor %in% c("hail", "sever", "lodging", "seed.tmt")) %>%
  ggplot(aes(x = Class, y = Missing_Pct, fill = Predictor)) +
  geom_col(position = "dodge") +
  labs(title = "Missingness by Class for Selected Predictors",
       x = "Class", y = "Missing Percentage") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Because the dataset uses mostly categorical predictors, has multiple class classifications, and a low percentage of missing data, I feel that eliminating the predictors would not be the answer. Instead, we can try imputation to handle this task. One way I feel would be appropriate is the use of K-nearest neightors imputation through the VIM package, that can estimate missing values from similar complete cases.

Since the KNN estimated vals for the missing cases, there are no missing values in the cleaned soybean dataset.

#find total missing values in df
sum(is.na(cleaned_soy))
## [1] 0
#look at missing vals per column
colSums(is.na(cleaned_soy))
##               Class                date         plant.stand              precip 
##                   0                   0                   0                   0 
##                temp                hail           crop.hist            area.dam 
##                   0                   0                   0                   0 
##               sever            seed.tmt                germ        plant.growth 
##                   0                   0                   0                   0 
##              leaves           leaf.halo           leaf.marg           leaf.size 
##                   0                   0                   0                   0 
##         leaf.shread           leaf.malf           leaf.mild                stem 
##                   0                   0                   0                   0 
##             lodging        stem.cankers       canker.lesion     fruiting.bodies 
##                   0                   0                   0                   0 
##           ext.decay            mycelium        int.discolor           sclerotia 
##                   0                   0                   0                   0 
##          fruit.pods         fruit.spots                seed         mold.growth 
##                   0                   0                   0                   0 
##       seed.discolor           seed.size          shriveling               roots 
##                   0                   0                   0                   0 
##           Class_imp            date_imp     plant.stand_imp          precip_imp 
##                   0                   0                   0                   0 
##            temp_imp            hail_imp       crop.hist_imp        area.dam_imp 
##                   0                   0                   0                   0 
##           sever_imp        seed.tmt_imp            germ_imp    plant.growth_imp 
##                   0                   0                   0                   0 
##          leaves_imp       leaf.halo_imp       leaf.marg_imp       leaf.size_imp 
##                   0                   0                   0                   0 
##     leaf.shread_imp       leaf.malf_imp       leaf.mild_imp            stem_imp 
##                   0                   0                   0                   0 
##         lodging_imp    stem.cankers_imp   canker.lesion_imp fruiting.bodies_imp 
##                   0                   0                   0                   0 
##       ext.decay_imp        mycelium_imp    int.discolor_imp       sclerotia_imp 
##                   0                   0                   0                   0 
##      fruit.pods_imp     fruit.spots_imp            seed_imp     mold.growth_imp 
##                   0                   0                   0                   0 
##   seed.discolor_imp       seed.size_imp      shriveling_imp           roots_imp 
##                   0                   0                   0                   0