The required libraries and the data can be loaded as follows:

library(mlbench)
library(ggplot2)
library(tidyr)
library(dplyr)
library(corrplot)
## corrplot 0.95 loaded
data(Glass)

(a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

Predictors:

predictors <- Glass |>
  select(-Type)

head(predictors)
##        RI    Na   Mg   Al    Si    K   Ca Ba   Fe
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75  0 0.00
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83  0 0.00
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78  0 0.00
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22  0 0.00
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07  0 0.00
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07  0 0.26

Visualizations:

par(mfrow=c(3,3))
par(mai=c(.3,.3,.3,.3))
for (predictor in names(predictors)) {
  hist(predictors[[predictor]], main = predictor, col='lightblue')
}

Comment: Al and Ca look approximately normal. Na and RI are skewed right. Ba, Fe, and K have many 0 values with some outliers. Mg seems to be bimodal with peaks around 0 and 3.5. The response variable, Type, is also imbalanced across classes.

par(mfrow=c(3,3))
par(mai=c(.25,.25,.25,.25))
for (predictor in names(predictors)) {
  boxplot(predictors[[predictor]],
          main = predictor,
          col = 'lightblue',
          horizontal = TRUE)
}

Correlation Plot:

corrplot(cor(predictors), 
         method="color",
         diag=FALSE,
         type="lower",
         addCoef.col = "black",
         number.cex=0.70)

Comment: According to the correlation plot:

There is a strong positive correlation between Ca and RI.

There are moderate positive correlations between Ba and Al, Ba and Na, and K and Al.

There are moderate negative correlations between Si and RI, Ba and Mg, Al and Mg, and Ca and Mg.
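
The pairs read off the plot can also be listed programmatically. The following is a minimal sketch, assuming mlbench is installed; the 0.3 cutoff and the names glass_predictors, cor_pairs, and strong_pairs are my own choices for illustration, not thresholds from the exercise.

```r
library(mlbench)
data(Glass)

glass_predictors <- Glass[, names(Glass) != "Type"]
cor_matrix <- cor(glass_predictors)

# Keep each pair once by looking only at the lower triangle
pair_index <- which(lower.tri(cor_matrix), arr.ind = TRUE)
cor_pairs <- data.frame(
  var1 = rownames(cor_matrix)[pair_index[, "row"]],
  var2 = colnames(cor_matrix)[pair_index[, "col"]],
  r    = cor_matrix[pair_index]
)

# The 0.3 cutoff is an arbitrary illustration, not a rule from the text
strong_pairs <- cor_pairs[abs(cor_pairs$r) >= 0.3, ]
strong_pairs[order(-abs(strong_pairs$r)), ]
```

Sorting by absolute correlation puts the pairs most likely to be redundant at the top, whichever sign they have.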

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Answer: Na is close to normally distributed with a slight right skew. Al, RI, and Ca are also right-skewed. Fe, Ba, and K are all severely right-skewed. Si is left-skewed, and Mg is bimodal and also left-skewed. The boxplots show a number of outliers for every predictor except Mg.
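
The visual impressions above can be checked numerically. This is a rough sketch, assuming mlbench is installed; sample_skewness is a hypothetical helper I wrote out in base R so that no extra package is needed, and the outlier count uses the same 1.5 x IQR rule that boxplot() applies by default.

```r
library(mlbench)
data(Glass)

# Sample skewness: third central moment over the cubed standard deviation.
# Positive values indicate a right skew, negative values a left skew.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / (mean(centered^2)^1.5)
}

glass_predictors <- Glass[, names(Glass) != "Type"]
skew_values <- sapply(glass_predictors, sample_skewness)

# Most right-skewed predictors first
round(sort(skew_values, decreasing = TRUE), 2)

# Number of points beyond the boxplot whiskers for each predictor
outlier_counts <- sapply(glass_predictors,
                         function(x) length(boxplot.stats(x)$out))
outlier_counts
```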

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

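Answer: Since RI, K, Ca, Ba, and Fe are all right-skewed, a log or Box-Cox transformation could reduce skewness and make the distributions more symmetric. Both require strictly positive values, however, and K, Ba, and Fe contain zeros, so those predictors would need an offset such as log(x + 1) or a Yeo-Johnson transformation instead.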

For Na, Al, and Si, I believe no transformation is strictly necessary since the distributions are already approximately normal. However, Na and Al are slightly right-skewed and Si is slightly left-skewed, so a Box-Cox transformation may still be beneficial.

Since the predictors are on different scales, it would be good to standardize them by applying z-score standardization.
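
The transformations discussed above can be sketched in base R as follows. This is one possible approach, not the only one: I use log1p() for the zero-inflated predictors because a plain log is undefined at zero, and the choice of which columns get which transform is my own assumption based on the skewness comments. The names transformed and standardized are illustrative.

```r
library(mlbench)
data(Glass)

glass_predictors <- Glass[, names(Glass) != "Type"]

zero_inflated <- c("K", "Ba", "Fe")   # right-skewed and contain zeros
positive_skew <- c("RI", "Ca")        # right-skewed and strictly positive

# Guard: a plain log is only valid for strictly positive values
stopifnot(all(glass_predictors[positive_skew] > 0))

transformed <- glass_predictors
# log1p(x) = log(1 + x), which is defined at x = 0
transformed[zero_inflated] <- lapply(glass_predictors[zero_inflated], log1p)
transformed[positive_skew] <- lapply(glass_predictors[positive_skew], log)

# z-score standardization: subtract the mean, divide by the standard deviation
standardized <- as.data.frame(scale(transformed))

round(colMeans(standardized), 10)
sapply(standardized, sd)
```

If the caret package is available, preProcess() with the "YeoJohnson", "center", and "scale" methods is an alternative that estimates the transformation from the data rather than fixing it by hand.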

Exercise 3.2: The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

Data Set:

data(Soybean)
str(Soybean)
## 'data.frame':    683 obs. of  36 variables:
##  $ Class          : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
##  $ date           : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
##  $ plant.stand    : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ precip         : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ temp           : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ hail           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
##  $ crop.hist      : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
##  $ area.dam       : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
##  $ sever          : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
##  $ seed.tmt       : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
##  $ germ           : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
##  $ plant.growth   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaves         : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ leaf.halo      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.marg      : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.size      : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
##  $ leaf.shread    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.malf      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ leaf.mild      : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ stem           : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ lodging        : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
##  $ stem.cankers   : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
##  $ canker.lesion  : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
##  $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
##  $ ext.decay      : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
##  $ mycelium       : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ int.discolor   : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
##  $ sclerotia      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.pods     : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
##  $ fruit.spots    : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
##  $ seed           : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ mold.growth    : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.discolor  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ seed.size      : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ shriveling     : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
##  $ roots          : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...

Bar plots of each predictor:

predictors <- Soybean |>
  select(-Class)

# Use the .data pronoun so ggplot2 looks the column up inside the plot data
for (predictor in names(predictors)) {
  print(
    ggplot(data = predictors, aes(x = .data[[predictor]])) +
      geom_bar() +
      labs(title = paste("Bar plot of", predictor), x = predictor)
  )
}

Comment: Many of the predictors have missing values. Several predictors are also very imbalanced, with almost all of the observations falling in a single level, such as leaf.malf, leaf.mild, lodging, mycelium, int.discolor, sclerotia, mold.growth, seed.discolor, seed.size, and shriveling.
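
The imbalance seen in the bar plots can be screened with a frequency-ratio rule of thumb. The sketch below is written in base R, assuming mlbench is installed; freq_ratio and pct_unique are hypothetical helpers, and the cutoffs of 19 and 10% are assumptions chosen to mirror the defaults of caret's nearZeroVar() as I understand them.

```r
library(mlbench)
data(Soybean)

soy_predictors <- Soybean[, names(Soybean) != "Class"]

# Ratio of the most common level's count to the second most common level's count
freq_ratio <- function(x) {
  counts <- sort(table(x), decreasing = TRUE)  # table() drops NA by default
  if (length(counts) < 2 || counts[2] == 0) return(Inf)
  as.numeric(counts[1] / counts[2])
}

# Percentage of distinct observed values relative to the number of rows
pct_unique <- function(x) 100 * length(unique(x[!is.na(x)])) / length(x)

degeneracy <- data.frame(
  predictor  = names(soy_predictors),
  freq_ratio = sapply(soy_predictors, freq_ratio),
  pct_unique = sapply(soy_predictors, pct_unique),
  row.names  = NULL
)

# Cutoffs of 19 and 10% are assumed here for illustration
flagged <- degeneracy[degeneracy$freq_ratio > 19 & degeneracy$pct_unique < 10, ]
flagged[order(-flagged$freq_ratio), ]
```

A predictor that looks imbalanced in a bar plot is not necessarily flagged by this rule, so the two views are best read together.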

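(b) Roughly 18% of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?

Missing percentage of variables: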

We can calculate the percentage of data missing from each variable.

missing_table <- Soybean %>%
  summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
  pivot_longer(
    cols = everything(),
    names_to = "Variable",
    values_to = "Missing_Percent"
  )

missing_table <- missing_table %>%
  arrange(desc(Missing_Percent))

missing_table
## # A tibble: 36 × 2
##    Variable        Missing_Percent
##    <chr>                     <dbl>
##  1 hail                       17.7
##  2 sever                      17.7
##  3 seed.tmt                   17.7
##  4 lodging                    17.7
##  5 germ                       16.4
##  6 leaf.mild                  15.8
##  7 fruiting.bodies            15.5
##  8 fruit.spots                15.5
##  9 seed.discolor              15.5
## 10 shriveling                 15.5
## # ℹ 26 more rows

Missing values by predictor:

Soybean %>%
  summarise(across(everything(), ~ sum(is.na(.)))) %>%
  pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
  ggplot(aes(x = reorder(Variable, -Missing), y = Missing)) +
  geom_col(fill = "Lightblue") +
  coord_flip() +
  labs(title = "Missing Values by Predictor",
       x = "Predictor", y = "Number of Missing Values") +
  theme(
    plot.title = element_text(hjust = 0.5)
  )

Missingness by predictor and class:

Soybean %>%
  group_by(Class) %>%
  summarise(across(everything(), ~ mean(is.na(.))), .groups = "drop") %>%
  pivot_longer(-Class, names_to = "Variable", values_to = "PropMissing") %>%
  ggplot(aes(x = Variable, y = Class, fill = PropMissing)) +
  geom_tile() +
  scale_fill_gradient(low = "blue", high = "white") +
  labs(title = "Proportion Missing by Predictor and Class",
       x = "Predictor", y = "Class", fill = "Proportion Missing") +
  theme(
    plot.title = element_text(hjust = 0.5),
    axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
  )

Comment: The plot of missing proportions by class and predictor is very helpful, as it shows that the missing values occur in only a few classes: 2-4-d-injury, phytophthora-rot, herbicide-injury, diaporthe-pod-&-stem-blight, and cyst-nematode. The values are therefore unlikely to be missing at random; the missingness is tied to the class.
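
The same pattern can be summarized as a small table of incomplete rows per class. This is a minimal sketch, assuming mlbench is installed; row_has_na and missing_by_class are names I introduce for illustration.

```r
library(mlbench)
data(Soybean)

# TRUE for rows with at least one missing predictor value
row_has_na <- !complete.cases(Soybean[, names(Soybean) != "Class"])

# Incomplete and total row counts per class, in factor-level order
missing_by_class <- data.frame(
  class      = levels(Soybean$Class),
  incomplete = as.vector(tapply(row_has_na, Soybean$Class, sum)),
  total      = as.vector(table(Soybean$Class))
)
missing_by_class$pct_incomplete <-
  round(100 * missing_by_class$incomplete / missing_by_class$total, 1)

# Only the classes that contain incomplete rows
missing_by_class[missing_by_class$incomplete > 0, ]
```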

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

Comment: Missing data are first quantified for each predictor. Predictors with more than 50% missing values would be removed because too little information remains to impute them; here the most any predictor is missing is 17.7%, so none are dropped at this step. For the remaining categorical predictors, missing values are imputed with the most frequent level within the same class (class-conditional mode imputation) to preserve disease-specific structure. After imputation, near-zero variance predictors are removed. This approach retains every observation while discarding predictors that carry little information.
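
The imputation step of this strategy could look roughly like the sketch below. It is an illustration under stated assumptions rather than a definitive implementation: mode_level and impute_by_class are hypothetical helpers, and falling back to the overall mode when a class has no observed values for a predictor is my own choice. Because it uses the class label, this kind of imputation is only possible where the outcome is known, such as on training data.

```r
library(mlbench)
data(Soybean)

# Most frequent observed level; NA if nothing is observed
mode_level <- function(x) {
  counts <- table(x)  # NA values are excluded by default
  if (sum(counts) == 0) return(NA_character_)
  names(counts)[which.max(counts)]
}

impute_by_class <- function(data, class_col = "Class") {
  predictors <- setdiff(names(data), class_col)
  for (p in predictors) {
    overall_mode <- mode_level(data[[p]])
    for (cl in levels(data[[class_col]])) {
      in_class <- data[[class_col]] == cl
      rows <- which(in_class & is.na(data[[p]]))
      if (length(rows) == 0) next
      class_mode <- mode_level(data[[p]][in_class])
      # Fall back to the overall mode when the class has no observed values
      data[[p]][rows] <- if (is.na(class_mode)) overall_mode else class_mode
    }
  }
  data
}

soy_imputed <- impute_by_class(Soybean)

# Missing cells before and after imputation
sum(is.na(Soybean))
sum(is.na(soy_imputed))
```

Assigning the mode by level name keeps each column a factor with its original levels, so ordered predictors such as precip stay ordered.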