Do problems 3.1 and 3.2 in the Kuhn and Johnson book Applied Predictive Modeling

3.1. (a) Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.

library(mlbench)
library(tidyverse)

data(Glass)
glass <- as_tibble(Glass)

# Histograms of each predictor
glass |>
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") |>
  ggplot(aes(x = Value)) +
  geom_histogram(bins = 30, fill = "grey70", color = "white") +
  facet_wrap(~ Predictor, scales = "free_x") +
  labs(title = "Distributions of Glass Predictors")

# Boxplots by class for each predictor
glass |>
  pivot_longer(-Type, names_to = "Predictor", values_to = "Value") |>
  ggplot(aes(x = Type, y = Value)) +
  geom_boxplot(outlier.alpha = 0.6) +
  facet_wrap(~ Predictor, scales = "free_y") +
  labs(title = "Predictors by Glass Type")

# Scatterplot matrix with densities and pairwise correlations by glass type
GGally::ggpairs(glass, columns = 1:9, aes(color = Type, alpha = 0.6))
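
The ggpairs matrix reports pairwise correlations in its upper triangle, but a dedicated heatmap can make the strongest linear relationships among the chemical predictors easier to scan at a glance. A minimal sketch, assuming the corrplot package is installed:

# Correlation heatmap of the nine chemical predictors
# (sketch; assumes the corrplot package is available)
corrplot::corrplot(cor(glass |> select(-Type)),
                   method = "color", type = "upper",
                   addCoef.col = "black", tl.col = "black")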

(b) Do there appear to be any outliers in the data? Are any predictors skewed?

Yes. Several predictors show potential outliers, with values lying noticeably far from the main bulk of the data—especially for K, Ba, Fe, and in some cases Mg and Al—and these extremes likely reflect genuine differences between glass types rather than obvious data errors. In terms of shape, multiple variables are right-skewed, particularly Ba, K, and Fe, which have many observations near zero and a small number of much larger values, while others such as RI and Si are comparatively more symmetric though still not perfectly normal.
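
To quantify the visual impression of skew, the per-predictor skewness statistic can be computed directly. A minimal sketch, assuming the e1071 package (whose skewness() function the chapter also uses) is installed:

# Skewness of each predictor, largest in magnitude first
glass |>
  summarise(across(-Type, e1071::skewness)) |>
  pivot_longer(everything(), names_to = "Predictor", values_to = "Skewness") |>
  arrange(desc(abs(Skewness)))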

(c) Are there any relevant transformations of one or more predictors that might improve the classification model?

Yes. Centering and scaling all predictors would help prevent large-scale variables from dominating the model, and applying log or Box-Cox/Yeo-Johnson transformations to right-skewed variables such as Ba, K, and Fe can reduce skewness and the impact of extremes, potentially improving classification while preserving informative high values.
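
One way to apply these steps, sketched with the book's companion caret package; Yeo-Johnson is used rather than Box-Cox because Ba, K, and Fe contain exact zeros, which Box-Cox cannot handle:

# Center, scale, and Yeo-Johnson transform the nine predictors (sketch)
pp <- caret::preProcess(as.data.frame(glass[, 1:9]),
                        method = c("center", "scale", "YeoJohnson"))
glass_trans <- predict(pp, as.data.frame(glass[, 1:9]))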

3.2. The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.

library(mlbench)
data(Soybean)
## See ?Soybean for details

(a) Investigate the frequency distributions for the categorical predictors. Are any of the distributions degenerate in the ways discussed earlier in this chapter?

# Soybean is already a data frame; tabulate each variable to inspect its
# frequency distribution (table() drops missing values by default)
lapply(Soybean, table)
## $Class
## 
##                2-4-d-injury         alternarialeaf-spot 
##                          16                          91 
##                 anthracnose            bacterial-blight 
##                          44                          20 
##           bacterial-pustule                  brown-spot 
##                          20                          92 
##              brown-stem-rot                charcoal-rot 
##                          44                          20 
##               cyst-nematode diaporthe-pod-&-stem-blight 
##                          14                          15 
##       diaporthe-stem-canker                downy-mildew 
##                          20                          20 
##          frog-eye-leaf-spot            herbicide-injury 
##                          91                           8 
##      phyllosticta-leaf-spot            phytophthora-rot 
##                          20                          88 
##              powdery-mildew           purple-seed-stain 
##                          20                          20 
##        rhizoctonia-root-rot 
##                          20 
## 
## $date
## 
##   0   1   2   3   4   5   6 
##  26  75  93 118 131 149  90 
## 
## $plant.stand
## 
##   0   1 
## 354 293 
## 
## $precip
## 
##   0   1   2 
##  74 112 459 
## 
## $temp
## 
##   0   1   2 
##  80 374 199 
## 
## $hail
## 
##   0   1 
## 435 127 
## 
## $crop.hist
## 
##   0   1   2   3 
##  65 165 219 218 
## 
## $area.dam
## 
##   0   1   2   3 
## 123 227 145 187 
## 
## $sever
## 
##   0   1   2 
## 195 322  45 
## 
## $seed.tmt
## 
##   0   1   2 
## 305 222  35 
## 
## $germ
## 
##   0   1   2 
## 165 213 193 
## 
## $plant.growth
## 
##   0   1 
## 441 226 
## 
## $leaves
## 
##   0   1 
##  77 606 
## 
## $leaf.halo
## 
##   0   1   2 
## 221  36 342 
## 
## $leaf.marg
## 
##   0   1   2 
## 357  21 221 
## 
## $leaf.size
## 
##   0   1   2 
##  51 327 221 
## 
## $leaf.shread
## 
##   0   1 
## 487  96 
## 
## $leaf.malf
## 
##   0   1 
## 554  45 
## 
## $leaf.mild
## 
##   0   1   2 
## 535  20  20 
## 
## $stem
## 
##   0   1 
## 296 371 
## 
## $lodging
## 
##   0   1 
## 520  42 
## 
## $stem.cankers
## 
##   0   1   2   3 
## 379  39  36 191 
## 
## $canker.lesion
## 
##   0   1   2   3 
## 320  83 177  65 
## 
## $fruiting.bodies
## 
##   0   1 
## 473 104 
## 
## $ext.decay
## 
##   0   1   2 
## 497 135  13 
## 
## $mycelium
## 
##   0   1 
## 639   6 
## 
## $int.discolor
## 
##   0   1   2 
## 581  44  20 
## 
## $sclerotia
## 
##   0   1 
## 625  20 
## 
## $fruit.pods
## 
##   0   1   2   3 
## 407 130  14  48 
## 
## $fruit.spots
## 
##   0   1   2   4 
## 345  75  57 100 
## 
## $seed
## 
##   0   1 
## 476 115 
## 
## $mold.growth
## 
##   0   1 
## 524  67 
## 
## $seed.discolor
## 
##   0   1 
## 513  64 
## 
## $seed.size
## 
##   0   1 
## 532  59 
## 
## $shriveling
## 
##   0   1 
## 539  38 
## 
## $roots
## 
##   0   1   2 
## 551  86  15

Many of the categorical predictors have highly unbalanced distributions, with one level dominating or some levels occurring extremely rarely. For example, mycelium is 0 for 639 of its 645 non-missing records, sclerotia is 0 for 625 of 645, and leaf.mild has 535 observations in one level versus 20 in each of the other two. These predictors carry almost no information and are effectively near-degenerate in the sense discussed in the chapter.
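
This can be checked formally with the near-zero-variance diagnostic from caret, which flags predictors whose dominant level is overwhelmingly more frequent than the next and which take very few distinct values. A minimal sketch:

# Per-predictor frequency ratio and percent-unique metrics; rows with
# nzv == TRUE are the near-degenerate predictors
nzv <- caret::nearZeroVar(Soybean, saveMetrics = TRUE)
nzv[nzv$nzv, ]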

(c) Develop a strategy for handling missing data, either by eliminating predictors or imputation.

First, remove predictors that are almost always missing or effectively constant, since they add noise but little information. For the remaining variables, use imputation tailored to type and class: for categorical predictors, replace missing values with the most frequent level within the same class when possible (otherwise global mode); for any ordered/numeric predictors, use simple methods like median (optionally within class). This preserves data size, respects class structure, and avoids overfitting complex imputation to such a small sample.
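
A minimal sketch of that strategy: base R counts the missing values per predictor, and impute_mode (an illustrative helper, not from the book) fills factor NAs with the most frequent level, first within class and then globally for any predictor that is entirely missing within a class. The dplyr verbs used here were loaded with the tidyverse above.

# How many missing values does each predictor have?
sort(colSums(is.na(Soybean)), decreasing = TRUE)

# Illustrative helper: replace NAs in a factor with its most frequent
# observed level; leaves the vector unchanged if nothing is observed
impute_mode <- function(x) {
  tab <- table(x)                        # table() drops NAs by default
  if (sum(tab) == 0) return(x)           # all values missing in this group
  x[is.na(x)] <- names(which.max(tab))
  x
}

# First pass: class-wise mode (across() skips the grouping variable Class);
# second pass: global mode for anything still missing
Soybean_imputed <- Soybean |>
  group_by(Class) |>
  mutate(across(where(is.factor), impute_mode)) |>
  ungroup() |>
  mutate(across(where(is.factor), impute_mode))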