Exercise 3.1. The UC Irvine Machine Learning Repository contains a
data set related to glass identification. The data consist of 214 glass
samples labeled as one of seven class categories. There are nine
predictors, including the refractive index and percentages of eight
elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
The required libraries and the data can be loaded as follows:
library(mlbench)
library(ggplot2)
library(tidyr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(corrplot)
## corrplot 0.95 loaded
data(Glass)
(a) Using visualizations, explore the predictor variables to
understand their distributions as well as the relationships between
predictors.
Predictors:
predictors <- Glass |>
select(-Type)
head(predictors)
## RI Na Mg Al Si K Ca Ba Fe
## 1 1.52101 13.64 4.49 1.10 71.78 0.06 8.75 0 0.00
## 2 1.51761 13.89 3.60 1.36 72.73 0.48 7.83 0 0.00
## 3 1.51618 13.53 3.55 1.54 72.99 0.39 7.78 0 0.00
## 4 1.51766 13.21 3.69 1.29 72.61 0.57 8.22 0 0.00
## 5 1.51742 13.27 3.62 1.24 73.08 0.55 8.07 0 0.00
## 6 1.51596 12.79 3.61 1.62 72.97 0.64 8.07 0 0.26
Visualizations:
par(mfrow=c(3,3))
par(mai=c(.3,.3,.3,.3))
for (predictor in names(predictors)) {
hist(predictors[[predictor]], main = predictor, col='lightblue')
}

Correlation Plot:
corrplot(cor(predictors),
method="color",
diag=FALSE,
type="lower",
addCoef.col = "black",
number.cex=0.70)

There is a strong positive correlation between Ca and RI.
There is a moderate positive correlation between the following
pairs:
Ba and Al
Ba and Na
K and Al
There is a moderate negative correlation between the following
pairs:
Si and RI
Ba and Mg
Al and Mg
Ca and Mg
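The pairs above were read off the correlation plot. As a cross-check, they can also be tabulated directly. The following is a minimal sketch in base R; the helper name strong_pairs and the 0.4 cutoff are illustrative assumptions, not part of the original analysis.

```r
# List predictor pairs whose absolute Pearson correlation meets a cutoff.
# `threshold` is a hypothetical cutoff chosen only for illustration.
strong_pairs <- function(df, threshold = 0.4) {
  r <- cor(df)
  # Use the lower triangle so each pair appears once and the diagonal is skipped.
  idx <- which(lower.tri(r) & abs(r) >= threshold, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(r)[idx[, "row"]],
    var2 = colnames(r)[idx[, "col"]],
    r = r[idx]
  )
  # Strongest relationships first, regardless of sign.
  out[order(-abs(out$r)), ]
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_predictors <- Glass[, names(Glass) != "Type"]
  print(strong_pairs(glass_predictors))
}
```

Sorting by absolute value keeps strong negative pairs next to strong positive ones, which matches how the plot is read.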
(b) Do there appear to be any outliers in the data? Are any
predictors skewed?
Answers: Na appears to be approximately normally distributed with a
slight right skew. Al, RI, and Ca also appear to be right-skewed. Fe,
Ba, and K are all severely right-skewed. Si has a left skew, and Mg is
bimodal and also left-skewed. In boxplots of each predictor, there are
outliers for every predictor except Mg.
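The skewness and outlier statements above come from inspecting plots. A rough numerical check might look like the sketch below. The helper names skewness and count_outliers are assumptions introduced here; the outlier count uses the 1.5 * IQR whisker rule that boxplot() applies by default.

```r
# Moment-based sample skewness: positive means right-skewed, negative left-skewed.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

# Number of points that fall beyond the default boxplot whiskers.
count_outliers <- function(x) length(boxplot.stats(x)$out)

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_predictors <- Glass[, names(Glass) != "Type"]
  shape <- data.frame(
    skewness = sapply(glass_predictors, skewness),
    outliers = sapply(glass_predictors, count_outliers)
  )
  print(round(shape, 2))

  # One boxplot per predictor, laid out like the histograms.
  par(mfrow = c(3, 3), mai = c(.3, .3, .3, .3))
  for (p in names(glass_predictors)) {
    boxplot(glass_predictors[[p]], main = p, col = "lightblue")
  }
}
```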
(c) Are there any relevant transformations of one or more predictors
that might improve the classification model?
Answers: Since RI, K, Ca, Ba, and Fe are all right-skewed, a log or
Box-Cox transformation could help reduce skewness and make the
distributions more symmetric. Both require strictly positive values, so
predictors that contain zeros (such as Ba and Fe in the head() output
above) would first need a small offset.
For Na, Al, and Si, no transformation is strictly necessary since
the distributions are already approximately normal. However, Na and Al
have a slight right skew and Si has a slight left skew, so a Box-Cox
transformation (or, for the right-skewed Na and Al, a log transform)
may still be beneficial.
Since the predictors are on different scales, it would be good to
standardize them with z-scores (centering and scaling).
A power transformation cannot remove bimodality, so Mg is left
untransformed.
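To make the transformation step concrete, here is a minimal base R sketch of Box-Cox followed by z-score standardization. It is not the original analysis: the helper names, the [-2, 2] search range for lambda, and the choice to skip predictors that contain zeros are all assumptions made for illustration.

```r
# Box-Cox transform of a strictly positive vector for a given lambda.
box_cox <- function(x, lambda) {
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

# Profile log-likelihood of lambda under a normal model with constant mean.
box_cox_loglik <- function(lambda, x) {
  z <- box_cox(x, lambda)
  -length(x) / 2 * log(mean((z - mean(z))^2)) + (lambda - 1) * sum(log(x))
}

# Maximum-likelihood lambda, searched over an assumed range of [-2, 2].
estimate_lambda <- function(x, interval = c(-2, 2)) {
  stopifnot(all(x > 0))
  optimize(box_cox_loglik, interval = interval, x = x, maximum = TRUE)$maximum
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Glass, package = "mlbench")
  glass_predictors <- Glass[, names(Glass) != "Type"]
  # Box-Cox needs x > 0, so predictors containing zeros are left untouched here.
  positive <- names(glass_predictors)[
    sapply(glass_predictors, function(x) all(x > 0))
  ]
  lambdas <- sapply(glass_predictors[positive], estimate_lambda)
  print(round(lambdas, 2))

  transformed <- glass_predictors
  for (p in positive) {
    transformed[[p]] <- box_cox(transformed[[p]], lambdas[[p]])
  }
  # z-score standardization: subtract the mean, divide by the standard deviation.
  standardized <- as.data.frame(scale(transformed))
  print(round(colMeans(standardized), 3))
}
```

A lambda near 0 corresponds to a log transform and a lambda near 1 to no transformation, so the estimated values indicate how much reshaping each predictor needs.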
Exercise 3.2: The soybean data can also be found at the UC Irvine
Machine Learning Repository. Data were collected to predict disease in
683 soybeans. The 35 predictors are mostly categorical and include
information on the environmental conditions (e.g., temperature,
precipitation) and plant conditions (e.g., leaf spots, mold growth). The
outcome labels consist of 19 distinct classes.
(a) Investigate the frequency distributions for the categorical
predictors. Are any of the distributions degenerate in the ways
discussed earlier in this chapter?
Data Set:
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
Bar plot of each predictor:
predictors <- Soybean |>
  select(-Class)
# Draw one bar plot per predictor.
# .data[[predictor]] is the ggplot2 pronoun for looking up a column by
# name; it avoids the "Use of `predictors[[predictor]]` is discouraged"
# warning that aes(x = predictors[[predictor]]) raises for every plot.
for (predictor in names(predictors)) {
  print(
    ggplot(data = predictors, aes(x = .data[[predictor]])) +
      geom_bar() +
      labs(title = paste("Bar plot of", predictor), x = predictor)
  )
}

Comment: Many of the predictors have missing values. A few of the
predictors are also very imbalanced, with almost all of the observations
falling in a single level, such as leaf.malf, leaf.mild, lodging,
mycelium, int.discolor, sclerotia, mold.growth, seed.discolor,
seed.size, and shriveling.
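The imbalance judged from the bar plots can be checked against the chapter's rule of thumb for degenerate predictors: the most common value is far more frequent than the second most common (a ratio around 20 or more), and the fraction of unique values is low (around 10% or less). The sketch below is one way to compute this in base R; degeneracy_table and its default cutoffs are assumptions introduced for illustration.

```r
# Frequency ratio and percent of unique values for every column of a data frame.
degeneracy_table <- function(df, freq_cut = 20, unique_cut = 10) {
  stats <- lapply(df, function(col) {
    counts <- sort(table(col), decreasing = TRUE)  # table() ignores NA by default
    freq_ratio <- if (length(counts) > 1 && counts[[2]] > 0) {
      counts[[1]] / counts[[2]]
    } else {
      Inf  # only one observed value
    }
    c(freq_ratio = freq_ratio,
      pct_unique = 100 * sum(counts > 0) / length(col))
  })
  out <- as.data.frame(do.call(rbind, stats))
  # Flag a predictor only when both conditions hold.
  out$flagged <- out$freq_ratio > freq_cut & out$pct_unique < unique_cut
  out
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy_predictors <- Soybean[, names(Soybean) != "Class"]
  result <- degeneracy_table(soy_predictors)
  print(result[order(-result$freq_ratio), ])
}
```

Predictors that look imbalanced in the plots but are not flagged have a second level that is still reasonably populated.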
(b) Roughly 18% of the data are missing. Are there particular
predictors that are more likely to be missing? Is the pattern of missing
data related to the classes?
Missing percentage of variables:
We can calculate the percentage of data missing from each
variable.
missing_table <- Soybean %>%
summarise(across(everything(), ~ mean(is.na(.)) * 100)) %>%
pivot_longer(
cols = everything(),
names_to = "Variable",
values_to = "Missing_Percent"
)
missing_table <- missing_table %>%
arrange(desc(Missing_Percent))
missing_table
## # A tibble: 36 × 2
## Variable Missing_Percent
## <chr> <dbl>
## 1 hail 17.7
## 2 sever 17.7
## 3 seed.tmt 17.7
## 4 lodging 17.7
## 5 germ 16.4
## 6 leaf.mild 15.8
## 7 fruiting.bodies 15.5
## 8 fruit.spots 15.5
## 9 seed.discolor 15.5
## 10 shriveling 15.5
## # ℹ 26 more rows
Missing values by predictor:
Soybean %>%
summarise(across(everything(), ~ sum(is.na(.)))) %>%
pivot_longer(everything(), names_to = "Variable", values_to = "Missing") %>%
ggplot(aes(x = reorder(Variable, -Missing), y = Missing)) +
geom_col(fill = "Lightblue") +
coord_flip() +
labs(title = "Missing Values by Predictor",
x = "Predictor", y = "Number of Missing Values") +
theme(
plot.title = element_text(hjust = 0.5)
)

Missingness by predictor and class:
Soybean %>%
group_by(Class) %>%
summarise(across(everything(), ~ mean(is.na(.))), .groups = "drop") %>%
pivot_longer(-Class, names_to = "Variable", values_to = "PropMissing") %>%
ggplot(aes(x = Variable, y = Class, fill = PropMissing)) +
geom_tile() +
scale_fill_gradient(low = "blue", high = "white") +
labs(title = "Proportion Missing by Predictor and Class",
x = "Predictor", y = "Class", fill = "Proportion Missing") +
theme(
plot.title = element_text(hjust = 0.5),
axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)
)

Comment: The plot of the proportion of missing values by class and
predictor is very helpful, as it shows that the missing values occur in
only a few classes: 2-4-d-injury, phytophthora-rot, herbicide-injury,
diaporthe-pod-&-stem-blight, and cyst-nematode. This means that the
values are unlikely to be missing at random: the missingness is related
to the class.
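The same conclusion can be checked without a plot by counting incomplete samples per class. This is a minimal sketch; incomplete_by_class is a hypothetical helper name, not a function from any package.

```r
# Number of rows with at least one missing predictor value, per class level.
incomplete_by_class <- function(df, class_col) {
  incomplete <- !complete.cases(df[, names(df) != class_col, drop = FALSE])
  tapply(incomplete, df[[class_col]], sum)
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  # Share of samples with at least one missing value.
  print(mean(!complete.cases(Soybean)))
  counts <- incomplete_by_class(Soybean, "Class")
  # Keep only the classes that contain incomplete samples.
  print(counts[counts > 0])
}
```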
(c) Develop a strategy for handling missing data, either by
eliminating predictors or imputation.
Comment: Missing data were first quantified for each predictor.
Variables with more than 50% missing values would be removed due to high
information loss; since the most incomplete predictors are missing only
17.7% of their values, none is dropped by this rule. For the remaining
categorical predictors, missing values were imputed using
class-conditional mode imputation (the most frequent level within each
disease class) to preserve disease-specific structure. After imputation,
near-zero variance predictors were removed. This strategy retains all
683 samples and as much predictive information as possible while
removing predictors that carry little.
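The class-conditional mode imputation described above could be sketched as follows. This is an illustrative base R version, not the code behind the original results; the helper names are hypothetical. Falling back to the overall mode when a predictor is entirely missing within a class is an added assumption. The sketch expects factor predictors, as in Soybean.

```r
# Most frequent observed level of a factor, or NA if nothing is observed.
stat_mode <- function(x) {
  x <- x[!is.na(x)]
  if (length(x) == 0) return(NA)
  names(which.max(table(x)))
}

# Replace each missing value with the mode of that predictor within the
# sample's class; fall back to the overall mode (an assumption) when the
# class has no observed values for that predictor.
impute_mode_by_class <- function(df, class_col) {
  for (p in setdiff(names(df), class_col)) {
    overall <- stat_mode(df[[p]])
    by_class <- tapply(df[[p]], df[[class_col]], stat_mode)
    fill <- by_class[as.character(df[[class_col]])]
    fill[is.na(fill)] <- overall
    missing <- is.na(df[[p]])
    df[[p]][missing] <- fill[missing]
  }
  df
}

if (requireNamespace("mlbench", quietly = TRUE)) {
  data(Soybean, package = "mlbench")
  soy_imputed <- impute_mode_by_class(Soybean, "Class")
  # Number of missing values remaining after imputation.
  print(sum(is.na(soy_imputed)))
}
```

Because the fill values depend on Class, this approach can only be applied where the label is known, such as the training set.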
Comment (Glass data, Exercise 3.1): Al and Ca look approximately
normal. Na and RI are skewed right. Ba, Fe, and K have many 0 values
with some outliers. Mg seems to be bimodal with peaks around 0 and 3.5.
The bar plot reveals class imbalance.