library(fpp3)
library(tidyverse)
library(ggrepel)
library(seasonal)
library(fabletools)
library(mlbench)
library(corrplot)
library(AppliedPredictiveModeling)
library(GGally)
library(patchwork)
library(caret)
library(e1071)
theme_set(theme_bw())
3.1 The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe. The data can be accessed via:
library(mlbench)
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
corr_glass = cor(Glass[1:9])
corrplot(corr_glass, order = "hclust")
Glass |>
select(-Type) |>
ggpairs()
Looking at the ggpairs plot generated above, the predictor variables show a variety of distributions, which are discussed in more detail in part (b) below.
With regard to relationships between predictors, there appears to be a strong positive relationship between RI and Ca. There also appears to be a positive relationship between Ba and Al, and negative relationships between the following pairs of variables: RI and Al, RI and Si, Al and Mg, Ca and Mg, and Ba and Mg. These relationships were identified by a correlation greater than 0.4 (or less than -0.4 for the negative correlations).
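The 0.4 cutoff described above can also be applied programmatically rather than by reading the plot. The following is a minimal sketch in base R; `strong_pairs` and the `toy` data frame are hypothetical names introduced only for illustration, and the toy values are not the Glass data.

```r
# Hypothetical helper: list predictor pairs whose |correlation| exceeds a cutoff
strong_pairs <- function(df, cutoff = 0.4) {
  cm <- cor(df)
  # Look only at the upper triangle so each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = round(cm[idx], 2)
  )
  # Strongest relationships first
  out[order(-abs(out$r)), ]
}

# Toy data for illustration only (not the Glass data)
toy <- data.frame(
  a = 1:10,
  b = (1:10)^2,
  c = c(3, 1, 4, 1, 5, 9, 2, 6, 5, 3)
)
strong_pairs(toy, cutoff = 0.4)
```

Under these assumptions, a call such as `strong_pairs(Glass[1:9], 0.4)` should return the same kind of pair listing for the nine Glass predictors.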
# Helper that draws a vertical boxplot for one Glass predictor
glass_box = function(var) {
  Glass |>
    ggplot(aes(.data[[var]])) +
    geom_boxplot() +
    scale_y_continuous(labels = NULL) +
    theme(axis.ticks.x = element_blank()) +
    coord_flip() +
    labs(x = var)
}
ri_box = glass_box("RI")
na_box = glass_box("Na")
mg_box = glass_box("Mg")
al_box = glass_box("Al")
si_box = glass_box("Si")
k_box = glass_box("K")
ca_box = glass_box("Ca")
ba_box = glass_box("Ba")
fe_box = glass_box("Fe")
(ri_box + na_box + mg_box) / (al_box + si_box + k_box) / (ca_box + ba_box + fe_box)
Based on the boxplots above, all predictor variables except Mg appear to have outliers. The predictor variables with the most pronounced outliers appear to be K, Ba, and Fe.
Looking at the ggpairs plot generated above, the following predictor variables have nearly normal distributions: Na, Al, Si, and Ca. The following predictor variables appear to have a right skewed distribution: RI, K, Ba, and Fe. The remaining predictor variable, Mg, appears to have a left skewed distribution.
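The skewness judged visually above can also be quantified. The sketch below computes a simple moment-based sample skewness in base R; `sample_skewness` is a hypothetical helper name, and packaged implementations (for example `skewness()` in the e1071 package loaded earlier) may apply different small-sample corrections, so values can differ slightly.

```r
# Sample skewness: third central moment divided by the cubed standard deviation
# (uses population-style moments with no small-sample correction)
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / mean((x - m)^2)^1.5
}

right_tailed <- c(1, 1, 2, 2, 3, 10)
symmetric <- c(1, 2, 3, 4, 5)
sample_skewness(right_tailed)  # positive for a long right tail
sample_skewness(symmetric)     # zero for a symmetric sample
```

Applied to the predictors, for instance with `sapply(Glass[1:9], sample_skewness)`, positive values would indicate right skew and negative values left skew.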
Glass |>
select(-Type) |>
summarize(
across(RI:Fe, \(x) BoxCoxTrans(x)$lambda)
)
Based on the output above, which gives the approximate lambda values that would be used for a Box-Cox transformation, I would do the following:
Take the inverse square of RI
Take the log of Na
Take the square root of Al
Take the square of Si
Take the inverse of Ca
The remaining variables would not need to be transformed.
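To make the link between lambda and the transformations listed above concrete, here is a minimal base R sketch of the Box-Cox family for a fixed lambda. The helper name `box_cox` and the example vector are assumptions for illustration. The transformation is only defined for strictly positive values, so predictors containing zeros (such as Ba and Fe in the `str()` output above) cannot be transformed this way without first being shifted.

```r
# Box-Cox transform for a fixed lambda; x must be strictly positive
box_cox <- function(x, lambda) {
  stopifnot(all(x > 0))
  if (abs(lambda) < 1e-8) {
    log(x)                       # limiting case as lambda approaches 0
  } else {
    (x^lambda - 1) / lambda
  }
}

x <- c(0.5, 1, 2, 4)
box_cox(x, 0)     # same as log(x)
box_cox(x, 0.5)   # a shifted, rescaled square root: 2 * sqrt(x) - 2
box_cox(x, 2)     # a shifted, rescaled square: (x^2 - 1) / 2
box_cox(x, -1)    # a shifted, sign-flipped reciprocal: 1 - 1 / x
```

Because the shift and rescaling do not change the shape of the distribution, lambda values near -2, 0, 0.5, 2, and -1 are commonly read as inverse square, log, square root, square, and inverse, respectively.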
3.2 The soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., left spots, mold growth). The outcome labels consist of 19 distinct classes.
The data can be loaded via:
data(Soybean)
## See ?Soybean for details
Before looking into the frequency distributions, I wanted to ensure that all of the factor levels listed when using ?Soybean were accounted for in each of the variables. Below is a summary of the Soybean dataframe, which includes the number of levels for each variable.
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
When comparing this to the number of levels listed when using ?Soybean, all of the counts matched except for the predictor variable fruit.spots. The summary above shows it having only 4 levels, whereas the description of the Soybean dataset lists 5 levels for fruit.spots. To ensure this is accounted for, the Soybean dataframe was modified below to include the missing level.
Soybean = Soybean |>
mutate(fruit.spots = fct_expand(fruit.spots, "3", after = 3))
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 5 levels "0","1","2","3",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
for (v in colnames(Soybean |> select(-Class, -date))){
print(Soybean |>
ggplot(aes(x = .data[[v]])) + geom_bar() +
scale_x_discrete(drop = FALSE) + labs(x = v))
}
Based on the bar charts above, none of the predictor variables has a degenerate distribution. That is, no predictor takes only one value across the entire dataset, which would result in zero variance. Some variables do, however, show large differences in counts between levels. The table below shows the predictor variables with the greatest frequency ratios, defined here as the ratio of the highest level frequency to the lowest level frequency.
soybean_df = data.frame(variable = character(), factor_ratio = numeric(), frequency_ratio = numeric())
for (v in colnames(Soybean |> select(-Class, -date))){
soybean_count = Soybean |>
count(Soybean[,v], .drop = FALSE)
colnames(soybean_count)[1] = v
soybean_fact_ratio = n_distinct(soybean_count[,v]) / sum(soybean_count$n)
soybean_max = max(soybean_count[!is.na(soybean_count[1]),]$n)
soybean_min = min(soybean_count[!is.na(soybean_count[1]),]$n)
soybean_ratio = soybean_max / soybean_min
soybean_df = soybean_df |>
add_row(variable = v, factor_ratio = soybean_fact_ratio, frequency_ratio = soybean_ratio)
}
soybean_df |> arrange(desc(frequency_ratio))
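For comparison, a common alternative definition of the frequency ratio (the one used by `nearZeroVar()` in the caret package) divides the count of the most common value by the count of the second most common value, and pairs it with the percentage of unique values. The sketch below implements that alternative in base R. `nzv_metrics` and the `toy` factor are hypothetical, and this is not the definition used in the table above.

```r
# Hypothetical helper computing two near-zero-variance diagnostics for one vector
nzv_metrics <- function(x) {
  counts <- table(x)                 # table() excludes NA by default
  counts <- sort(counts[counts > 0], decreasing = TRUE)
  # Most common count over second most common count; Inf if only one value occurs
  freq_ratio <- if (length(counts) > 1) counts[[1]] / counts[[2]] else Inf
  c(
    freq_ratio = freq_ratio,
    pct_unique = 100 * length(counts) / length(x)
  )
}

# Toy factor: 95 zeros, 4 ones, and a single two
toy <- factor(c(rep("0", 95), rep("1", 4), "2"))
nzv_metrics(toy)
```

Comparing against the second most common value rather than the rarest one avoids infinite ratios when a level has a count of zero.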
na_count = Soybean |>
summarize(across(everything(), \(x) sum(is.na(x)))) |>
pivot_longer(cols = everything(), names_to = "variable", values_to = "count") |>
arrange(desc(count))
na_count
Based on the table above, the predictors hail,
sever, seed.tmt, and lodging have
the highest number of data points missing at 121 each, followed by
germ at 112.
Soybean |>
group_by(Class, hail) |>
count() |>
filter(is.na(hail)) |>
arrange(desc(n))
Soybean |>
group_by(Class, sever) |>
count() |>
filter(is.na(sever)) |>
arrange(desc(n))
Soybean |>
group_by(Class, seed.tmt) |>
count() |>
filter(is.na(seed.tmt)) |>
arrange(desc(n))
Soybean |>
group_by(Class, lodging) |>
count() |>
filter(is.na(lodging)) |>
arrange(desc(n))
Soybean |>
group_by(Class, germ) |>
count() |>
filter(is.na(germ)) |>
arrange(desc(n))
Looking at the counts above, the same five classes account for the missing data: phytophthora-rot, 2-4-d-injury, cyst-nematode, herbicide-injury, and diaporthe-pod-&-stem-blight. This suggests that the missing values are not random but are tied to these five classes.
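The per-predictor counts above can be condensed into a single summary: the share of rows in each class that have at least one missing predictor. The following base R sketch shows the idea on a small made-up data frame; `toy` and its columns are hypothetical stand-ins, not the Soybean data.

```r
# Toy stand-in for Soybean: an outcome column plus two predictors with gaps
toy <- data.frame(
  Class = c("a", "a", "b", "b", "b", "c"),
  p1 = c(1, 2, NA, NA, NA, 3),
  p2 = c(NA, 1, NA, 2, NA, 4)
)
predictors <- setdiff(names(toy), "Class")

# TRUE for rows with at least one missing predictor
has_na <- !complete.cases(toy[predictors])

# Share of incomplete rows within each class
na_share <- tapply(has_na, toy$Class, mean)
na_share
```

A class with a share near 1 would have missing values in almost every row, which is the kind of pattern described above.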
Before handling any missing data, I would first look at predictors whose values have a large discrepancy in frequencies. The table below filters for frequency ratios of at least 20. This, combined with the fraction of unique values to total observations being less than 1% for all variables, suggests that these predictor variables should be removed from the model.
soybean_df |>
filter(frequency_ratio >= 20) |>
arrange(desc(frequency_ratio))
After removing the above predictor variables, the correlations between the remaining predictor variables were analyzed to see if there are any relationships that could be used for imputation. The code below calculates the correlation across all predictor variables after the removal of the above predictors.
high_freq_ratio = soybean_df |>
filter(frequency_ratio >= 20) |>
arrange(desc(frequency_ratio)) |>
select(variable) |>
as.list()
na_count |>
filter(!(variable %in% high_freq_ratio$variable))
soybean_cor = Soybean |>
select(-all_of(high_freq_ratio$variable)) |>
select(-Class, -date) |>
mutate(across(everything(), as.numeric)) |>
cor(use = "na.or.complete") |>
round(2)
soybean_cor[abs(soybean_cor) <= 0.5] = NA
soybean_cor
## plant.stand precip temp hail crop.hist area.dam sever seed.tmt
## plant.stand 1 NA NA NA NA NA NA NA
## precip NA 1 NA NA NA NA NA NA
## temp NA NA 1 NA NA NA NA NA
## hail NA NA NA 1 NA NA NA NA
## crop.hist NA NA NA NA 1 NA NA NA
## area.dam NA NA NA NA NA 1 NA NA
## sever NA NA NA NA NA NA 1 NA
## seed.tmt NA NA NA NA NA NA NA 1
## germ NA NA NA NA NA NA NA NA
## plant.growth NA NA NA NA NA NA NA NA
## leaves NA NA NA NA NA NA NA NA
## leaf.halo NA NA NA NA NA NA NA NA
## leaf.marg NA NA NA NA NA NA NA NA
## leaf.size NA NA NA NA NA NA NA NA
## leaf.shread NA NA NA NA NA NA NA NA
## leaf.malf NA NA NA NA NA NA NA NA
## stem NA NA NA NA NA NA NA NA
## lodging NA NA NA NA NA NA NA NA
## stem.cankers NA NA NA NA NA NA NA NA
## canker.lesion NA NA NA NA NA NA NA NA
## fruiting.bodies NA NA NA NA NA NA NA NA
## seed NA NA NA NA NA NA NA NA
## mold.growth NA NA NA NA NA NA NA NA
## seed.discolor NA NA NA NA NA NA NA NA
## seed.size NA NA NA NA NA NA NA NA
## shriveling NA NA NA NA NA NA NA NA
## germ plant.growth leaves leaf.halo leaf.marg leaf.size
## plant.stand NA NA NA NA NA NA
## precip NA NA NA NA NA NA
## temp NA NA NA NA NA NA
## hail NA NA NA NA NA NA
## crop.hist NA NA NA NA NA NA
## area.dam NA NA NA NA NA NA
## sever NA NA NA NA NA NA
## seed.tmt NA NA NA NA NA NA
## germ 1 NA NA NA NA NA
## plant.growth NA 1 NA NA NA NA
## leaves NA NA 1 NA NA NA
## leaf.halo NA NA NA 1.00 -0.98 -0.78
## leaf.marg NA NA NA -0.98 1.00 0.82
## leaf.size NA NA NA -0.78 0.82 1.00
## leaf.shread NA NA NA NA NA NA
## leaf.malf NA NA NA NA NA NA
## stem NA NA NA NA NA 0.52
## lodging NA NA NA NA NA NA
## stem.cankers NA NA NA NA NA NA
## canker.lesion NA NA NA NA NA NA
## fruiting.bodies NA NA NA NA NA NA
## seed NA NA NA NA NA NA
## mold.growth NA NA NA NA NA NA
## seed.discolor NA NA NA NA NA NA
## seed.size NA NA NA NA NA NA
## shriveling NA NA NA NA NA NA
## leaf.shread leaf.malf stem lodging stem.cankers canker.lesion
## plant.stand NA NA NA NA NA NA
## precip NA NA NA NA NA NA
## temp NA NA NA NA NA NA
## hail NA NA NA NA NA NA
## crop.hist NA NA NA NA NA NA
## area.dam NA NA NA NA NA NA
## sever NA NA NA NA NA NA
## seed.tmt NA NA NA NA NA NA
## germ NA NA NA NA NA NA
## plant.growth NA NA NA NA NA NA
## leaves NA NA NA NA NA NA
## leaf.halo NA NA NA NA NA NA
## leaf.marg NA NA NA NA NA NA
## leaf.size NA NA 0.52 NA NA NA
## leaf.shread 1 NA NA NA NA NA
## leaf.malf NA 1 NA NA NA NA
## stem NA NA 1.00 NA 0.70 0.7
## lodging NA NA NA 1 NA NA
## stem.cankers NA NA 0.70 NA 1.00 NA
## canker.lesion NA NA 0.70 NA NA 1.0
## fruiting.bodies NA NA NA NA 0.61 NA
## seed NA NA NA NA NA NA
## mold.growth NA NA NA NA NA NA
## seed.discolor NA NA NA NA NA NA
## seed.size NA NA NA NA NA NA
## shriveling NA NA NA NA NA NA
## fruiting.bodies seed mold.growth seed.discolor seed.size
## plant.stand NA NA NA NA NA
## precip NA NA NA NA NA
## temp NA NA NA NA NA
## hail NA NA NA NA NA
## crop.hist NA NA NA NA NA
## area.dam NA NA NA NA NA
## sever NA NA NA NA NA
## seed.tmt NA NA NA NA NA
## germ NA NA NA NA NA
## plant.growth NA NA NA NA NA
## leaves NA NA NA NA NA
## leaf.halo NA NA NA NA NA
## leaf.marg NA NA NA NA NA
## leaf.size NA NA NA NA NA
## leaf.shread NA NA NA NA NA
## leaf.malf NA NA NA NA NA
## stem NA NA NA NA NA
## lodging NA NA NA NA NA
## stem.cankers 0.61 NA NA NA NA
## canker.lesion NA NA NA NA NA
## fruiting.bodies 1.00 NA NA NA NA
## seed NA 1.00 0.74 0.7 NA
## mold.growth NA 0.74 1.00 NA 0.55
## seed.discolor NA 0.70 NA 1.0 NA
## seed.size NA NA 0.55 NA 1.00
## shriveling NA NA NA NA 0.79
## shriveling
## plant.stand NA
## precip NA
## temp NA
## hail NA
## crop.hist NA
## area.dam NA
## sever NA
## seed.tmt NA
## germ NA
## plant.growth NA
## leaves NA
## leaf.halo NA
## leaf.marg NA
## leaf.size NA
## leaf.shread NA
## leaf.malf NA
## stem NA
## lodging NA
## stem.cankers NA
## canker.lesion NA
## fruiting.bodies NA
## seed NA
## mold.growth NA
## seed.discolor NA
## seed.size 0.79
## shriveling 1.00
Looking at possible relationships between predictors, several pairs have strong relationships that could be used to impute missing values. The correlation matrix above was filtered to keep only correlations with an absolute value greater than 0.5. When one predictor in a pair is missing, its value can be imputed from the other predictor if that one is observed. Based on the matrix above, the following relationships could be used for imputation:
leaf.halo & leaf.marg (-0.98)
leaf.halo & leaf.size (-0.78)
leaf.size & leaf.marg (0.82)
leaf.size & stem (0.52)
stem & stem.cankers (0.70)
stem & canker.lesion (0.70)
stem.cankers & fruiting.bodies (0.61)
seed & mold.growth (0.74)
seed & seed.discolor (0.70)
seed.size & mold.growth (0.55)
seed.size & shriveling (0.79)
For the remainder of the missing values, if computational power is a concern, then replacing any missing values with the mode of that predictor variable is a solution for imputation. If computational power is not a concern, then a more advanced method such as Multiple Imputation by Chained Equations (MICE) could be used for better accuracy within the model.
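The mode-based option mentioned above could look like the following base R sketch. `impute_mode` is a hypothetical helper, the `toy` factor is made up for illustration, and ties are resolved in favor of the first level, which is an arbitrary choice.

```r
# Replace NAs in a factor with its most frequent observed level
impute_mode <- function(x) {
  counts <- table(x)                      # NA is excluded by default
  x[is.na(x)] <- names(counts)[which.max(counts)]
  x
}

toy <- factor(c("0", "1", "1", NA, "1", NA, "0"))
imputed <- impute_mode(toy)
imputed
```

Assuming dplyr is loaded as in the rest of this document, the helper could be applied to every predictor with something like `Soybean |> mutate(across(-Class, impute_mode))`. Mode imputation ignores the relationship between missingness and class noted earlier, so it is best treated as a baseline.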