3.2
library(mlbench)
library(ggplot2)
library(corrplot)
## corrplot 0.95 loaded
library(e1071)
##
## Attaching package: 'e1071'
## The following object is masked from 'package:ggplot2':
##
## element
library(tidyr)
library(caret)
## Loading required package: lattice
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
data(Glass)
str(Glass)
## 'data.frame': 214 obs. of 10 variables:
## $ RI : num 1.52 1.52 1.52 1.52 1.52 ...
## $ Na : num 13.6 13.9 13.5 13.2 13.3 ...
## $ Mg : num 4.49 3.6 3.55 3.69 3.62 3.61 3.6 3.61 3.58 3.6 ...
## $ Al : num 1.1 1.36 1.54 1.29 1.24 1.62 1.14 1.05 1.37 1.36 ...
## $ Si : num 71.8 72.7 73 72.6 73.1 ...
## $ K : num 0.06 0.48 0.39 0.57 0.55 0.64 0.58 0.57 0.56 0.57 ...
## $ Ca : num 8.75 7.83 7.78 8.22 8.07 8.07 8.17 8.24 8.3 8.4 ...
## $ Ba : num 0 0 0 0 0 0 0 0 0 0 ...
## $ Fe : num 0 0 0 0 0 0.26 0 0 0 0.11 ...
## $ Type: Factor w/ 6 levels "1","2","3","5",..: 1 1 1 1 1 1 1 1 1 1 ...
#(a)
#Looking at the distributions
#RI - right skewed, has outliers
Glass |>
ggplot(aes(x = RI)) + geom_histogram(bins = 15)
#Na - closer to normal, more centered but still has a right tail
Glass |>
ggplot(aes(x = Na)) + geom_histogram(bins = 15)
#Mg - extreme outliers, left-skewed, but different from the others (bimodal)
Glass |>
ggplot(aes(x = Mg)) + geom_histogram(bins = 15)
#Al - still has some outliers but closer to normal, if a bit right-skewed
Glass |>
ggplot(aes(x = Al)) + geom_histogram(bins = 10)
#Si - closest to normal distribution so far
Glass |>
ggplot(aes(x = Si)) + geom_histogram(bins = 10)
#K - extreme right-skewness, outliers
Glass |>
ggplot(aes(x = K)) + geom_histogram(bins = 15)
#Ca - skewed to the right, a long tail of outliers on the right
Glass |>
ggplot(aes(x = Ca)) + geom_histogram(bins = 15)
#Ba - extreme right-skewness, outliers
Glass |>
ggplot(aes(x = Ba)) + geom_histogram(bins = 15)
#Fe - extreme right-skewness, outliers
Glass |>
ggplot(aes(x = Fe)) + geom_histogram(bins = 15)
#Correlation matrix
glass_corr <- cor(Glass[, -10])
corrplot(glass_corr, method = "color", addCoef.col = "black")
My observations on the distributions are in the comments of the code
chunk. As for the correlations, the strongest one is the positive
correlation between RI and Ca. Apart from that pair and a weaker
positive relationship between Al and Ba, there are few notable positive
correlations. There are also several negative correlations, which
indicate inverse relationships, for example between Si and RI, Mg and
Al, and Ba and Mg.
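To make the reading of the correlation plot less dependent on eyeballing colors, the strongest pairs can also be listed as a table. The following is a minimal sketch in base R; the helper name `top_correlations` and the 0.5 cutoff are illustrative assumptions, and the toy data frame stands in for the numeric predictors.

```r
# Sketch: list the predictor pairs whose absolute correlation exceeds a cutoff.
# `top_correlations` and the default cutoff of 0.5 are hypothetical choices.
top_correlations <- function(df, cutoff = 0.5) {
  cm <- cor(df)
  # Look only at the upper triangle so that each pair is reported once
  idx <- which(upper.tri(cm) & abs(cm) > cutoff, arr.ind = TRUE)
  out <- data.frame(
    var1 = rownames(cm)[idx[, "row"]],
    var2 = colnames(cm)[idx[, "col"]],
    r = cm[idx],
    stringsAsFactors = FALSE
  )
  # Strongest relationships first, regardless of sign
  out[order(-abs(out$r)), , drop = FALSE]
}

# Toy data: b tracks a, c moves opposite to a, d is unrelated noise
set.seed(1)
a <- rnorm(200)
toy <- data.frame(
  a = a,
  b = a + rnorm(200, sd = 0.3),
  c = -a + rnorm(200, sd = 0.3),
  d = rnorm(200)
)
pairs_found <- top_correlations(toy)
pairs_found
```

With the packages from this report loaded, the same helper could be applied as `top_correlations(Glass[, -10])`.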
#(b)
#Getting the skewness
apply(Glass[, -10], 2, skewness)
## RI Na Mg Al Si K Ca
## 1.6027151 0.4478343 -1.1364523 0.8946104 -0.7202392 6.4600889 2.0184463
## Ba Fe
## 3.3686800 1.7298107
#Visualize outliers with boxplots - better for outliers than histograms
Glass |>
pivot_longer(-Type, names_to = "predictor", values_to = "value") |>
ggplot(aes(x = predictor, y = value)) +
geom_boxplot() +
facet_wrap(~ predictor, scales = "free")
Yes, there are outliers in almost every predictor, and Ba and K are the
most extreme. This is a good example of why one should use more than
one type of visualization: in the histograms, Mg appeared to have
outliers, while the boxplot shows that it does not really have them.
Instead, its values form two clusters, which makes the distribution
bimodal. More detailed descriptions of each distribution are already
given in part (a).
As for the skewness, the textbook advises that the skewness value will be close to 0 if the distribution is roughly symmetric, positive if it is right-skewed, and negative if it is left-skewed. Based on this, Na has the least skewed distribution despite its outliers, so the other predictors could potentially benefit from a transformation, depending on what the Box-Cox estimates show.
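The sign convention described above can be checked by computing the statistic by hand. This is a minimal base R sketch of one common definition of sample skewness (third central moment divided by the cubed standard deviation). The function name `sample_skewness` is hypothetical, and `e1071::skewness` offers several variants through its `type` argument, so small numerical differences from the output above are possible.

```r
# Sketch: sample skewness as the mean cubed deviation over the cubed
# standard deviation. `sample_skewness` is an illustrative helper name.
sample_skewness <- function(x) {
  x <- x[!is.na(x)]
  centered <- x - mean(x)
  mean(centered^3) / sd(x)^3
}

symmetric <- c(-2, -1, 0, 1, 2)   # mirror image around 0
right_tail <- c(1, 1, 1, 2, 10)   # one large value on the right

sample_skewness(symmetric)     # zero for a perfectly symmetric sample
sample_skewness(right_tail)    # positive: long right tail
sample_skewness(-right_tail)   # negative: same shape, mirrored to the left
```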
#(c)
lambda_K <- BoxCoxTrans(Glass$K)
lambda_Ca <- BoxCoxTrans(Glass$Ca)
lambda_RI <- BoxCoxTrans(Glass$RI)
lambda_Al <- BoxCoxTrans(Glass$Al)
lambda_Ba <- BoxCoxTrans(Glass$Ba)
lambda_Fe <- BoxCoxTrans(Glass$Fe)
lambda_Mg <- BoxCoxTrans(Glass$Mg)
lambda_Si <- BoxCoxTrans(Glass$Si)
lambda_Na <- BoxCoxTrans(Glass$Na)
print("Box-Cox results for K")
## [1] "Box-Cox results for K"
lambda_K
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.0000 0.1225 0.5550 0.4971 0.6100 6.2100
##
## Lambda could not be estimated; no transformation is applied
print("Box-Cox results for Ca")
## [1] "Box-Cox results for Ca"
lambda_Ca
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 5.430 8.240 8.600 8.957 9.172 16.190
##
## Largest/Smallest: 2.98
## Sample Skewness: 2.02
##
## Estimated Lambda: -1.1
print("Box-Cox results for RI")
## [1] "Box-Cox results for RI"
lambda_RI
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.511 1.517 1.518 1.518 1.519 1.534
##
## Largest/Smallest: 1.02
## Sample Skewness: 1.6
##
## Estimated Lambda: -2
print("Box-Cox results for Al")
## [1] "Box-Cox results for Al"
lambda_Al
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.290 1.190 1.360 1.445 1.630 3.500
##
## Largest/Smallest: 12.1
## Sample Skewness: 0.895
##
## Estimated Lambda: 0.5
print("Box-Cox results for Ba")
## [1] "Box-Cox results for Ba"
lambda_Ba
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 0.000 0.000 0.175 0.000 3.150
##
## Lambda could not be estimated; no transformation is applied
print("Box-Cox results for Fe")
## [1] "Box-Cox results for Fe"
lambda_Fe
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00000 0.00000 0.00000 0.05701 0.10000 0.51000
##
## Lambda could not be estimated; no transformation is applied
print("Box-Cox results for Mg")
## [1] "Box-Cox results for Mg"
lambda_Mg
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 2.115 3.480 2.685 3.600 4.490
##
## Lambda could not be estimated; no transformation is applied
print("Box-Cox results for Si")
## [1] "Box-Cox results for Si"
lambda_Si
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 69.81 72.28 72.79 72.65 73.09 75.41
##
## Largest/Smallest: 1.08
## Sample Skewness: -0.72
##
## Estimated Lambda: 2
print("Box-Cox results for Na")
## [1] "Box-Cox results for Na"
lambda_Na
## Box-Cox Transformation
##
## 214 data points used to estimate Lambda
##
## Input data summary:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10.73 12.91 13.30 13.41 13.82 17.38
##
## Largest/Smallest: 1.62
## Sample Skewness: 0.448
##
## Estimated Lambda: -0.1
## With fudge factor, Lambda = 0 will be used for transformations
Based on these results, Ca has a lambda of -1.1, which is close to -1, so an inverse transformation can be used. RI has a lambda of -2, which corresponds to an inverse square transformation, but its largest/smallest ratio of 1.02 is nowhere close to 20, so it is probably not a good candidate for a transformation. Al has a lambda of 0.5, meaning a square root transformation can be used. K, Ba, Fe, and Mg could not be transformed using Box-Cox because they contain zero values, and Box-Cox requires strictly positive data. For these predictors, the spatial sign transformation discussed in the textbook may be a more appropriate way to reduce the influence of extreme values. Si has a lambda of 2, which suggests a square transformation, though its skewness of -0.72 is mild and its largest/smallest ratio of 1.08 is far below 20, making a transformation less critical. Na has an estimated lambda of -0.1, which is rounded to 0 and indicates a log transformation, but because it shows little skewness (0.448), a transformation is not really needed.
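To make the link between lambda and the named transformations concrete, here is a minimal base R sketch of the Box-Cox formula itself. The helper `box_cox` is a hypothetical name written for illustration. It is not how caret applies the transformation internally, and in practice the fitted objects above would normally be applied with `predict()`, for example `predict(lambda_Al, Glass$Al)`.

```r
# Sketch of the Box-Cox family: (x^lambda - 1) / lambda, with log(x) at lambda = 0.
# `box_cox` is an illustrative helper, not a caret function.
box_cox <- function(x, lambda) {
  if (any(x <= 0, na.rm = TRUE)) {
    stop("Box-Cox requires strictly positive values")
  }
  if (abs(lambda) < 1e-8) log(x) else (x^lambda - 1) / lambda
}

x <- c(1, 4, 9)
box_cox(x, 0.5)   # square-root-like: 2 * (sqrt(x) - 1)
box_cox(x, -1)    # inverse-like: 1 - 1 / x
box_cox(x, 0)     # log transformation

# A predictor containing zeros is rejected, which mirrors why lambda could
# not be estimated for K, Ba, Fe, and Mg.
zero_result <- tryCatch(box_cox(c(0, 0.5, 6.21), 0.5),
                        error = function(e) conditionMessage(e))
zero_result
```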
data(Soybean)
str(Soybean)
## 'data.frame': 683 obs. of 36 variables:
## $ Class : Factor w/ 19 levels "2-4-d-injury",..: 11 11 11 11 11 11 11 11 11 11 ...
## $ date : Factor w/ 7 levels "0","1","2","3",..: 7 5 4 4 7 6 6 5 7 5 ...
## $ plant.stand : Ord.factor w/ 2 levels "0"<"1": 1 1 1 1 1 1 1 1 1 1 ...
## $ precip : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ temp : Ord.factor w/ 3 levels "0"<"1"<"2": 2 2 2 2 2 2 2 2 2 2 ...
## $ hail : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 2 1 1 ...
## $ crop.hist : Factor w/ 4 levels "0","1","2","3": 2 3 2 2 3 4 3 2 4 3 ...
## $ area.dam : Factor w/ 4 levels "0","1","2","3": 2 1 1 1 1 1 1 1 1 1 ...
## $ sever : Factor w/ 3 levels "0","1","2": 2 3 3 3 2 2 2 2 2 3 ...
## $ seed.tmt : Factor w/ 3 levels "0","1","2": 1 2 2 1 1 1 2 1 2 1 ...
## $ germ : Ord.factor w/ 3 levels "0"<"1"<"2": 1 2 3 2 3 2 1 3 2 3 ...
## $ plant.growth : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaves : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ leaf.halo : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.marg : Factor w/ 3 levels "0","1","2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.size : Ord.factor w/ 3 levels "0"<"1"<"2": 3 3 3 3 3 3 3 3 3 3 ...
## $ leaf.shread : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.malf : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ leaf.mild : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ stem : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ lodging : Factor w/ 2 levels "0","1": 2 1 1 1 1 1 2 1 1 1 ...
## $ stem.cankers : Factor w/ 4 levels "0","1","2","3": 4 4 4 4 4 4 4 4 4 4 ...
## $ canker.lesion : Factor w/ 4 levels "0","1","2","3": 2 2 1 1 2 1 2 2 2 2 ...
## $ fruiting.bodies: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## $ ext.decay : Factor w/ 3 levels "0","1","2": 2 2 2 2 2 2 2 2 2 2 ...
## $ mycelium : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ int.discolor : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
## $ sclerotia : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.pods : Factor w/ 4 levels "0","1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ fruit.spots : Factor w/ 4 levels "0","1","2","4": 4 4 4 4 4 4 4 4 4 4 ...
## $ seed : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ mold.growth : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.discolor : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ seed.size : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ shriveling : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
## $ roots : Factor w/ 3 levels "0","1","2": 1 1 1 1 1 1 1 1 1 1 ...
#(a)
nearZeroVar(Soybean, saveMetrics = TRUE)
## freqRatio percentUnique zeroVar nzv
## Class 1.010989 2.7818448 FALSE FALSE
## date 1.137405 1.0248902 FALSE FALSE
## plant.stand 1.208191 0.2928258 FALSE FALSE
## precip 4.098214 0.4392387 FALSE FALSE
## temp 1.879397 0.4392387 FALSE FALSE
## hail 3.425197 0.2928258 FALSE FALSE
## crop.hist 1.004587 0.5856515 FALSE FALSE
## area.dam 1.213904 0.5856515 FALSE FALSE
## sever 1.651282 0.4392387 FALSE FALSE
## seed.tmt 1.373874 0.4392387 FALSE FALSE
## germ 1.103627 0.4392387 FALSE FALSE
## plant.growth 1.951327 0.2928258 FALSE FALSE
## leaves 7.870130 0.2928258 FALSE FALSE
## leaf.halo 1.547511 0.4392387 FALSE FALSE
## leaf.marg 1.615385 0.4392387 FALSE FALSE
## leaf.size 1.479638 0.4392387 FALSE FALSE
## leaf.shread 5.072917 0.2928258 FALSE FALSE
## leaf.malf 12.311111 0.2928258 FALSE FALSE
## leaf.mild 26.750000 0.4392387 FALSE TRUE
## stem 1.253378 0.2928258 FALSE FALSE
## lodging 12.380952 0.2928258 FALSE FALSE
## stem.cankers 1.984293 0.5856515 FALSE FALSE
## canker.lesion 1.807910 0.5856515 FALSE FALSE
## fruiting.bodies 4.548077 0.2928258 FALSE FALSE
## ext.decay 3.681481 0.4392387 FALSE FALSE
## mycelium 106.500000 0.2928258 FALSE TRUE
## int.discolor 13.204545 0.4392387 FALSE FALSE
## sclerotia 31.250000 0.2928258 FALSE TRUE
## fruit.pods 3.130769 0.5856515 FALSE FALSE
## fruit.spots 3.450000 0.5856515 FALSE FALSE
## seed 4.139130 0.2928258 FALSE FALSE
## mold.growth 7.820896 0.2928258 FALSE FALSE
## seed.discolor 8.015625 0.2928258 FALSE FALSE
## seed.size 9.016949 0.2928258 FALSE FALSE
## shriveling 14.184211 0.2928258 FALSE FALSE
## roots 6.406977 0.4392387 FALSE FALSE
Yes, some predictors have degenerate distributions, and nearZeroVar flags three of them. leaf.mild has a frequency ratio of 26.75, meaning the most common value appears 26.75 times as often as the second most common. mycelium has a frequency ratio of 106.5, which is extremely degenerate because one value dominates almost entirely. sclerotia has a frequency ratio of 31.25.
These near-zero variance predictors are problematic because one category overwhelmingly dominates, meaning they are unlikely to help the model.
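The two metrics in the table can be reproduced by hand, which helps explain why only three predictors are flagged. The sketch below is a base R approximation of the rule, assuming caret's documented defaults of `freqCut = 95/5` and `uniqueCut = 10`; the helper `nzv_metrics` is hypothetical. The toy vector is reconstructed from the numbers reported for mycelium (a ratio of 106.5 with 38 of 683 values missing implies 639 versus 6 observed values).

```r
# Sketch: frequency ratio and percent unique for a single predictor.
# `nzv_metrics` is an illustrative helper; the cutoffs mirror caret's defaults.
nzv_metrics <- function(x, freq_cut = 95 / 5, unique_cut = 10) {
  counts <- sort(table(x), decreasing = TRUE)  # table() ignores NA by default
  counts <- counts[counts > 0]                 # drop unused factor levels
  freq_ratio <- if (length(counts) >= 2) counts[[1]] / counts[[2]] else 0
  pct_unique <- 100 * length(counts) / length(x)
  zero_var <- length(counts) <= 1
  data.frame(
    freq_ratio = freq_ratio,
    pct_unique = pct_unique,
    zero_var = zero_var,
    nzv = (freq_ratio > freq_cut && pct_unique <= unique_cut) || zero_var
  )
}

# Counts reconstructed from the reported mycelium metrics (an assumption)
mycelium_like <- factor(c(rep("0", 639), rep("1", 6), rep(NA, 38)))
m <- nzv_metrics(mycelium_like)
m

# A balanced two-level predictor is not flagged
balanced <- factor(rep(c("0", "1"), each = 50))
b <- nzv_metrics(balanced)
b
```

A predictor is flagged only when both conditions hold, which is why columns such as shriveling (ratio 14.18) stay below the frequency cutoff even though they have few unique values.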
#(b)
#Which predictors have the most missing values
missing_by_pred <- colSums(is.na(Soybean))
missing_by_pred[missing_by_pred > 0]
## date plant.stand precip temp hail
## 1 36 38 30 121
## crop.hist area.dam sever seed.tmt germ
## 16 1 121 121 112
## plant.growth leaf.halo leaf.marg leaf.size leaf.shread
## 16 84 84 84 100
## leaf.malf leaf.mild stem lodging stem.cankers
## 84 108 16 121 38
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor
## 38 106 38 38 38
## sclerotia fruit.pods fruit.spots seed mold.growth
## 38 84 106 92 92
## seed.discolor seed.size shriveling roots
## 106 92 106 31
#Is missingness related to class
missing_by_class <- aggregate(is.na(Soybean[, -1]),
by = list(Class = Soybean$Class),
FUN = sum)
missing_by_class
## Class date plant.stand precip temp hail crop.hist
## 1 2-4-d-injury 1 16 16 16 16 16
## 2 alternarialeaf-spot 0 0 0 0 0 0
## 3 anthracnose 0 0 0 0 0 0
## 4 bacterial-blight 0 0 0 0 0 0
## 5 bacterial-pustule 0 0 0 0 0 0
## 6 brown-spot 0 0 0 0 0 0
## 7 brown-stem-rot 0 0 0 0 0 0
## 8 charcoal-rot 0 0 0 0 0 0
## 9 cyst-nematode 0 14 14 14 14 0
## 10 diaporthe-pod-&-stem-blight 0 6 0 0 15 0
## 11 diaporthe-stem-canker 0 0 0 0 0 0
## 12 downy-mildew 0 0 0 0 0 0
## 13 frog-eye-leaf-spot 0 0 0 0 0 0
## 14 herbicide-injury 0 0 8 0 8 0
## 15 phyllosticta-leaf-spot 0 0 0 0 0 0
## 16 phytophthora-rot 0 0 0 0 68 0
## 17 powdery-mildew 0 0 0 0 0 0
## 18 purple-seed-stain 0 0 0 0 0 0
## 19 rhizoctonia-root-rot 0 0 0 0 0 0
## area.dam sever seed.tmt germ plant.growth leaves leaf.halo leaf.marg
## 1 1 16 16 16 16 0 0 0
## 2 0 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0 0
## 9 0 14 14 14 0 0 14 14
## 10 0 15 15 6 0 0 15 15
## 11 0 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0 0
## 14 0 8 8 8 0 0 0 0
## 15 0 0 0 0 0 0 0 0
## 16 0 68 68 68 0 0 55 55
## 17 0 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0 0
## leaf.size leaf.shread leaf.malf leaf.mild stem lodging stem.cankers
## 1 0 16 0 16 16 16 16
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 14 14 14 14 0 14 14
## 10 15 15 15 15 0 15 0
## 11 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0
## 14 0 0 0 8 0 8 8
## 15 0 0 0 0 0 0 0
## 16 55 55 55 55 0 68 0
## 17 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0
## canker.lesion fruiting.bodies ext.decay mycelium int.discolor sclerotia
## 1 16 16 16 16 16 16
## 2 0 0 0 0 0 0
## 3 0 0 0 0 0 0
## 4 0 0 0 0 0 0
## 5 0 0 0 0 0 0
## 6 0 0 0 0 0 0
## 7 0 0 0 0 0 0
## 8 0 0 0 0 0 0
## 9 14 14 14 14 14 14
## 10 0 0 0 0 0 0
## 11 0 0 0 0 0 0
## 12 0 0 0 0 0 0
## 13 0 0 0 0 0 0
## 14 8 8 8 8 8 8
## 15 0 0 0 0 0 0
## 16 0 68 0 0 0 0
## 17 0 0 0 0 0 0
## 18 0 0 0 0 0 0
## 19 0 0 0 0 0 0
## fruit.pods fruit.spots seed mold.growth seed.discolor seed.size shriveling
## 1 16 16 16 16 16 16 16
## 2 0 0 0 0 0 0 0
## 3 0 0 0 0 0 0 0
## 4 0 0 0 0 0 0 0
## 5 0 0 0 0 0 0 0
## 6 0 0 0 0 0 0 0
## 7 0 0 0 0 0 0 0
## 8 0 0 0 0 0 0 0
## 9 0 14 0 0 14 0 14
## 10 0 0 0 0 0 0 0
## 11 0 0 0 0 0 0 0
## 12 0 0 0 0 0 0 0
## 13 0 0 0 0 0 0 0
## 14 0 8 8 8 8 8 8
## 15 0 0 0 0 0 0 0
## 16 68 68 68 68 68 68 68
## 17 0 0 0 0 0 0 0
## 18 0 0 0 0 0 0 0
## 19 0 0 0 0 0 0 0
## roots
## 1 16
## 2 0
## 3 0
## 4 0
## 5 0
## 6 0
## 7 0
## 8 0
## 9 0
## 10 15
## 11 0
## 12 0
## 13 0
## 14 0
## 15 0
## 16 0
## 17 0
## 18 0
## 19 0
Soybean |>
mutate(across(everything(), is.na)) |>
pivot_longer(everything(), names_to = "variables", values_to = "missing") |>
count(variables, missing) |>
ggplot(aes(y = variables, x = n, fill = missing)) +
geom_col(position = "fill") +
labs(title = "Proportion of missing Values", x = "Proportion") +
scale_fill_manual(values = c("grey", "black"))
Soybean |>
mutate(total_missing = rowSums(is.na(Soybean))) |>
group_by(Class) |>
summarise(missing = sum(total_missing)) |>
ggplot(aes(y = reorder(Class, missing), x = missing)) +
geom_col() +
labs(title = "Total Missing Values by Class", x = "Total Missing", y = "Class")
Yes, certain predictors are more likely to be missing than others. The predictors hail, sever, seed.tmt, and lodging each have 121 missing values, which suggests they tend to be missing together rather than independently. Similarly, precip, stem.cankers, canker.lesion, ext.decay, mycelium, int.discolor, and sclerotia each have exactly 38 missing values, pointing to another cluster of co-missing predictors.
These missing values are clearly related to specific classes. Only five classes account for all of the missing data: phytophthora-rot, 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, and herbicide-injury. The remaining classes have no missing values whatsoever. This is informative missingness as described in the textbook: the pattern of missing data is tied to the outcome rather than occurring at random, which is the most problematic kind of missing data to deal with.
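A compact way to check how concentrated the missingness is would be to compute, for each class, the share of rows that have at least one missing predictor. The following is a minimal base R sketch on toy data; the helper `incomplete_by_class` and the toy column names are hypothetical, and the function could be applied to the real data as `incomplete_by_class(Soybean, "Class")`.

```r
# Sketch: share of rows per class with at least one missing predictor.
# `incomplete_by_class` is an illustrative helper name.
incomplete_by_class <- function(df, class_col) {
  predictors <- df[, setdiff(names(df), class_col), drop = FALSE]
  incomplete <- !complete.cases(predictors)
  cls <- factor(df[[class_col]])
  n_rows <- as.vector(table(cls))
  n_incomplete <- as.vector(tapply(incomplete, cls, sum))
  data.frame(
    class = levels(cls),
    n_rows = n_rows,
    n_incomplete = n_incomplete,
    share_incomplete = n_incomplete / n_rows
  )
}

# Toy data: class A is complete, class B is always incomplete,
# class C has one incomplete row out of four
toy <- data.frame(
  Class = factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C")),
  p1 = c(1, 2, 3, NA, NA, 1, 2, 3, 4),
  p2 = c(1, 2, 3, 4, NA, NA, 2, 3, 4)
)
by_class <- incomplete_by_class(toy, "Class")
by_class
```

A share of 1 for a class means that dropping incomplete rows would remove that class from the data entirely.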
Because the missingness is informative (missing not at random, MNAR), imputation would be problematic here: filling in the missing values with estimates from other predictors ignores the fact that the missingness itself carries information. One option is to remove the samples with missing data if the goal is a clean dataset, since those five classes are the only source of missingness. However, this comes at a big cost, because it removes all or most of the observations from those five classes and leaves the model unable to predict any class that disappears entirely. Alternatively, the predictors with high missingness concentrated in those classes could be removed instead, which is less destructive than losing entire classes, though it still risks discarding useful information. Finally, if it is important to keep all the classes, the missing values could be treated as a separate category. For example, coding them as an additional level such as “unknown” keeps the informative nature of the missingness rather than obscuring it through imputation.
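The last option, treating missing values as their own category, can be sketched in base R as follows. The helper names `na_as_level` and `recode_missing` are hypothetical, and the toy data frame only imitates the structure of the Soybean predictors. Note that this sketch turns ordered factors into unordered ones, which is a simplifying assumption because "unknown" has no natural position in the ordering.

```r
# Sketch: recode NA in factor predictors as an explicit "unknown" level.
# `na_as_level` and `recode_missing` are illustrative helper names.
na_as_level <- function(f, label = "unknown") {
  stopifnot(is.factor(f))
  values <- as.character(f)
  values[is.na(values)] <- label
  # The result is an unordered factor, even if `f` was ordered
  factor(values, levels = unique(c(levels(f), label)))
}

recode_missing <- function(df, skip = character()) {
  is_fac <- vapply(df, is.factor, logical(1))
  cols <- setdiff(names(df)[is_fac], skip)
  # Every recoded column gets the extra level, even if it has no NA,
  # so that the level sets stay consistent across predictors
  df[cols] <- lapply(df[cols], na_as_level)
  df
}

# Toy data imitating two predictors and the outcome
toy <- data.frame(
  Class = factor(c("a", "a", "b", "b")),
  hail = factor(c("0", "1", NA, NA)),
  germ = factor(c("0", "2", "1", NA), levels = c("0", "1", "2"), ordered = TRUE)
)
recoded <- recode_missing(toy, skip = "Class")
str(recoded)
table(recoded$hail, recoded$Class)
```

On the real data this would be called as `recode_missing(Soybean, skip = "Class")`, leaving the outcome column untouched.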