#library(fpp3)
#library(seasonal)
library(magrittr)
library(tidyverse)
library(corrplot)
library(patchwork)
library(forecast) # for boxcox
library(caret) # for spatial sign transformation
library(mlbench)
library(dlookr)
library(naniar) #missingness
library(flextable)
From Applied Predictive Modeling (Kuhn and Johnson, 2016), Exercise 3.1: The UC Irvine Machine Learning Repository contains a data set related to glass identification. The data consist of 214 glass samples labeled as one of seven class categories. There are nine predictors, including the refractive index and percentages of eight elements: Na, Mg, Al, Si, K, Ca, Ba, and Fe.
# read in glass data
data(Glass)
# save in working file
glass <- Glass
# review data
glass %>%
  head(3) %>%
  flextable() %>%
  set_caption('Glass Dataset: Subset of Observations')
RI | Na | Mg | Al | Si | K | Ca | Ba | Fe | Type |
1.52101 | 13.64 | 4.49 | 1.10 | 71.78 | 0.06 | 8.75 | 0 | 0 | 1 |
1.51761 | 13.89 | 3.60 | 1.36 | 72.73 | 0.48 | 7.83 | 0 | 0 | 1 |
1.51618 | 13.53 | 3.55 | 1.54 | 72.99 | 0.39 | 7.78 | 0 | 0 | 1 |
3.1a. Using visualizations, explore the predictor variables to understand their distributions as well as the relationships between predictors.
# plot distributions using dlookr
glass %>% plot_normality()
Pairwise comparisons are used to evaluate the relationships between predictors.
The following two figures display pairwise correlations (p < .05) for covariates with and without zero-value observations, respectively. A zero (or negative) value for a chemical concentration generally indicates that the measurement lies below the instrument's detection limit. It is interesting to note that, absent such observations, the pairwise correlations are higher across a range of covariates (second figure). In either case, Ba and Ca show a high positive correlation and Ba and Si show a high negative correlation.
An argument can be made for dropping either Ba or Ca for modeling purposes, given that they account for similar variance. Because Ba also correlates with Si, it may be prudent to retain Ca and drop Ba.
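One way to operationalize that choice is caret's findCorrelation(), which flags predictors whose pairwise correlations exceed a cutoff. A minimal sketch, assuming an illustrative cutoff of 0.75 rather than a tuned value:
# sketch: flag predictors to consider dropping based on pairwise correlation
# (the 0.75 cutoff is illustrative, not a tuned value)
cor_mat  <- cor(glass %>% select(-Type))
drop_set <- findCorrelation(cor_mat, cutoff = 0.75, names = TRUE)
drop_set  # names of candidate predictors to drop before modeling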
#create correlation matrix that includes all numerical values
glass_cor <- glass %>%
  select(-Type) %>%
  cor()
#set colors
col <- colorRampPalette(c("#BB4444", "#EE9988", "#FFFFFF", "#77AADD", "#4477AA"))
# plot pairwise correlations with zero values included
corrplot(glass_cor, method = "shade",
shade.col = NA,
diag=FALSE,
type='upper',
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
cl.pos = "n",
order = "hclust",
col = col(200),
title ='Glass: Pairwise Correlations of Predictors - Zero Values Included' ,
mar=c(0,0,1,0))
#plot pairwise correlations removing the zeros
less_cor <- glass %>%
  select(-Type) %>%
  filter_if(is.numeric, all_vars((.) != 0)) %>%
  cor()
corrplot(less_cor, method = "shade",
shade.col = NA,
diag=FALSE,
type='upper',
tl.col = "black",
tl.srt = 45,
addCoef.col = "black",
cl.pos = "n",
order = "hclust",
col = col(200),
title ='Glass: Pairwise Correlations of Predictors - Zero Values Removed' ,
mar=c(0,0,1,0))
3.1b. Do there appear to be any outliers in the data? Are any predictors skewed?
Yes. The covariates with skewed distributions include RI, Mg, K, Ca, Ba, and Fe. Of these, Ba and Ca also have the highest proportion of outliers (approximately 18% and 12%, respectively). The following table and plots provide a summary of these measures.
# Identify predictors that are highly skewed
skew <- glass %>% find_skewness(index = FALSE, thres = TRUE)
print(paste(paste(skew, collapse = " "), "display skewed distributions."))
## [1] "RI Mg K Ca Ba Fe display skewed distributions."
# identify outliers -- above Q3 + 1.5*IQR or below Q1 - 1.5*IQR
diagnose_outlier(glass) %>%
  arrange(desc(outliers_cnt)) %>%
  mutate_if(is.numeric, round, digits = 3) %>%
  flextable() %>%
  set_caption("Glass: Outlier Statistics")
variables | outliers_cnt | outliers_ratio | outliers_mean | with_mean | without_mean |
Ba | 38 | 17.757 | 0.986 | 0.175 | 0.000 |
Ca | 26 | 12.150 | 11.173 | 8.957 | 8.651 |
Al | 18 | 8.411 | 2.088 | 1.445 | 1.386 |
RI | 17 | 7.944 | 1.524 | 1.518 | 1.518 |
Si | 12 | 5.607 | 71.824 | 72.651 | 72.700 |
Fe | 12 | 5.607 | 0.324 | 0.057 | 0.041 |
Na | 7 | 3.271 | 12.661 | 13.408 | 13.433 |
K | 7 | 3.271 | 3.061 | 0.497 | 0.410 |
Mg | 0 | 0.000 | NaN | 2.685 | 2.685 |
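For reference, these counts can be checked by hand with the Tukey fence rule noted in the code comment above (values below Q1 - 1.5*IQR or above Q3 + 1.5*IQR). A minimal base-R sketch for a single predictor such as Ba, which should roughly match the Ba count in the table assuming dlookr applies the same fences:
# sketch: Tukey fences for one predictor (Ba used as an example)
q     <- quantile(glass$Ba, probs = c(0.25, 0.75))
fence <- 1.5 * IQR(glass$Ba)
sum(glass$Ba < q[1] - fence | glass$Ba > q[2] + fence)  # count of flagged Ba observations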
# assess change to distributions based on outlier removal
glass %>%
  select(find_outliers(glass, index = FALSE)) %>%
  plot_outlier()
3.1c. Are there any relevant transformations of one or more predictors that might improve the classification model?
Given that few of the covariates have near-normal distributions, we can use a Box-Cox estimate of lambda to select an appropriate transformation for each. The table below lists the estimated lambda for each covariate, which suggests a 'best-in-class' transformation (e.g., a log transform for lambda near 0, an inverse for lambda near -1).
A spatialSign transformation may also be appropriate for covariates that have a significant number of outliers (particularly if they are influential observations). In the Glass dataset, the covariates Ca and Ba are good candidates for the spatialSign transform. The plots below compare untransformed, spatialSign-transformed, and lambda-transformed distributions for these covariates. The choice of one transformation over another may be best evaluated with other model diagnostics and measures of fit.
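If a single pipeline is preferred, the estimated power transformations can also be applied in one step with caret's preProcess(). A minimal sketch; the Yeo-Johnson variant is used here as a stand-in for Box-Cox because several predictors (e.g., Ba, Fe) contain zeros, which Box-Cox cannot handle directly:
# sketch: estimate and apply power transforms, centering, and scaling with caret
# (YeoJohnson chosen because several predictors contain zero values)
pp          <- preProcess(glass %>% select(-Type),
                          method = c("YeoJohnson", "center", "scale"))
glass_trans <- predict(pp, glass %>% select(-Type))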
#using boxcox from forecast pkg (we could use MASS as alternative)
g    <- glass %>% select(-Type)
type <- glass %>% select(Type)

g %>%
  map(BoxCox.lambda) %>%
  as.data.frame() %>%
  flextable() %>%
  set_caption('Box-Cox Lambdas for Glass Predictors')
RI | Na | Mg | Al | Si | K | Ca | Ba | Fe |
-0.9999242 | -0.7425332 | 0.9999895 | 0.3616362 | -0.9999242 | 0.0634853 | -0.3850341 | 0.08844884 | 0.1314834 |
# Compare transformations on Ca and Ba distributions
spatial   <- preProcess(glass[, -10], method = c('center', 'scale'))
spatial_2 <- predict(spatial, glass[, -10])
spatial_3 <- spatialSign(spatial_2)
spatial_3 <- as.data.frame(spatial_3)
ca1 <- glass %>% ggplot(aes(x = Ca)) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = 'Untransformed')

ca2 <- spatial_3 %>% ggplot(aes(x = Ca)) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = 'Spatial Transform')

ca3 <- glass %>% ggplot(aes(x = 1/sqrt(Ca))) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = '1/sqrt Transform')

ba1 <- glass %>% ggplot(aes(x = Ba)) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = 'Untransformed')

ba2 <- spatial_3 %>% ggplot(aes(x = Ba)) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = 'Spatial Transform')

ba3 <- glass %>% ggplot(aes(x = log(Ba))) +
  geom_histogram(fill = 'steelblue') +
  theme_classic() +
  labs(title = 'Log Transform')

# plot comparisons (patchwork)
ca1 | ca2 | ca3
ba1 | ba2 | ba3
The following scatter plot matrices are included to show pair-wise comparisons between untransformed and spatialSign transformed covariates. The first matrix displays the untransformed observations.
#based on http://rismyhammer.com/ml/OutliersSpatialSign.html
#Plot spatial sign transformation using caret. Col 10 = 'Type'
trellis.par.set(theme = col.whitebg(), warn = FALSE)
featurePlot(glass[,-10], glass[,10], "pairs", auto.key = list(columns = 10))
featurePlot(spatialSign(scale(glass[,-10])), glass[,10], "pairs", auto.key = list(columns = 10))
3.2. Soybean data can also be found at the UC Irvine Machine Learning Repository. Data were collected to predict disease in 683 soybeans. The 35 predictors are mostly categorical and include information on the environmental conditions (e.g., temperature, precipitation) and plant conditions (e.g., leaf spots, mold growth). The outcome labels consist of 19 distinct classes.
# load soybean data
data(Soybean)
soybean <- Soybean
# print selection of dataset
soybean %>%
  head(3) %>%
  flextable() %>%
  set_caption("Soybean Dataset: Subset of Observations")
Class | date | plant.stand | precip | temp | hail | crop.hist | area.dam | sever | seed.tmt | germ | plant.growth | leaves | leaf.halo | leaf.marg | leaf.size | leaf.shread | leaf.malf | leaf.mild | stem | lodging | stem.cankers | canker.lesion | fruiting.bodies | ext.decay | mycelium | int.discolor | sclerotia | fruit.pods | fruit.spots | seed | mold.growth | seed.discolor | seed.size | shriveling | roots |
diaporthe-stem-canker | 6 | 0 | 2 | 1 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 1 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 4 | 0 | 2 | 1 | 0 | 2 | 0 | 2 | 1 | 1 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 1 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
diaporthe-stem-canker | 3 | 0 | 2 | 1 | 0 | 1 | 0 | 2 | 1 | 2 | 1 | 1 | 0 | 2 | 2 | 0 | 0 | 0 | 1 | 0 | 3 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 4 | 0 | 0 | 0 | 0 | 0 | 0 |
3.2a. Investigate the frequency distributions for the categorical predictors.
The following table and plots provide a summary of the frequency distributions. They provide a quick means to screen for low/zero variance covariates (discussed in the next prompt) as well as class imbalance.
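As a quick check on the outcome itself, a simple count of the class labels highlights the imbalance across the 19 classes (a minimal sketch):
# sketch: class balance of the outcome variable
soybean %>% count(Class, sort = TRUE)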
# calculate frequency distributions of categorical predictor variables
soybean %>%
  select(!Class) %>%
  diagnose_category() %>%
  flextable() %>%
  set_caption("Soybean: Frequency Statistics for Predictor Variables")
variables | levels | N | freq | ratio | rank |
date | 5 | 683 | 149 | 21.8155198 | 1 |
date | 4 | 683 | 131 | 19.1800878 | 2 |
date | 3 | 683 | 118 | 17.2767204 | 3 |
date | 2 | 683 | 93 | 13.6163982 | 4 |
date | 6 | 683 | 90 | 13.1771596 | 5 |
date | 1 | 683 | 75 | 10.9809663 | 6 |
date | 0 | 683 | 26 | 3.8067350 | 7 |
date | NA | 683 | 1 | 0.1464129 | 8 |
plant.stand | 0 | 683 | 354 | 51.8301611 | 1 |
plant.stand | 1 | 683 | 293 | 42.8989751 | 2 |
plant.stand | NA | 683 | 36 | 5.2708638 | 3 |
precip | 2 | 683 | 459 | 67.2035139 | 1 |
precip | 1 | 683 | 112 | 16.3982430 | 2 |
precip | 0 | 683 | 74 | 10.8345534 | 3 |
precip | NA | 683 | 38 | 5.5636896 | 4 |
temp | 1 | 683 | 374 | 54.7584187 | 1 |
temp | 2 | 683 | 199 | 29.1361640 | 2 |
temp | 0 | 683 | 80 | 11.7130307 | 3 |
temp | NA | 683 | 30 | 4.3923865 | 4 |
hail | 0 | 683 | 435 | 63.6896047 | 1 |
hail | 1 | 683 | 127 | 18.5944363 | 2 |
hail | NA | 683 | 121 | 17.7159590 | 3 |
crop.hist | 2 | 683 | 219 | 32.0644217 | 1 |
crop.hist | 3 | 683 | 218 | 31.9180088 | 2 |
crop.hist | 1 | 683 | 165 | 24.1581259 | 3 |
crop.hist | 0 | 683 | 65 | 9.5168375 | 4 |
crop.hist | NA | 683 | 16 | 2.3426061 | 5 |
area.dam | 1 | 683 | 227 | 33.2357247 | 1 |
area.dam | 3 | 683 | 187 | 27.3792094 | 2 |
area.dam | 2 | 683 | 145 | 21.2298682 | 3 |
area.dam | 0 | 683 | 123 | 18.0087848 | 4 |
area.dam | NA | 683 | 1 | 0.1464129 | 5 |
sever | 1 | 683 | 322 | 47.1449488 | 1 |
sever | 0 | 683 | 195 | 28.5505124 | 2 |
sever | NA | 683 | 121 | 17.7159590 | 3 |
sever | 2 | 683 | 45 | 6.5885798 | 4 |
seed.tmt | 0 | 683 | 305 | 44.6559297 | 1 |
seed.tmt | 1 | 683 | 222 | 32.5036603 | 2 |
seed.tmt | NA | 683 | 121 | 17.7159590 | 3 |
seed.tmt | 2 | 683 | 35 | 5.1244510 | 4 |
germ | 1 | 683 | 213 | 31.1859444 | 1 |
germ | 2 | 683 | 193 | 28.2576867 | 2 |
germ | 0 | 683 | 165 | 24.1581259 | 3 |
germ | NA | 683 | 112 | 16.3982430 | 4 |
plant.growth | 0 | 683 | 441 | 64.5680820 | 1 |
plant.growth | 1 | 683 | 226 | 33.0893119 | 2 |
plant.growth | NA | 683 | 16 | 2.3426061 | 3 |
leaves | 1 | 683 | 606 | 88.7262079 | 1 |
leaves | 0 | 683 | 77 | 11.2737921 | 2 |
leaf.halo | 2 | 683 | 342 | 50.0732064 | 1 |
leaf.halo | 0 | 683 | 221 | 32.3572474 | 2 |
leaf.halo | NA | 683 | 84 | 12.2986823 | 3 |
leaf.halo | 1 | 683 | 36 | 5.2708638 | 4 |
leaf.marg | 0 | 683 | 357 | 52.2693997 | 1 |
leaf.marg | 2 | 683 | 221 | 32.3572474 | 2 |
leaf.marg | NA | 683 | 84 | 12.2986823 | 3 |
leaf.marg | 1 | 683 | 21 | 3.0746706 | 4 |
leaf.size | 1 | 683 | 327 | 47.8770132 | 1 |
leaf.size | 2 | 683 | 221 | 32.3572474 | 2 |
leaf.size | NA | 683 | 84 | 12.2986823 | 3 |
leaf.size | 0 | 683 | 51 | 7.4670571 | 4 |
leaf.shread | 0 | 683 | 487 | 71.3030747 | 1 |
leaf.shread | NA | 683 | 100 | 14.6412884 | 2 |
leaf.shread | 1 | 683 | 96 | 14.0556369 | 3 |
leaf.malf | 0 | 683 | 554 | 81.1127379 | 1 |
leaf.malf | NA | 683 | 84 | 12.2986823 | 2 |
leaf.malf | 1 | 683 | 45 | 6.5885798 | 3 |
leaf.mild | 0 | 683 | 535 | 78.3308931 | 1 |
leaf.mild | NA | 683 | 108 | 15.8125915 | 2 |
leaf.mild | 1 | 683 | 20 | 2.9282577 | 3 |
leaf.mild | 2 | 683 | 20 | 2.9282577 | 3 |
stem | 1 | 683 | 371 | 54.3191801 | 1 |
stem | 0 | 683 | 296 | 43.3382138 | 2 |
stem | NA | 683 | 16 | 2.3426061 | 3 |
lodging | 0 | 683 | 520 | 76.1346999 | 1 |
lodging | NA | 683 | 121 | 17.7159590 | 2 |
lodging | 1 | 683 | 42 | 6.1493411 | 3 |
stem.cankers | 0 | 683 | 379 | 55.4904832 | 1 |
stem.cankers | 3 | 683 | 191 | 27.9648609 | 2 |
stem.cankers | 1 | 683 | 39 | 5.7101025 | 3 |
stem.cankers | NA | 683 | 38 | 5.5636896 | 4 |
stem.cankers | 2 | 683 | 36 | 5.2708638 | 5 |
canker.lesion | 0 | 683 | 320 | 46.8521230 | 1 |
canker.lesion | 2 | 683 | 177 | 25.9150805 | 2 |
canker.lesion | 1 | 683 | 83 | 12.1522694 | 3 |
canker.lesion | 3 | 683 | 65 | 9.5168375 | 4 |
canker.lesion | NA | 683 | 38 | 5.5636896 | 5 |
fruiting.bodies | 0 | 683 | 473 | 69.2532943 | 1 |
fruiting.bodies | NA | 683 | 106 | 15.5197657 | 2 |
fruiting.bodies | 1 | 683 | 104 | 15.2269400 | 3 |
ext.decay | 0 | 683 | 497 | 72.7672035 | 1 |
ext.decay | 1 | 683 | 135 | 19.7657394 | 2 |
ext.decay | NA | 683 | 38 | 5.5636896 | 3 |
ext.decay | 2 | 683 | 13 | 1.9033675 | 4 |
mycelium | 0 | 683 | 639 | 93.5578331 | 1 |
mycelium | NA | 683 | 38 | 5.5636896 | 2 |
mycelium | 1 | 683 | 6 | 0.8784773 | 3 |
int.discolor | 0 | 683 | 581 | 85.0658858 | 1 |
int.discolor | 1 | 683 | 44 | 6.4421669 | 2 |
int.discolor | NA | 683 | 38 | 5.5636896 | 3 |
int.discolor | 2 | 683 | 20 | 2.9282577 | 4 |
sclerotia | 0 | 683 | 625 | 91.5080527 | 1 |
sclerotia | NA | 683 | 38 | 5.5636896 | 2 |
sclerotia | 1 | 683 | 20 | 2.9282577 | 3 |
fruit.pods | 0 | 683 | 407 | 59.5900439 | 1 |
fruit.pods | 1 | 683 | 130 | 19.0336750 | 2 |
fruit.pods | NA | 683 | 84 | 12.2986823 | 3 |
fruit.pods | 3 | 683 | 48 | 7.0278184 | 4 |
fruit.pods | 2 | 683 | 14 | 2.0497804 | 5 |
fruit.spots | 0 | 683 | 345 | 50.5124451 | 1 |
fruit.spots | NA | 683 | 106 | 15.5197657 | 2 |
fruit.spots | 4 | 683 | 100 | 14.6412884 | 3 |
fruit.spots | 1 | 683 | 75 | 10.9809663 | 4 |
fruit.spots | 2 | 683 | 57 | 8.3455344 | 5 |
seed | 0 | 683 | 476 | 69.6925329 | 1 |
seed | 1 | 683 | 115 | 16.8374817 | 2 |
seed | NA | 683 | 92 | 13.4699854 | 3 |
mold.growth | 0 | 683 | 524 | 76.7203514 | 1 |
mold.growth | NA | 683 | 92 | 13.4699854 | 2 |
mold.growth | 1 | 683 | 67 | 9.8096633 | 3 |
seed.discolor | 0 | 683 | 513 | 75.1098097 | 1 |
seed.discolor | NA | 683 | 106 | 15.5197657 | 2 |
seed.discolor | 1 | 683 | 64 | 9.3704246 | 3 |
seed.size | 0 | 683 | 532 | 77.8916545 | 1 |
seed.size | NA | 683 | 92 | 13.4699854 | 2 |
seed.size | 1 | 683 | 59 | 8.6383602 | 3 |
shriveling | 0 | 683 | 539 | 78.9165447 | 1 |
shriveling | NA | 683 | 106 | 15.5197657 | 2 |
shriveling | 1 | 683 | 38 | 5.5636896 | 3 |
roots | 0 | 683 | 551 | 80.6734993 | 1 |
roots | 1 | 683 | 86 | 12.5915081 | 2 |
roots | NA | 683 | 31 | 4.5387994 | 3 |
roots | 2 | 683 | 15 | 2.1961933 | 4 |
# plot frequency distributions for categorical predictors
soybean %>%
  select(c(1)) %>%
  plot_bar_category(typographic = FALSE, each = FALSE)

soybean %>%
  select(c(2:12)) %>%
  plot_bar_category(typographic = FALSE, each = FALSE)

soybean %>%
  select(c(13:24)) %>%
  plot_bar_category(typographic = FALSE, each = FALSE)

soybean %>%
  select(c(25:36)) %>%
  plot_bar_category(typographic = FALSE, each = FALSE)
Are any of the distributions degenerate in the ways discussed earlier in this chapter?
A degenerate distribution is one in which a random variable takes only a single value. The Soybean dataset does not include covariates with strictly degenerate distributions. However, several covariates have near-zero variance (nearly degenerate distributions), i.e., almost all observations fall in a single level of the covariate. These include 'mycelium' and 'sclerotia'.
# identify near-zero-variance covariates: levels accounting for more than 90% of observations
soybean %>%
  diagnose_category() %>%
  filter(ratio > 90) %>%
  arrange(desc(ratio)) %>%
  flextable()
variables | levels | N | freq | ratio | rank |
mycelium | 0 | 683 | 639 | 93.55783 | 1 |
sclerotia | 0 | 683 | 625 | 91.50805 | 1 |
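caret's nearZeroVar() provides an equivalent screen based on frequency ratios and the percentage of unique values; a minimal sketch using the default thresholds:
# sketch: flag near-zero-variance predictors with caret (default thresholds assumed)
nzv <- nearZeroVar(soybean %>% select(-Class), saveMetrics = TRUE)
nzv[nzv$nzv, ]   # expected to include mycelium and sclerotia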
3.2b. Roughly 18 % of the data are missing. Are there particular predictors that are more likely to be missing? Is the pattern of missing data related to the classes?
The propensity for missingness among predictors likely owes more to aspects of data collection than to anything we can infer from the available data alone. Variables that are difficult or costly to collect are more apt to have high proportions of missingness.
In the Soybean dataset, the pattern of missing data is clearly related to the Class variable. For example, the following class levels have very high levels of missingness: 2-4-d-injury, cyst-nematode, diaporthe-pod-&-stem-blight, herbicide-injury, and phytophthora-rot.
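A simple per-class summary of the share of missing cells makes this pattern concrete (a minimal sketch; column 1 of the data is the Class label):
# sketch: proportion of missing predictor cells within each class
miss_by_class <- sapply(split(soybean[, -1], soybean$Class),
                        function(d) mean(is.na(d)))
sort(miss_by_class, decreasing = TRUE)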
# calculate missingness statistics across variables
soybean %>%
  diagnose() %>%
  dplyr::select(-unique_count, -unique_rate) %>%
  filter(missing_count > 0) %>%
  arrange(desc(missing_count)) %>%
  flextable() %>%
  set_caption("Missing Data Summary: Soybean")
variables | types | missing_count | missing_percent |
hail | factor | 121 | 17.7159590 |
sever | factor | 121 | 17.7159590 |
seed.tmt | factor | 121 | 17.7159590 |
lodging | factor | 121 | 17.7159590 |
germ | ordered | 112 | 16.3982430 |
leaf.mild | factor | 108 | 15.8125915 |
fruiting.bodies | factor | 106 | 15.5197657 |
fruit.spots | factor | 106 | 15.5197657 |
seed.discolor | factor | 106 | 15.5197657 |
shriveling | factor | 106 | 15.5197657 |
leaf.shread | factor | 100 | 14.6412884 |
seed | factor | 92 | 13.4699854 |
mold.growth | factor | 92 | 13.4699854 |
seed.size | factor | 92 | 13.4699854 |
leaf.halo | factor | 84 | 12.2986823 |
leaf.marg | factor | 84 | 12.2986823 |
leaf.size | ordered | 84 | 12.2986823 |
leaf.malf | factor | 84 | 12.2986823 |
fruit.pods | factor | 84 | 12.2986823 |
precip | ordered | 38 | 5.5636896 |
stem.cankers | factor | 38 | 5.5636896 |
canker.lesion | factor | 38 | 5.5636896 |
ext.decay | factor | 38 | 5.5636896 |
mycelium | factor | 38 | 5.5636896 |
int.discolor | factor | 38 | 5.5636896 |
sclerotia | factor | 38 | 5.5636896 |
plant.stand | ordered | 36 | 5.2708638 |
roots | factor | 31 | 4.5387994 |
temp | ordered | 30 | 4.3923865 |
crop.hist | factor | 16 | 2.3426061 |
plant.growth | factor | 16 | 2.3426061 |
stem | factor | 16 | 2.3426061 |
date | factor | 1 | 0.1464129 |
area.dam | factor | 1 | 0.1464129 |
#plot missingness in relation to Class variable -- from Naniar package
gg_miss_fct(soybean, fct = Class)+labs(title='Proportion of Missing Data in Relation to Class Variable')
3.2c. Develop a strategy for handling missing data, either by eliminating predictors or imputation.
Given that the proportion of missing data for any single covariate is relatively low (under roughly 18%), I would be inclined to impute values using multivariate imputation by chained equations (MICE), which tends to work well for categorical variables. I would also evaluate options to reduce the number of covariates prior to imputation, such as dropping covariates with degenerate or near-zero-variance distributions (e.g., mycelium, sclerotia) and dropping covariates that are highly correlated with other predictors; the latter can be evaluated via pairwise correlations and/or variance inflation factors.
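A minimal sketch of that imputation step, assuming the mice package (not loaded above) and its default imputation methods for factors (e.g., logistic/polytomous regression); the values of m, maxit, and the seed are illustrative choices:
# sketch: multivariate imputation by chained equations (assumes the mice package)
# Class has no missing values, so it is used only as a predictor in the chained equations
library(mice)
imp <- mice(soybean, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
soybean_imputed <- mice::complete(imp, 1)   # first of the m completed datasets
colSums(is.na(soybean_imputed))             # verify that no missing values remain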
There are other options for identifying and dropping covariates with low predictive value (e.g., PCA, Information Gain). I might defer to one or more of these if the options above do not yield satisfactory results.